Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

cs.CL [Total: 110]
cs.CV [Total: 266]
cs.AI [Total: 66]
cs.SD [Total: 11]
cs.LG [Total: 221]
cs.MA [Total: 10]
cs.MM [Total: 4]
eess.AS [Total: 6]
eess.IV [Total: 16]

cs.CL

[1] Open-Source Multimodal Moxin Models with Moxin-VLM and Moxin-VLA

Pu Zhao, Xuan Shen, Zhenglun Kong, Yixin Shen, Sung-En Chang, Arash Akbari, Timothy Rupprecht, Lei Lu, Enfu Nan, Changdi Yang, Yumei He, Weiyan Shi, Xingchen Xu, Yu Huang, Wei Jiang, Wei Wang, Yue Chen, Yong He, Yanzhi Wang

Main category: cs.CL

TL;DR: Moxin 7B is a fully open-source LLM that goes beyond just sharing model weights to provide complete transparency in training, datasets, and implementation details, with three specialized variants for vision-language, vision-language-action, and Chinese capabilities.

Details

Motivation: While proprietary LLMs like GPT-4 have gained attention and open-source models like LLaMA have increased LLM popularity, there's a need for truly transparent open-source models that foster inclusive collaboration and sustain a healthy open-source ecosystem beyond just sharing model weights.

Method: Developed Moxin 7B according to the Model Openness Framework with complete transparency in training, datasets, and implementation details. Created three specialized variants: Moxin-VLM for vision-language tasks, Moxin-VLA for vision-language-action tasks, and Moxin-Chinese for Chinese language capabilities.

Result: The models achieve superior performance in various evaluations. All training uses open-source frameworks and open data, with full release of models, data, and code to enable community access and collaboration.

Conclusion: Moxin represents a significant step toward fully transparent open-source LLMs that can sustain a healthy open-source ecosystem by providing complete access to training processes, data, and implementation details, enabling more inclusive research collaboration.

Abstract: Recently, Large Language Models (LLMs) have undergone a significant transformation, marked by a rapid rise in both their popularity and capabilities. Leading this evolution are proprietary LLMs like GPT-4 and GPT-o1, which have captured widespread attention in the AI community due to their remarkable performance and versatility. Simultaneously, open-source LLMs, such as LLaMA and Mistral, have made great contributions to the ever-increasing popularity of LLMs due to the ease to customize and deploy the models across diverse applications. Moxin 7B is introduced as a fully open-source LLM developed in accordance with the Model Openness Framework, which moves beyond the simple sharing of model weights to embrace complete transparency in training, datasets, and implementation detail, thus fostering a more inclusive and collaborative research environment that can sustain a healthy open-source ecosystem. To further equip Moxin with various capabilities in different tasks, we develop three variants based on Moxin, including Moxin-VLM, Moxin-VLA, and Moxin-Chinese, which target the vision-language, vision-language-action, and Chinese capabilities, respectively. Experiments show that our models achieve superior performance in various evaluations. We adopt open-source framework and open data for the training. We release our models, along with the available data and code to derive these models.

[2] Hierarchical Geometry of Cognitive States in Transformer Embedding Spaces

Sophie Zhao

Main category: cs.CL

TL;DR: Transformer sentence embeddings encode hierarchical cognitive structure aligned with human-annotated psychological attributes, decodable via linear/nonlinear probes beyond surface word statistics.

Details

Motivation: While transformer language models show rich geometric structure in embeddings, it's unclear if they encode higher-level cognitive organization aligned with human-interpretable psychological attributes.

Method: Created dataset of 480 sentences annotated with continuous energy scores and discrete tier labels across 7 cognitive categories. Used fixed embeddings from multiple transformer models, evaluated recoverability via linear and shallow nonlinear probes, compared to TF-IDF baselines, and conducted nonparametric permutation tests.

Result: Both continuous scores and tier labels were reliably decodable across models, with nonlinear probes outperforming linear ones. TF-IDF baselines performed worse, indicating structure beyond surface statistics. UMAP visualizations showed smooth gradients and adjacent-tier confusions in embedding space.

Conclusion: Transformer embedding spaces exhibit hierarchical geometric organization aligned with human-defined cognitive attributes, though this doesn’t imply internal awareness or phenomenology.

Abstract: Recent work has shown that transformer-based language models learn rich geometric structure in their embedding spaces, yet the presence of higher-level cognitive organization within these representations remains underexplored. In this work, we investigate whether sentence embeddings encode a graded, hierarchical structure aligned with human-interpretable cognitive or psychological attributes. We construct a dataset of 480 natural-language sentences annotated with continuous ordinal energy scores and discrete tier labels spanning seven ordered cognitive categories. Using fixed sentence embeddings from multiple transformer models, we evaluate the recoverability of these annotations via linear and shallow nonlinear probes. Across models, both continuous scores and tier labels are reliably decodable, with shallow nonlinear probes providing consistent performance gains over linear probes. Lexical TF-IDF baselines perform substantially worse, indicating that the observed structure is not attributable to surface word statistics alone. Nonparametric permutation tests further confirm that probe performance exceeds chance under label-randomization nulls. Qualitative analyses using UMAP visualizations and confusion matrices reveal smooth low-to-high gradients and predominantly adjacent-tier confusions in embedding space. Taken together, these results provide evidence that transformer embedding spaces exhibit a hierarchical geometric organization aligned with human-defined cognitive attributes, while remaining agnostic to claims of internal awareness or phenomenology.

[3] SmartSnap: Proactive Evidence Seeking for Self-Verifying Agents

Shaofei Cai, Yulei Qin, Haojia Lin, Zihan Xu, Gang Li, Yuchen Shi, Zongyi Li, Yong Mao, Siqi Cai, Xiaoyu Tan, Yitao Liang, Ke Li, Xing Sun

Main category: cs.CL

TL;DR: SmartSnap introduces self-verifying agents that proactively collect snapshot evidence during task execution, enabling scalable agentic RL by reducing verification costs and improving reliability.

Details

Motivation: Current agentic RL faces scalability issues due to expensive and unreliable post-hoc task verification that processes verbose, noisy interaction histories, creating prohibitive costs and low reliability.

Method: Proposes Self-Verifying Agents with dual missions: complete tasks AND prove accomplishment using curated snapshot evidence. Agents follow 3C Principles (Completeness, Conciseness, Creativity) to collect minimal, decisive snapshots during execution, which are then evaluated by LLM-as-a-Judge.

Result: Experiments on mobile tasks show SmartSnap enables scalable LLM agent training with performance gains up to 26.08% for 8B models and 16.66% for 30B models. Self-verifying agents achieve competitive performance against much larger models like DeepSeek V3.1 and Qwen3-235B-A22B.

Conclusion: SmartSnap represents a paradigm shift from passive post-hoc verification to proactive in-situ self-verification, creating efficient agents that synergize solution finding with evidence seeking, enabling scalable agentic RL development.

Abstract: Agentic reinforcement learning (RL) holds great promise for the development of autonomous agents under complex GUI tasks, but its scalability remains severely hampered by the verification of task completion. Existing task verification is treated as a passive, post-hoc process: a verifier (i.e., rule-based scoring script, reward or critic model, and LLM-as-a-Judge) analyzes the agent’s entire interaction trajectory to determine if the agent succeeds. Such processing of verbose context that contains irrelevant, noisy history poses challenges to the verification protocols and therefore leads to prohibitive cost and low reliability. To overcome this bottleneck, we propose SmartSnap, a paradigm shift from this passive, post-hoc verification to proactive, in-situ self-verification by the agent itself. We introduce the Self-Verifying Agent, a new type of agent designed with dual missions: to not only complete a task but also to prove its accomplishment with curated snapshot evidences. Guided by our proposed 3C Principles (Completeness, Conciseness, and Creativity), the agent leverages its accessibility to the online environment to perform self-verification on a minimal, decisive set of snapshots. Such evidences are provided as the sole materials for a general LLM-as-a-Judge verifier to determine their validity and relevance. Experiments on mobile tasks across model families and scales demonstrate that our SmartSnap paradigm allows training LLM-driven agents in a scalable manner, bringing performance gains up to 26.08% and 16.66% respectively to 8B and 30B models. The synergizing between solution finding and evidence seeking facilitates the cultivation of efficient, self-verifying agents with competitive performance against DeepSeek V3.1 and Qwen3-235B-A22B.

[4] The Syntax of qulk-clauses in Yemeni Ibbi Arabic: A Minimalist Approach

Zubaida Mohammed Albadani, Mohammed Q. Shormani

Main category: cs.CL

TL;DR: Analysis of qulk-clauses in Yemeni Ibbi Arabic within Minimalist Program, proposing they are biclausal structures where ‘qulk’ functions as clause-embedding predicate selecting CP complement.

Details

Motivation: To investigate the syntactic structure of qulk-clauses in Yemeni Ibbi Arabic and contribute to generative syntax theory, particularly Minimalist Program, by analyzing this dialect-specific construction.

Method: Applying core minimalist operations (Merge, Move, Agree, Spell-out) with layered syntactic analysis, including post-syntactic processes like Morphological Merger, to derive qulk-clauses.

Result: Qulk-clauses are analyzed as biclausal structures where ‘qulk’ functions as clause-embedding predicate selecting CP complement, accounting for dialect-specific features like bipartite negation, cliticization, and CP embedding.

Conclusion: The study contributes to generative syntax/minimalism, raises questions about extending analysis to ‘kil-k’ (you said), and provides insights into minimalism’s universality.

Abstract: This study investigates the syntax of qulk-clauses in Yemeni Ibbi Arabic (YIA) within the Minimalist Program. The construction qulk-clause, a morphologically fused form meaning ‘I said,’ introduces embedded declarative interrogative, and imperative clauses, often eithout complementizer. The central proposal of this paper is that qulk-clauses are biclausal structures in which qulk functions a clause-embedding predicate sec;ecting a dull CP complement. By applying core minimalist operations, viz., Merge, Move, Agree, and Spell-out, the study provides a layered syntactic analysis of qulk-clauses, for illustrating how their derivation proceeds through standard computational steps and post-syntactic processes such as Morphological Merger. The proposal also accounts for dialect-specific features like bipartite negation, cliticization, and CP embedding. The findings offer theoretical contributions to generative syntax, specifically minimalism. The study concludes raising theoretical questions concerning extending the analysis to the addressee-clause kil-k ‘you said’. It also provides insights into the possibility of the universality of minimalism.

[5] Towards Efficient Post-Training via Fourier-Driven Adapter Architectures

Donggyun Bae, Jongil Park

Main category: cs.CL

TL;DR: FAA (Fourier-Activated Adapter) is a parameter-efficient fine-tuning method that uses random Fourier features to decompose representations into frequency components for selective modulation during adaptation.

Details

Motivation: To develop a more effective parameter-efficient fine-tuning approach for large pre-trained language models that can selectively modulate semantic information through frequency-aware mechanisms while maintaining low computational overhead.

Method: Incorporates random Fourier features into lightweight adapter modules to decompose intermediate representations into complementary low- and high-frequency components, enabling frequency-aware modulation of semantic information with adaptive weighting mechanisms.

Result: Consistently achieves competitive or superior performance on GLUE, E2E NLG, and instruction-tuning benchmarks compared to existing parameter-efficient fine-tuning methods, while maintaining low computational and memory overhead.

Conclusion: FAA provides a robust and efficient approach for post-training large language models, with ablation studies confirming the effectiveness of its frequency-aware activation and adaptive weighting mechanisms.

Abstract: We propose a novel framework, termed Fourier-Activated Adapter (FAA), for parameter-efficient fine-tuning of large pre-trained language models. By incorporating random Fourier features into lightweight adapter modules, FAA decomposes intermediate representations into complementary low- and high-frequency components, enabling frequency-aware modulation of semantic information. This design allows the model to selectively emphasize informative frequency bands during adaptation while preserving the representational capacity of the frozen backbone. Extensive experiments on GLUE, E2E NLG, and instruction-tuning benchmarks demonstrate that FAA consistently achieves competitive or superior performance compared to existing parameter-efficient fine-tuning methods, while maintaining low computational and memory overhead. Ablation studies further verify the effectiveness of frequency-aware activation and adaptive weighting mechanisms, highlighting FAA as a robust and efficient approach for post-training large language models.

[6] LLM-Guided Exemplar Selection for Few-Shot Wearable-Sensor Human Activity Recognition

Elsen Ronando, Sozo Inoue

Main category: cs.CL

TL;DR: LLM-Guided Exemplar Selection framework improves few-shot Human Activity Recognition by combining semantic reasoning from LLMs with geometric and structural cues for better exemplar selection.

Details

Motivation: Current HAR methods rely on large labeled datasets and purely geometric exemplar selection, which fails to distinguish similar wearable sensor activities like walking, walking upstairs, and walking downstairs.

Method: Uses LLM-generated knowledge prior capturing feature importance, inter-class confusability, and exemplar budget multipliers to guide exemplar scoring and selection. Combines with margin-based validation cues, PageRank centrality, hubness penalization, and facility-location optimization.

Result: Achieves macro F1-score of 88.78% on UCI-HAR dataset under strict few-shot conditions, outperforming classical approaches like random sampling, herding, and k-center.

Conclusion: LLM-derived semantic priors, when integrated with structural and geometric cues, provide a stronger foundation for selecting representative sensor exemplars in few-shot wearable-sensor HAR.

Abstract: In this paper, we propose an LLM-Guided Exemplar Selection framework to address a key limitation in state-of-the-art Human Activity Recognition (HAR) methods: their reliance on large labeled datasets and purely geometric exemplar selection, which often fail to distinguish similar weara-ble sensor activities such as walking, walking upstairs, and walking downstairs. Our method incorporates semantic reasoning via an LLM-generated knowledge prior that captures feature importance, inter-class confusability, and exemplar budget multipliers, and uses it to guide exemplar scoring and selection. These priors are combined with margin-based validation cues, PageRank centrality, hubness penalization, and facility-location optimization to obtain a compact and informative set of exemplars. Evaluated on the UCI-HAR dataset under strict few-shot conditions, the framework achieves a macro F1-score of 88.78%, outperforming classical approaches such as random sampling, herding, and $k$-center. The results show that LLM-derived semantic priors, when integrated with structural and geometric cues, provide a stronger foundation for selecting representative sensor exemplars in few-shot wearable-sensor HAR.

[7] Dub-S2ST: Textless Speech-to-Speech Translation for Seamless Dubbing

Jeongsoo Choi, Jaehun Kim, Joon Son Chung

Main category: cs.CL

TL;DR: Cross-lingual dubbing system that translates speech while preserving duration, speaker identity, and speaking speed using discrete diffusion and flow matching models.

Details

Motivation: Existing speech translation approaches overlook transfer of speech patterns, causing mismatches with source speech and limiting suitability for dubbing applications.

Method: Discrete diffusion-based speech-to-unit translation with explicit duration control, followed by speech synthesis using conditional flow matching model. Includes unit-based speed adaptation mechanism for rate consistency.

Result: System generates natural and fluent translations aligned with original speech’s duration and speaking pace while achieving competitive translation performance.

Conclusion: Proposed framework enables effective cross-lingual dubbing by preserving key speech characteristics without relying on text, with code publicly available.

Abstract: This paper introduces a cross-lingual dubbing system that translates speech from one language to another while preserving key characteristics such as duration, speaker identity, and speaking speed. Despite the strong translation quality of existing speech translation approaches, they often overlook the transfer of speech patterns, leading to mismatches with source speech and limiting their suitability for dubbing applications. To address this, we propose a discrete diffusion-based speech-to-unit translation model with explicit duration control, enabling time-aligned translation. We then synthesize speech based on the translated units and source speaker’s identity using a conditional flow matching model. Additionally, we introduce a unit-based speed adaptation mechanism that guides the translation model to produce speech at a rate consistent with the source, without relying on any text. Extensive experiments demonstrate that our framework generates natural and fluent translations that align with the original speech’s duration and speaking pace, while achieving competitive translation performance. The code is available at https://github.com/kaistmm/Dub-S2ST.

[8] Style Amnesia: Investigating Speaking Style Degradation and Mitigation in Multi-Turn Spoken Language Models

Yu-Xiang Lin, Cheng-Han Chiang, Hung-yi Lee

Main category: cs.CL

TL;DR: SLMs suffer from “style amnesia” - they cannot maintain instructed speaking styles (emotion, accent, volume, speed) across multi-turn conversations, despite being able to recall the instructions when asked.

Details

Motivation: To investigate whether spoken language models can consistently maintain paralinguistic speaking styles (emotion, accent, volume, speaking speed) throughout multi-turn conversations when instructed to do so, revealing a fundamental limitation in current SLM capabilities.

Method: Evaluated three proprietary and two open-source SLMs by instructing them to speak in specific styles at the beginning of conversations and testing their ability to maintain these styles across multiple turns. Also tested style instruction recall and experimented with various prompting strategies including system vs user message placement.

Result: No evaluated SLM could maintain consistent speaking styles across conversations (style amnesia). Models could recall style instructions when asked but failed to express them. Explicit recall requests partially mitigated the issue. SLMs struggled more with system message instructions than user messages, contrary to intended system prompt functionality.

Conclusion: Current SLMs have a fundamental limitation in maintaining paralinguistic speaking styles across multi-turn conversations, revealing a gap between instruction recall and actual expression. This “style amnesia” problem requires architectural or training improvements for consistent style maintenance in conversational AI.

Abstract: In this paper, we show that when spoken language models (SLMs) are instructed to speak in a specific speaking style at the beginning of a multi-turn conversation, they cannot maintain the required speaking styles after several turns of interaction; we refer to this as the style amnesia of SLMs. We focus on paralinguistic speaking styles, including emotion, accent, volume, and speaking speed. We evaluate three proprietary and two open-source SLMs, demonstrating that none of these models can maintain a consistent speaking style when instructed to do so. We further show that when SLMs are asked to recall the style instruction in later turns, they can recall the style instruction, but they fail to express it throughout the conversation. We also show that explicitly asking the model to recall the style instruction can partially mitigate style amnesia. In addition, we examine various prompting strategies and find that SLMs struggle to follow the required style when the instruction is placed in system messages rather than user messages, which contradicts the intended function of system prompts.

[9] Hallucination Detection and Evaluation of Large Language Model

Chenggong Zhang, Haopeng Wang

Main category: cs.CL

TL;DR: HHEM framework reduces hallucination evaluation time from 8 hours to 10 minutes while maintaining high accuracy (82.2%) through lightweight classification, though struggles with localized hallucinations in summarization tasks.

Details

Motivation: LLM hallucinations generate misleading content that undermines trust. Existing evaluation methods like KnowHalu are computationally expensive with multi-stage verification, creating a need for more efficient detection approaches.

Method: Introduces Hughes Hallucination Evaluation Model (HHEM), a lightweight classification-based framework that operates independently of LLM-based judgments. Uses segment-based retrieval for localized hallucination detection in summarization tasks. Evaluates across QA and summarization tasks using TPR, TNR, and Accuracy metrics.

Result: HHEM reduces evaluation time from 8 hours to 10 minutes. HHEM with non-fabrication checking achieves highest accuracy (82.2%) and TPR (78.9%). Larger models (7B-9B parameters) show fewer hallucinations, while intermediate models exhibit higher instability. Segment-based retrieval improves localized hallucination detection.

Conclusion: Need for structured evaluation frameworks that balance computational efficiency with robust factual validation. HHEM provides efficient hallucination detection but requires improvements for localized hallucinations in summarization tasks.

Abstract: Hallucinations in Large Language Models (LLMs) pose a significant challenge, generating misleading or unverifiable content that undermines trust and reliability. Existing evaluation methods, such as KnowHalu, employ multi-stage verification but suffer from high computational costs. To address this, we integrate the Hughes Hallucination Evaluation Model (HHEM), a lightweight classification-based framework that operates independently of LLM-based judgments, significantly improving efficiency while maintaining high detection accuracy. We conduct a comparative analysis of hallucination detection methods across various LLMs, evaluating True Positive Rate (TPR), True Negative Rate (TNR), and Accuracy on question-answering (QA) and summarization tasks. Our results show that HHEM reduces evaluation time from 8 hours to 10 minutes, while HHEM with non-fabrication checking achieves the highest accuracy (82.2%) and TPR (78.9%). However, HHEM struggles with localized hallucinations in summarization tasks. To address this, we introduce segment-based retrieval, improving detection by verifying smaller text components. Additionally, our cumulative distribution function (CDF) analysis indicates that larger models (7B-9B parameters) generally exhibit fewer hallucinations, while intermediate-sized models show higher instability. These findings highlight the need for structured evaluation frameworks that balance computational efficiency with robust factual validation, enhancing the reliability of LLM-generated content.

[10] PROFASR-BENCH: A Benchmark for Context-Conditioned ASR in High-Stakes Professional Speech

Deepak Babu Piskala

Main category: cs.CL

TL;DR: ProfASR-Bench is a professional speech recognition benchmark that reveals a “context-utilization gap” - current promptable ASR systems fail to effectively use available contextual information despite being designed to do so.

Details

Motivation: Existing ASR benchmarks don't adequately address professional settings with dense terminology, formal register variation, and critical entity errors. There's a need for evaluation that measures how well ASR systems can leverage contextual information in high-stakes domains.

Method: Created ProfASR-Bench, a professional-talk evaluation suite across finance, medicine, legal, and technology domains. Each example pairs natural-language prompts (domain cues/speaker profiles) with entity-rich target utterances. Tested Whisper and Qwen-Omni models under various context conditions (no-context, profile, domain+profile, oracle, adversarial).

Result: Found consistent pattern: lightweight textual context produces little to no change in average WER, even with oracle prompts. Adversarial prompts don’t reliably degrade performance. This reveals a “context-utilization gap” where current systems are promptable but underuse available side information.

Conclusion: ProfASR-Bench provides standardized evaluation with context ladder, entity/slice-aware reporting, and reproducible testbed for comparing fusion strategies. The context-utilization gap highlights the need for better context utilization in professional ASR applications.

Abstract: Automatic Speech Recognition (ASR) in professional settings faces challenges that existing benchmarks underplay: dense domain terminology, formal register variation, and near-zero tolerance for critical entity errors. We present ProfASR-Bench, a professional-talk evaluation suite for high-stakes applications across finance, medicine, legal, and technology. Each example pairs a natural-language prompt (domain cue and/or speaker profile) with an entity-rich target utterance, enabling controlled measurement of context-conditioned recognition. The corpus supports conventional ASR metrics alongside entity-aware scores and slice-wise reporting by accent and gender. Using representative families Whisper (encoder-decoder ASR) and Qwen-Omni (audio language models) under matched no-context, profile, domain+profile, oracle, and adversarial conditions, we find a consistent pattern: lightweight textual context produces little to no change in average word error rate (WER), even with oracle prompts, and adversarial prompts do not reliably degrade performance. We term this the context-utilization gap (CUG): current systems are nominally promptable yet underuse readily available side information. ProfASR-Bench provides a standardized context ladder, entity- and slice-aware reporting with confidence intervals, and a reproducible testbed for comparing fusion strategies across model families. Dataset: https://huggingface.co/datasets/prdeepakbabu/ProfASR-Bench Code: https://github.com/prdeepakbabu/ProfASR-Bench

[11] HiFi-RAG: Hierarchical Content Filtering and Two-Pass Generation for Open-Domain RAG

Cattalyya Nuengsigkapian

Main category: cs.CL

TL;DR: HiFi-RAG is a hierarchical filtering RAG system that uses multi-stage processing with Gemini models to improve retrieval relevance and answer quality, winning the MMU-RAGent NeurIPS 2025 competition.

Details

Motivation: Standard RAG systems struggle with irrelevant retrieved information and misalignment between generated answers and user intent, especially in open-domain settings.

Method: Multi-stage pipeline using Gemini 2.5 Flash for query formulation, hierarchical content filtering, and citation attribution, while reserving Gemini 2.5 Pro for final answer generation. Combines speed/cost efficiency with reasoning capabilities.

Result: Outperformed baseline on MMU-RAGent validation set: ROUGE-L 0.274 (+19.6%), DeBERTaScore 0.677 (+6.2%). On Test2025 (post-cutoff knowledge questions), outperformed parametric baseline by 57.4% in ROUGE-L and 14.9% in DeBERTaScore.

Conclusion: HiFi-RAG demonstrates that hierarchical filtering with strategic model allocation (Flash for filtering, Pro for generation) significantly improves RAG performance on both standard and post-cutoff knowledge tasks.

Abstract: Retrieval-Augmented Generation (RAG) in open-domain settings faces significant challenges regarding irrelevant information in retrieved documents and the alignment of generated answers with user intent. We present HiFi-RAG (Hierarchical Filtering RAG), the winning closed-source system in the Text-to-Text static evaluation of the MMU-RAGent NeurIPS 2025 Competition. Our approach moves beyond standard embedding-based retrieval via a multi-stage pipeline. We leverage the speed and cost-efficiency of Gemini 2.5 Flash (4-6x cheaper than Pro) for query formulation, hierarchical content filtering, and citation attribution, while reserving the reasoning capabilities of Gemini 2.5 Pro for final answer generation. On the MMU-RAGent validation set, our system outperformed the baseline, improving ROUGE-L to 0.274 (+19.6%) and DeBERTaScore to 0.677 (+6.2%). On Test2025, our custom dataset evaluating questions that require post-cutoff knowledge (post January 2025), HiFi-RAG outperforms the parametric baseline by 57.4% in ROUGE-L and 14.9% in DeBERTaScore.

[12] Exploring the Vertical-Domain Reasoning Capabilities of Large Language Models

Jie Zhou, Xin Chen, Jie Zhang, Zhe Li

Main category: cs.CL

TL;DR: This paper evaluates LLMs’ accounting reasoning capabilities, establishes evaluation criteria, tests GLM-series models and GPT-4, finds GPT-4 performs best but current models still need optimization for enterprise accounting applications.

Details

Motivation: LLMs are transforming various domains, but integrating them effectively into professional fields like accounting requires understanding their domain-specific reasoning capabilities. There's a need to redefine the relationship between LLMs and domain-specific applications to promote enterprise digital transformation.

Method: Introduces vertical-domain accounting reasoning concept and establishes evaluation criteria by analyzing GLM-series training data. Evaluates GLM-6B, GLM-130B, GLM-4, and GPT-4 on accounting reasoning tasks using different prompt engineering strategies.

Result: Different prompt strategies improve performance to varying degrees across models. GPT-4 achieves the strongest accounting reasoning capability among tested models. However, current LLMs still fall short of real-world enterprise accounting application requirements.

Conclusion: While GPT-4 shows the best accounting reasoning performance, further optimization is needed for deployment in enterprise-level accounting scenarios to fully realize LLMs’ potential value in this domain. The evaluation framework provides benchmarks for improving accounting reasoning.

Abstract: Large Language Models (LLMs) are reshaping learning paradigms, cognitive processes, and research methodologies across a wide range of domains. Integrating LLMs with professional fields and redefining the relationship between LLMs and domain-specific applications has become a critical challenge for promoting enterprise digital transformation and broader social development. To effectively integrate LLMs into the accounting domain, it is essential to understand their domain-specific reasoning capabilities. This study introduces the concept of vertical-domain accounting reasoning and establishes evaluation criteria by analyzing the training data characteristics of representative GLM-series models. These criteria provide a foundation for subsequent research on reasoning paradigms and offer benchmarks for improving accounting reasoning performance. Based on this framework, we evaluate several representative models, including GLM-6B, GLM-130B, GLM-4, and OpenAI GPT-4, on a set of accounting reasoning tasks. Experimental results show that different prompt engineering strategies lead to varying degrees of performance improvement across models, with GPT-4 achieving the strongest accounting reasoning capability. However, current LLMs still fall short of real-world application requirements. In particular, further optimization is needed for deployment in enterprise-level accounting scenarios to fully realize the potential value of LLMs in this domain.

[13] Fun-Audio-Chat Technical Report

Tongyi Fun Team, Qian Chen, Luyao Cheng, Chong Deng, Xiangang Li, Jiaqing Liu, Chao-Hong Tan, Wen Wang, Junhao Xu, Jieping Ye, Qinglin Zhang, Qiquan Zhang, Jingren Zhou

Main category: cs.CL

TL;DR: Fun-Audio-Chat is a Large Audio Language Model that addresses temporal resolution mismatch and catastrophic forgetting in joint speech-text models through dual-resolution processing and core-cocktail training, achieving competitive performance on audio tasks while retaining text LLM knowledge.

Details

Motivation: Existing joint speech-text models face critical challenges: temporal resolution mismatch between speech tokens (25Hz) and text tokens (~3Hz) dilutes semantic information, incurs high computational costs, and causes catastrophic forgetting of text LLM knowledge.

Method: 1. Dual-Resolution Speech Representations (DRSR): Shared LLM processes audio at efficient 5Hz via token grouping, while Speech Refined Head generates high-quality tokens at 25Hz. 2. Core-Cocktail Training: Two-stage fine-tuning with intermediate merging to mitigate catastrophic forgetting. 3. Multi-Task DPO Training to enhance robustness, audio understanding, instruction-following and voice empathy.

Result: Fun-Audio-Chat 8B and MoE 30B-A3B achieve competitive performance on Speech-to-Text and Speech-to-Speech tasks, ranking top among similar-scale models on Spoken QA benchmarks. They also achieve competitive to superior performance on Audio Understanding, Speech Function Calling, Instruction-Following and Voice Empathy.

Conclusion: Fun-Audio-Chat successfully addresses key limitations of existing joint speech-text models through innovative dual-resolution processing and training techniques, achieving strong audio capabilities while preserving text LLM knowledge, with open-source availability for research and development.

Abstract: Recent advancements in joint speech-text models show great potential for seamless voice interactions. However, existing models face critical challenges: temporal resolution mismatch between speech tokens (25Hz) and text tokens (~3Hz) dilutes semantic information, incurs high computational costs, and causes catastrophic forgetting of text LLM knowledge. We introduce Fun-Audio-Chat, a Large Audio Language Model addressing these limitations via two innovations from our previous work DrVoice. First, Dual-Resolution Speech Representations (DRSR): the Shared LLM processes audio at efficient 5Hz (via token grouping), while the Speech Refined Head generates high-quality tokens at 25Hz, balancing efficiency (~50% GPU reduction) and quality. Second, Core-Cocktail Training, a two-stage fine-tuning with intermediate merging that mitigates catastrophic forgetting. We then apply Multi-Task DPO Training to enhance robustness, audio understanding, instruction-following and voice empathy. This multi-stage post-training enables Fun-Audio-Chat to retain text LLM knowledge while gaining powerful audio understanding, reasoning, and generation. Unlike recent LALMs requiring large-scale audio-text pre-training, Fun-Audio-Chat leverages pre-trained models and extensive post-training. Fun-Audio-Chat 8B and MoE 30B-A3B achieve competitive performance on Speech-to-Text and Speech-to-Speech tasks, ranking top among similar-scale models on Spoken QA benchmarks. They also achieve competitive to superior performance on Audio Understanding, Speech Function Calling, Instruction-Following and Voice Empathy. We develop Fun-Audio-Chat-Duplex, a full-duplex variant with strong performance on Spoken QA and full-duplex interactions. We open-source Fun-Audio-Chat-8B with training and inference code, and provide an interactive demo, at https://github.com/FunAudioLLM/Fun-Audio-Chat .

[14] Constituency Structure over Eojeol in Korean Treebanks

Jungyeul Park, Chulwoo Park

Main category: cs.CL

TL;DR: The paper argues for using eojeol (word) units instead of morphemes as terminals in Korean constituency treebanks to better separate morphology from syntax and align with dependency resources.

Details

Motivation: Current Korean constituency treebanks use morphemes as terminals, which conflates word-internal morphology with phrase-level syntax and creates mismatches with eojeol-based dependency resources. There's a need for better representation that separates these linguistic levels.

Method: Proposes an eojeol-based constituency representation where morphological segmentation and fine-grained POS information are encoded in a separate, non-constituent layer. Performs comparative analysis of Sejong and Penn Korean treebanks under explicit normalization assumptions.

Result: Shows that Sejong and Penn Korean treebanks can be treated as representationally equivalent at the eojeol-based constituency level. Develops an eojeol-based annotation scheme that preserves interpretable constituency while supporting cross-treebank comparison and constituency-dependency conversion.

Conclusion: Eojeol-based constituency representation is preferable for Korean treebanks as it properly separates morphology from syntax, aligns with dependency resources, enables cross-treebank comparability, and supports constituency-dependency conversion while maintaining interpretable constituency structure.

Abstract: The design of Korean constituency treebanks raises a fundamental representational question concerning the choice of terminal units. Although Korean words are morphologically complex, treating morphemes as constituency terminals conflates word internal morphology with phrase level syntactic structure and creates mismatches with eojeol based dependency resources. This paper argues for an eojeol based constituency representation, with morphological segmentation and fine grained part of speech information encoded in a separate, non constituent layer. A comparative analysis shows that, under explicit normalization assumptions, the Sejong and Penn Korean treebanks can be treated as representationally equivalent at the eojeol based constituency level. Building on this result, we outline an eojeol based annotation scheme that preserves interpretable constituency and supports cross treebank comparison and constituency dependency conversion.

[15] ManchuTTS: Towards High-Quality Manchu Speech Synthesis via Flow Matching and Hierarchical Text Representation

Suhua Wang, Zifan Wang, Xiaoxin Sun, D. J. Wang, Zhanbo Liu, Xin Li

Main category: cs.CL

TL;DR: ManchuTTS: A novel TTS system for endangered Manchu language using hierarchical text representation and cross-modal attention to handle agglutination, achieving high-quality synthesis with limited data.

Details

Motivation: Manchu is an endangered language with severe data scarcity and strong phonological agglutination, creating unique challenges for speech synthesis that existing methods cannot adequately address.

Method: Three-tier text representation (phoneme, syllable, prosodic) with cross-modal hierarchical attention; deep convolutional networks integrated with flow-matching Transformer for non-autoregressive generation; hierarchical contrastive loss; data augmentation for low-resource setting.

Result: Achieves MOS of 4.52 using only 5.2-hour training subset; outperforms all baselines; hierarchical guidance improves agglutinative word pronunciation accuracy by 31% and prosodic naturalness by 27%.

Conclusion: ManchuTTS effectively addresses Manchu’s unique linguistic challenges through hierarchical modeling and data strategies, providing a viable solution for low-resource agglutinative language synthesis.

Abstract: As an endangered language, Manchu presents unique challenges for speech synthesis, including severe data scarcity and strong phonological agglutination. This paper proposes ManchuTTS(Manchu Text to Speech), a novel approach tailored to Manchu’s linguistic characteristics. To handle agglutination, this method designs a three-tier text representation (phoneme, syllable, prosodic) and a cross-modal hierarchical attention mechanism for multi-granular alignment. The synthesis model integrates deep convolutional networks with a flow-matching Transformer, enabling efficient, non-autoregressive generation. This method further introduce a hierarchical contrastive loss to guide structured acoustic-linguistic correspondence. To address low-resource constraints, This method construct the first Manchu TTS dataset and employ a data augmentation strategy. Experiments demonstrate that ManchuTTS attains a MOS of 4.52 using a 5.2-hour training subset derived from our full 6.24-hour annotated corpus, outperforming all baseline models by a notable margin. Ablations confirm hierarchical guidance improves agglutinative word pronunciation accuracy (AWPA) by 31% and prosodic naturalness by 27%.

[16] Learning When Not to Attend Globally

Xuan Luo, Kailai Zhang, Xifeng Yan

Main category: cs.CL

TL;DR: AHA (All-or-Here Attention) enables LLMs to dynamically switch between full attention and local sliding window attention using binary routers, reducing up to 93% of full attention operations without performance loss.

Details

Motivation: Inspired by human reading behavior where we focus on current pages and only flip back when needed, the paper aims to make LLM attention more efficient by reducing unnecessary global context access.

Method: Proposes All-or-Here Attention (AHA) with binary routers per attention head that dynamically toggle between full attention and local sliding window attention (256 tokens) for each token.

Result: With 256-token window size, up to 93% of original full attention operations can be replaced by sliding window attention without performance degradation. Analysis reveals long-tail distribution in context dependency.

Conclusion: Full attention is largely redundant; efficient inference only requires on-demand access to global context, similar to human reading patterns.

Abstract: When reading books, humans focus primarily on the current page, flipping back to recap prior context only when necessary. Similarly, we demonstrate that Large Language Models (LLMs) can learn to dynamically determine when to attend to global context. We propose All-or-Here Attention (AHA), which utilizes a binary router per attention head to dynamically toggle between full attention and local sliding window attention for each token. Our results indicate that with a window size of 256 tokens, up to 93% of the original full attention operations can be replaced by sliding window attention without performance loss. Furthermore, by evaluating AHA across various window sizes, we identify a long-tail distribution in context dependency, where the necessity for full attention decays rapidly as the local window expands. By decoupling local processing from global access, AHA reveals that full attention is largely redundant, and that efficient inference requires only on-demand access to the global context.

[17] Structured Prompting and LLM Ensembling for Multimodal Conversational Aspect-based Sentiment Analysis

Zhiqiang Gao, Shihao Gao, Zixing Zhang, Yihao Guo, Hongyu Chen, Jing Han

Main category: cs.CL

TL;DR: LLM-based structured prompting and ensemble methods for multimodal conversational sentiment analysis, achieving 47.38% on sentiment sextuple extraction and 74.12% on sentiment flipping detection.

Details

Motivation: Multimodal conversational sentiment analysis is crucial for building emotionally intelligent AI systems, but extracting comprehensive sentiment components and detecting dynamic sentiment shifts from multi-speaker dialogues is challenging.

Method: For Subtask-I: Designed structured prompting pipeline to guide LLMs to sequentially extract sentiment sextuple (holder, target, aspect, opinion, sentiment, rationale). For Subtask-II: Leveraged complementary strengths of three LLMs through ensembling to identify sentiment transitions and triggers.

Result: Achieved 47.38% average score on Subtask-I (sentiment sextuple extraction) and 74.12% exact match F1 on Subtask-II (sentiment flipping detection), demonstrating effectiveness of step-wise refinement and ensemble strategies.

Conclusion: Structured prompting with LLMs and ensemble methods are effective for complex multimodal sentiment analysis tasks, showing promise for building more emotionally intelligent AI systems that can understand nuanced conversational dynamics.

Abstract: Understanding sentiment in multimodal conversations is a complex yet crucial challenge toward building emotionally intelligent AI systems. The Multimodal Conversational Aspect-based Sentiment Analysis (MCABSA) Challenge invited participants to tackle two demanding subtasks: (1) extracting a comprehensive sentiment sextuple, including holder, target, aspect, opinion, sentiment, and rationale from multi-speaker dialogues, and (2) detecting sentiment flipping, which detects dynamic sentiment shifts and their underlying triggers. For Subtask-I, in the present paper, we designed a structured prompting pipeline that guided large language models (LLMs) to sequentially extract sentiment components with refined contextual understanding. For Subtask-II, we further leveraged the complementary strengths of three LLMs through ensembling to robustly identify sentiment transitions and their triggers. Our system achieved a 47.38% average score on Subtask-I and a 74.12% exact match F1 on Subtask-II, showing the effectiveness of step-wise refinement and ensemble strategies in rich, multimodal sentiment analysis tasks.

[18] Chain-of-thought Reviewing and Correction for Time Series Question Answering

Chen Su, Yuanhe Tian, Yan Song

Main category: cs.CL

TL;DR: T3LLM is a framework using three LLMs (worker, reviewer, student) for time series question answering with explicit correction mechanisms to handle numerical reasoning errors.

Details

Motivation: Existing LLM-based approaches for time series question answering adopt general NLP techniques and are prone to reasoning errors with complex numerical sequences. Time series data's inherent verifiability enables consistency checking between reasoning steps and original input.

Method: Proposes T3LLM framework with three LLMs: worker generates step-wise chains of thought under structured prompts, reviewer inspects reasoning to identify errors and provide corrective comments, and student learns from corrected CoT through fine-tuning to internalize multi-step reasoning and self-correction.

Result: Experiments on multiple real-world TSQA benchmarks demonstrate that T3LLM achieves state-of-the-art performance over strong LLM-based baselines.

Conclusion: The T3LLM framework effectively addresses reasoning errors in time series question answering by leveraging the verifiability of time series data through a collaborative three-LLM approach with explicit correction mechanisms.

Abstract: With the advancement of large language models (LLMs), diverse time series analysis tasks are reformulated as time series question answering (TSQA) through a unified natural language interface. However, existing LLM-based approaches largely adopt general natural language processing techniques and are prone to reasoning errors when handling complex numerical sequences. Different from purely textual tasks, time series data are inherently verifiable, enabling consistency checking between reasoning steps and the original input. Motivated by this property, we propose T3LLM, which performs multi-step reasoning with an explicit correction mechanism for time series question answering. The T3LLM framework consists of three LLMs, namely, a worker, a reviewer, and a student, that are responsible for generation, review, and reasoning learning, respectively. Within this framework, the worker generates step-wise chains of thought (CoT) under structured prompts, while the reviewer inspects the reasoning, identifies erroneous steps, and provides corrective comments. The collaboratively generated corrected CoT are used to fine-tune the student model, internalizing multi-step reasoning and self-correction into its parameters. Experiments on multiple real-world TSQA benchmarks demonstrate that T3LLM achieves state-of-the-art performance over strong LLM-based baselines.

[19] M2G-Eval: Enhancing and Evaluating Multi-granularity Multilingual Code Generation

Fanglin Xu, Wei Zhang, Jian Yang, Guo Chen, Aishan Liu, Zhoujun Li, Xianglong Liu, Bryan Dai

Main category: cs.CL

TL;DR: M2G-Eval is a multi-granularity, multilingual framework for evaluating code generation in LLMs across 4 structural levels (Class, Function, Block, Line) and 18 programming languages, revealing hierarchical difficulty patterns and cross-language transferability.

Details

Motivation: Existing benchmarks assess code generation at single structural granularity and limited languages, obscuring fine-grained capability variations across different code scopes and multilingual scenarios.

Method: Introduced M2G-Eval framework with 17K+ training tasks and 1,286 human-annotated test instances across 18 languages. Developed M2G-Eval-Coder models by training Qwen3-8B with supervised fine-tuning and Group Relative Policy Optimization.

Result: Evaluation of 30 models revealed: (1) difficulty hierarchy with Line-level easiest and Class-level most challenging; (2) widening performance gaps between full- and partial-granularity languages as complexity increases; (3) strong cross-language correlations suggesting transferable programming concepts.

Conclusion: M2G-Eval enables fine-grained diagnosis of code generation capabilities and highlights persistent challenges in synthesizing complex, long-form code, providing insights into model capabilities across different structural granularities and languages.

Abstract: The rapid advancement of code large language models (LLMs) has sparked significant research interest in systematically evaluating their code generation capabilities, yet existing benchmarks predominantly assess models at a single structural granularity and focus on limited programming languages, obscuring fine-grained capability variations across different code scopes and multilingual scenarios. We introduce M2G-Eval, a multi-granularity, multilingual framework for evaluating code generation in large language models (LLMs) across four levels: Class, Function, Block, and Line. Spanning 18 programming languages, M2G-Eval includes 17K+ training tasks and 1,286 human-annotated, contamination-controlled test instances. We develop M2G-Eval-Coder models by training Qwen3-8B with supervised fine-tuning and Group Relative Policy Optimization. Evaluating 30 models (28 state-of-the-art LLMs plus our two M2G-Eval-Coder variants) reveals three main findings: (1) an apparent difficulty hierarchy, with Line-level tasks easiest and Class-level most challenging; (2) widening performance gaps between full- and partial-granularity languages as task complexity increases; and (3) strong cross-language correlations, suggesting that models learn transferable programming concepts. M2G-Eval enables fine-grained diagnosis of code generation capabilities and highlights persistent challenges in synthesizing complex, long-form code.

[20] On the Role of Discreteness in Diffusion LLMs

Ziqi Jin, Bin Wang, Xiang Lin, Lidong Bing, Aixin Sun

Main category: cs.CL

TL;DR: This paper analyzes diffusion models for language generation, identifying key challenges in applying diffusion principles to text due to its discrete and structured nature, and proposes properties for better alignment between diffusion mechanics and language requirements.

Details

Motivation: Diffusion models have appealing properties for language generation (parallel decoding, iterative refinement), but text's discrete and highly structured nature makes direct application of diffusion principles challenging. The paper aims to bridge this gap by analyzing how diffusion mechanics align with language-specific requirements.

Method: The authors analyze diffusion language modeling from two perspectives: diffusion process and language modeling. They outline five essential properties that separate diffusion mechanics from language requirements, categorize existing approaches into continuous diffusion in embedding space and discrete diffusion over tokens, and analyze their trade-offs. They also examine recent large diffusion language models to identify central issues.

Result: The analysis reveals that existing approaches (continuous diffusion in embedding space and discrete diffusion over tokens) each satisfy only part of the five essential properties, reflecting a structural trade-off. Two central issues are identified: (1) uniform corruption doesn’t respect how information is distributed across positions, and (2) token-wise marginal training cannot capture multi-token dependencies during parallel decoding.

Conclusion: The findings motivate the development of diffusion processes that align more closely with text structure, encouraging future work toward more coherent diffusion language models that better address the identified issues of information distribution and multi-token dependencies.

Abstract: Diffusion models offer appealing properties for language generation, such as parallel decoding and iterative refinement, but the discrete and highly structured nature of text challenges the direct application of diffusion principles. In this paper, we revisit diffusion language modeling from the view of diffusion process and language modeling, and outline five properties that separate diffusion mechanics from language-specific requirements. We first categorize existing approaches into continuous diffusion in embedding space and discrete diffusion over tokens. We then show that each satisfies only part of the five essential properties and therefore reflects a structural trade-off. Through analyses of recent large diffusion language models, we identify two central issues: (i) uniform corruption does not respect how information is distributed across positions, and (ii) token-wise marginal training cannot capture multi-token dependencies during parallel decoding. These observations motivate diffusion processes that align more closely with the structure of text, and encourage future work toward more coherent diffusion language models.

[21] Evaluating GRPO and DPO for Faithful Chain-of-Thought Reasoning in LLMs

Hadi Mohammadi, Tamas Kozak, Anastasia Giachanou

Main category: cs.CL

TL;DR: GRPO outperforms DPO for improving CoT faithfulness in LLMs, especially in larger models like Qwen2.5-14B-Instruct, offering a promising approach for more transparent reasoning.

Details

Motivation: Chain-of-thought reasoning often fails to reflect actual model reasoning, producing misleading justifications that undermine reliability for safety supervision and alignment monitoring.

Method: Evaluated two optimization methods: Group Relative Policy Optimization (GRPO) and Direct Preference Optimization (DPO) for improving CoT faithfulness across different model sizes.

Result: GRPO achieves higher performance than DPO in larger models, with Qwen2.5-14B-Instruct performing best. Both show positive correlation between model size and performance, but GRPO shows greater potential for improving faithfulness metrics.

Conclusion: GRPO offers a promising direction for developing more transparent and trustworthy reasoning in LLMs, addressing the faithfulness limitations of current CoT methods.

Abstract: Chain-of-thought (CoT) reasoning has emerged as a powerful technique for improving the problem-solving capabilities of large language models (LLMs), particularly for tasks requiring multi-step reasoning. However, recent studies show that CoT explanations often fail to reflect the model’s actual reasoning process, as models may produce coherent yet misleading justifications or modify answers without acknowledging external cues. Such discrepancies undermine the reliability of CoT-based methods for safety supervision and alignment monitoring, as models can generate plausible but deceptive rationales for incorrect answers. To better understand this limitation, we evaluate two optimization methods, Group Relative Policy Optimization (GRPO) and Direct Preference Optimization (DPO), in their ability to improve CoT faithfulness. Our experiments show that GRPO achieves higher performance than DPO in larger models, with the Qwen2.5-14B-Instruct model attaining the best results across all evaluation metrics. Both approaches exhibit positive correlations between model size and performance, but GRPO shows greater potential for improving faithfulness metrics, albeit with less stable behavior at smaller scales. These results suggest that GRPO offers a promising direction for developing more transparent and trustworthy reasoning in LLMs.

[22] Fragile Knowledge, Robust Instruction-Following: The Width Pruning Dichotomy in Llama-3.2

Pere Martra

Main category: cs.CL

TL;DR: GLU-MLP width pruning using MAW criterion shows selective capability effects: factual knowledge degrades but instruction-following improves (+46-75%), with robust reasoning and improved truthfulness as knowledge decreases.

Details

Motivation: To challenge the assumption that pruning causes uniform degradation across all model capabilities, and to systematically characterize how structured width pruning of GLU-MLP layers affects different cognitive abilities in language models.

Method: Structured width pruning of GLU-MLP layers guided by Maximum Absolute Weight (MAW) criterion, evaluating seven expansion ratio configurations across comprehensive benchmarks (MMLU, GSM8K, IFEval, MUSR, TruthfulQA-MC2).

Result: Pruning creates systematic dichotomy: factual knowledge (MMLU) and perplexity degrade predictably, but instruction-following improves substantially (+46-75% in IFEval), reasoning remains robust, and truthfulness improves as knowledge degrades (r=-0.864 correlation). Energy consumption reduces up to 23% with batch processing benefits.

Conclusion: Expansion ratio is a critical architectural parameter that selectively modulates cognitive capabilities, not just a compression metric. MAW-guided pruning acts as a selective filter, reducing parametric knowledge while preserving/enhancing behavioral alignment, connecting previously distinct research areas.

Abstract: Structured width pruning of GLU-MLP layers, guided by the Maximum Absolute Weight (MAW) criterion, reveals a systematic dichotomy in how reducing the expansion ratio affects different model capabilities. While performance on tasks relying on parametric knowledge (e.g., MMLU, GSM8K) and perplexity metrics degrades predictably, instruction-following capabilities improve substantially (+46% to +75% in IFEval for Llama-3.2-1B and 3B models), and multi-step reasoning remains robust (MUSR). This pattern challenges the prevailing assumption that pruning induces uniform degradation. We evaluated seven expansion ratio configurations using comprehensive benchmarks assessing factual knowledge, mathematical reasoning, language comprehension, instruction-following, and truthfulness. Our analysis identifies the expansion ratio as a critical architectural parameter that selectively modulates cognitive capabilities, rather than merely serving as a compression metric. We provide the first systematic characterization of this selective preservation phenomenon. Notably, we document a robust inverse correlation (r = -0.864, p = 0.012 in Llama-3B) between factual knowledge capacity (MMLU) and truthfulness metrics (TruthfulQA-MC2): as knowledge degrades, the model’s ability to discriminate misconceptions improves consistently. This connects two previously distinct research areas, demonstrating that MAW-guided width pruning acts as a selective filter, reducing parametric knowledge while preserving or enhancing behavioral alignment. Additionally, we quantify context-dependent efficiency trade-offs: pruned configurations achieve up to 23% reduction in energy consumption (J/token) but incur penalties in single-request latency, whereas batch processing workloads benefit uniformly.

[23] Conformal Prediction Sets for Next-Token Prediction in Large Language Models: Balancing Coverage Guarantees with Set Efficiency

Yoshith Roy Kotla, Varshith Roy Kotla

Main category: cs.CL

TL;DR: VACP framework improves conformal prediction efficiency for LLMs by 197x through semantic masking and temperature adjustment, reducing prediction sets from 847 to 4.3 tokens while maintaining 89.7% coverage.

Details

Motivation: LLMs need reliable uncertainty quantification for high-stakes applications, but standard softmax probabilities are poorly calibrated, and naive conformal prediction produces uninformatively large prediction sets (hundreds of tokens).

Method: Vocabulary-Aware Conformal Prediction (VACP) uses semantic masking to reduce the effective prediction space and temperature-adjusted scoring, provably maintaining marginal coverage while dramatically improving efficiency.

Result: On Gemma-2B with SQUAD and WikiText benchmarks, VACP achieves 89.7% empirical coverage (target 90%) while reducing mean prediction set size from 847 tokens to 4.3 tokens - a 197x efficiency improvement.

Conclusion: VACP successfully addresses the coverage-efficiency tradeoff in conformal prediction for LLMs, making uncertainty quantification practical for large-vocabulary models through vocabulary reduction techniques.

Abstract: Deploying large language models (LLMs) in high-stakes domains requires rigorous uncertainty quantification, yet standard softmax probabilities are often poorly calibrated. We present a systematic study of Adaptive Prediction Sets (APS) applied to next-token prediction in transformer-based models with large vocabularies (greater than 250,000 tokens). Our central contribution is the identification of a coverage-efficiency tradeoff: while naive conformal prediction achieves valid coverage, it produces prediction sets of hundreds of tokens, rendering them uninformative. We propose Vocabulary-Aware Conformal Prediction (VACP), a framework that leverages semantic masking and temperature-adjusted scoring to reduce the effective prediction space while provably maintaining marginal coverage. Experiments on Gemma-2B using SQUAD and WikiText benchmarks demonstrate that VACP achieves 89.7 percent empirical coverage (90 percent target) while reducing the mean prediction set size from 847 tokens to 4.3 tokens – a 197x improvement in efficiency. We provide a theoretical analysis of vocabulary reduction and release our implementation for reproducibility.

[24] GHaLIB: A Multilingual Framework for Hope Speech Detection in Low-Resource Languages

Ahmed Abdullah, Sana Fatima, Haroon Mahmood

Main category: cs.CL

TL;DR: Multilingual hope speech detection framework focusing on Urdu using transformer models, achieving strong performance on PolyHope-M 2025 benchmark.

Details

Motivation: Hope speech is underrepresented in NLP, especially for low-resource languages like Urdu. Current research focuses mainly on English, limiting tools for positive online communication. Transformer models have been effective for hate/offensive speech but not sufficiently applied to hope speech across diverse languages.

Method: Multilingual framework using pretrained transformer models (XLM-RoBERTa, mBERT, EuroBERT, UrduBERT) with simple preprocessing to train classifiers for hope speech detection.

Result: Strong performance on PolyHope-M 2025 benchmark: 95.2% F1-score for Urdu binary classification and 65.2% for Urdu multi-class classification. Competitive results also achieved for Spanish, German, and English.

Conclusion: Existing multilingual models can be effectively implemented in low-resource environments to identify hope speech, contributing to more constructive digital discourse and addressing the resource gap for underrepresented languages.

Abstract: Hope speech has been relatively underrepresented in Natural Language Processing (NLP). Current studies are largely focused on English, which has resulted in a lack of resources for low-resource languages such as Urdu. As a result, the creation of tools that facilitate positive online communication remains limited. Although transformer-based architectures have proven to be effective in detecting hate and offensive speech, little has been done to apply them to hope speech or, more generally, to test them across a variety of linguistic settings. This paper presents a multilingual framework for hope speech detection with a focus on Urdu. Using pretrained transformer models such as XLM-RoBERTa, mBERT, EuroBERT, and UrduBERT, we apply simple preprocessing and train classifiers for improved results. Evaluations on the PolyHope-M 2025 benchmark demonstrate strong performance, achieving F1-scores of 95.2% for Urdu binary classification and 65.2% for Urdu multi-class classification, with similarly competitive results in Spanish, German, and English. These results highlight the possibility of implementing existing multilingual models in low-resource environments, thus making it easier to identify hope speech and helping to build a more constructive digital discourse.

[25] Nested Browser-Use Learning for Agentic Information Seeking

Baixuan Li, Jialong Wu, Wenbiao Yin, Kuan Li, Zhongwang Zhang, Huifeng Yin, Zhengwei Tao, Liwen Zhang, Pengjun Xie, Jingren Zhou, Yong Jiang

Main category: cs.CL

TL;DR: NestBrowse introduces a nested browser-action framework that decouples interaction control from page exploration, enabling more effective deep-web information acquisition for information-seeking agents.

Details

Motivation: Current information-seeking agents are limited to API-level snippet retrieval and URL-based page fetching, restricting access to richer information available through real browsing. Full browser interaction could unlock deeper capabilities but introduces complexity due to fine-grained control and verbose page content returns.

Method: NestBrowse proposes a minimal and complete browser-action framework with a nested structure that decouples interaction control from page exploration. This design simplifies agentic reasoning while enabling effective deep-web information acquisition.

Result: Empirical results on challenging deep information-seeking benchmarks demonstrate that NestBrowse offers clear practical benefits. Further in-depth analyses underscore its efficiency and flexibility.

Conclusion: NestBrowse bridges the gap between limited API-level retrieval and complex full browser interaction, providing a framework that enables information-seeking agents to effectively access deeper web information through simplified browser control.

Abstract: Information-seeking (IS) agents have achieved strong performance across a range of wide and deep search tasks, yet their tool use remains largely restricted to API-level snippet retrieval and URL-based page fetching, limiting access to the richer information available through real browsing. While full browser interaction could unlock deeper capabilities, its fine-grained control and verbose page content returns introduce substantial complexity for ReAct-style function-calling agents. To bridge this gap, we propose Nested Browser-Use Learning (NestBrowse), which introduces a minimal and complete browser-action framework that decouples interaction control from page exploration through a nested structure. This design simplifies agentic reasoning while enabling effective deep-web information acquisition. Empirical results on challenging deep IS benchmarks demonstrate that NestBrowse offers clear benefits in practice. Further in-depth analyses underscore its efficiency and flexibility.

[26] Beg to Differ: Understanding Reasoning-Answer Misalignment Across Languages

Anaelia Ovalle, Candace Ross, Sebastian Ruder, Adina Williams, Karen Ullrich, Mark Ibrahim, Levent Sagun

Main category: cs.CL

TL;DR: Multilingual LLMs show high task accuracy but their reasoning often fails to logically support conclusions, especially in non-Latin scripts where misalignment is 2x worse than Latin scripts.

Details

Motivation: To investigate whether LLM reasoning quality transfers across languages, as current multilingual evaluation practices may provide incomplete picture of model reasoning capabilities.

Method: Introduced human-validated framework to evaluate if reasoning traces logically support conclusions across languages. Analyzed 65k reasoning traces from GlobalMMLU questions across 6 languages and 6 frontier models. Developed error taxonomy through human annotation.

Result: Found critical blind spot: models achieve high task accuracy but reasoning often fails to support conclusions. Reasoning traces in non-Latin scripts show at least 2x more misalignment between reasoning and conclusions than Latin scripts. Failures stem primarily from evidential errors (unsupported claims, ambiguous facts) followed by illogical reasoning steps.

Conclusion: Current multilingual evaluation practices provide incomplete picture of model reasoning capabilities, highlighting need for reasoning-aware evaluation frameworks that assess logical alignment between reasoning and conclusions across languages.

Abstract: Large language models demonstrate strong reasoning capabilities through chain-of-thought prompting, but whether this reasoning quality transfers across languages remains underexplored. We introduce a human-validated framework to evaluate whether model-generated reasoning traces logically support their conclusions across languages. Analyzing 65k reasoning traces from GlobalMMLU questions across 6 languages and 6 frontier models, we uncover a critical blind spot: while models achieve high task accuracy, their reasoning can fail to support their conclusions. Reasoning traces in non-Latin scripts show at least twice as much misalignment between their reasoning and conclusions than those in Latin scripts. We develop an error taxonomy through human annotation to characterize these failures, finding they stem primarily from evidential errors (unsupported claims, ambiguous facts) followed by illogical reasoning steps. Our findings demonstrate that current multilingual evaluation practices provide an incomplete picture of model reasoning capabilities and highlight the need for reasoning-aware evaluation frameworks.

Sashank Chapala, Maksym Mironov, Songgaojun Deng

Main category: cs.CL

TL;DR: Minimal prompt wording can mitigate social desirability bias in LLM-based population sampling, with reformulated prompts being most effective.

Details

Motivation: LLMs used for "Silicon Sampling" exhibit Social Desirability Bias (SDB) toward socially acceptable answers, diverging from real human data, but existing studies on mitigating this bias are limited.

Method: Used ANES data with Llama-3.1 and GPT-4.1-mini models; tested four prompt-based mitigation methods: reformulated (neutral third-person), reverse-coded (semantic inversion), priming (encouraging analytics), and preamble (encouraging sincerity). Evaluated alignment using Jensen-Shannon Divergence with bootstrap confidence intervals.

Result: Reformulated prompts most effectively improved alignment by reducing concentration on socially acceptable answers and achieving distributions closer to ANES. Reverse-coding had mixed results, while Priming and Preamble encouraged response uniformity without systematic bias mitigation benefits.

Conclusion: Prompt-based framing controls can effectively mitigate inherent Social Desirability Bias in LLMs, providing a practical path toward more representative silicon samples for population simulation.

Abstract: Large Language Models (LLMs) are increasingly used to simulate population responses, a method known as ``Silicon Sampling’’. However, responses to socially sensitive questions frequently exhibit Social Desirability Bias (SDB), diverging from real human data toward socially acceptable answers. Existing studies on social desirability bias in LLM-based sampling remain limited. In this work, we investigate whether minimal, psychologically grounded prompt wording can mitigate this bias and improve alignment between silicon and human samples. We conducted a study using data from the American National Election Study (ANES) on three LLMs from two model families: the open-source Llama-3.1 series and GPT-4.1-mini. We first replicate a baseline silicon sampling study, confirming the persistent Social Desirability Bias. We then test four prompt-based mitigation methods: \emph{reformulated} (neutral, third-person phrasing), \emph{reverse-coded} (semantic inversion), and two meta-instructions, \emph{priming} and \emph{preamble}, respectively encouraging analytics and sincerity. Alignment with ANES is evaluated using Jensen-Shannon Divergence with bootstrap confidence intervals. Our results demonstrate that reformulated prompts most effectively improve alignment by reducing distribution concentration on socially acceptable answers and achieving distributions closer to ANES. Reverse-coding produced mixed results across eligible items, while the Priming and Preamble encouraged response uniformity and showed no systematic benefit for bias mitigation. Our findings validate the efficacy of prompt-based framing controls in mitigating inherent Social Desirability Bias in LLMs, providing a practical path toward more representative silicon samples.

[28] Data Augmentation for Classification of Negative Pregnancy Outcomes in Imbalanced Data

Md Badsha Biswas

Main category: cs.CL

TL;DR: This paper proposes using social media data (Twitter) to enhance pregnancy outcome research by building an NLP pipeline to identify women’s pregnancy experiences and classify outcomes as positive or negative.

Details

Motivation: Infant mortality remains a significant public health issue with birth defects as a leading cause. There's a need for more comprehensive research and intervention strategies for negative pregnancy outcomes (miscarriage, stillbirths, birth defects, premature birth). Current datasets are limited, and social media data offers a novel, publicly available resource to supplement existing research.

Method: The paper introduces an NLP pipeline to process social media data (Twitter) by: 1) Using robust preprocessing techniques to handle data imbalance, noise, and lack of structure; 2) Implementing data augmentation strategies; 3) Automatically identifying women sharing pregnancy experiences; 4) Classifying outcomes into positive cases (full gestation, normal birth weight) and negative cases (negative pregnancy outcomes).

Result: The study demonstrates the viability of social media data as an adjunctive resource for epidemiological investigations of pregnancy outcomes. It provides a framework for identifying pregnant cohorts and comparator groups, and offers potential applications for assessing causal impacts of interventions, treatments, or prenatal exposures on maternal and fetal health.

Conclusion: Social media data can effectively supplement traditional datasets for pregnancy outcome research. The proposed NLP pipeline enables automated identification and classification of pregnancy experiences, creating opportunities for larger-scale observational studies and causal impact assessments in maternal-fetal health research.

Abstract: Infant mortality remains a significant public health concern in the United States, with birth defects identified as a leading cause. Despite ongoing efforts to understand the causes of negative pregnancy outcomes like miscarriage, stillbirths, birth defects, and premature birth, there is still a need for more comprehensive research and strategies for intervention. This paper introduces a novel approach that uses publicly available social media data, especially from platforms like Twitter, to enhance current datasets for studying negative pregnancy outcomes through observational research. The inherent challenges in utilizing social media data, including imbalance, noise, and lack of structure, necessitate robust preprocessing techniques and data augmentation strategies. By constructing a natural language processing (NLP) pipeline, we aim to automatically identify women sharing their pregnancy experiences, categorizing them based on reported outcomes. Women reporting full gestation and normal birth weight will be classified as positive cases, while those reporting negative pregnancy outcomes will be identified as negative cases. Furthermore, this study offers potential applications in assessing the causal impact of specific interventions, treatments, or prenatal exposures on maternal and fetal health outcomes. Additionally, it provides a framework for future health studies involving pregnant cohorts and comparator groups. In a broader context, our research showcases the viability of social media data as an adjunctive resource in epidemiological investigations about pregnancy outcomes.

[29] WeDLM: Reconciling Diffusion Language Models with Standard Causal Attention for Fast Inference

Aiwei Liu, Minghua He, Shaoxun Zeng, Sijun Zhang, Linhao Zhang, Chuhan Wu, Wei Jia, Yuan Liu, Xiao Zhou, Jie Zhou

Main category: cs.CL

TL;DR: WeDLM is a diffusion decoding framework using causal attention for parallel generation, achieving up to 10x speedup over optimized AR engines like vLLM while maintaining quality.

Details

Motivation: Autoregressive (AR) generation in LLMs suffers from limited parallelism due to token-by-token decoding. While Diffusion Language Models (DLLMs) offer parallel decoding, they often fail to achieve practical speed gains because bidirectional attention breaks prefix KV caching, forcing repeated contextualization and undermining efficiency.

Method: WeDLM uses standard causal attention to make parallel generation prefix-cache friendly. It employs Topological Reordering to move observed tokens to the physical prefix while preserving logical positions, allowing masked positions to condition on all observed tokens with strict causal masking. A streaming decoding procedure continuously commits confident tokens into a growing left-to-right prefix while maintaining fixed parallel workload.

Result: WeDLM preserves the quality of strong AR backbones while delivering substantial speedups: approaching 3x on challenging reasoning benchmarks and up to 10x in low-entropy generation regimes. Comparisons against AR baselines served by vLLM under matched deployment settings show diffusion-style decoding can outperform optimized AR engines in practice.

Conclusion: WeDLM demonstrates that diffusion-style decoding with causal attention can achieve practical speed advantages over optimized AR engines while maintaining generation quality, making parallel generation prefix-cache friendly and avoiding the efficiency limitations of bidirectional attention in traditional DLLMs.

Abstract: Autoregressive (AR) generation is the standard decoding paradigm for Large Language Models (LLMs), but its token-by-token nature limits parallelism at inference time. Diffusion Language Models (DLLMs) offer parallel decoding by recovering multiple masked tokens per step; however, in practice they often fail to translate this parallelism into deployment speed gains over optimized AR engines (e.g., vLLM). A key reason is that many DLLMs rely on bidirectional attention, which breaks standard prefix KV caching and forces repeated contextualization, undermining efficiency. We propose WeDLM, a diffusion decoding framework built entirely on standard causal attention to make parallel generation prefix-cache friendly. The core idea is to let each masked position condition on all currently observed tokens while keeping a strict causal mask, achieved by Topological Reordering that moves observed tokens to the physical prefix while preserving their logical positions. Building on this property, we introduce a streaming decoding procedure that continuously commits confident tokens into a growing left-to-right prefix and maintains a fixed parallel workload, avoiding the stop-and-wait behavior common in block diffusion methods. Experiments show that WeDLM preserves the quality of strong AR backbones while delivering substantial speedups, approaching 3x on challenging reasoning benchmarks and up to 10x in low-entropy generation regimes; critically, our comparisons are against AR baselines served by vLLM under matched deployment settings, demonstrating that diffusion-style decoding can outperform an optimized AR engine in practice.

[30] Harnessing Large Language Models for Biomedical Named Entity Recognition

Jian Chen, Leilei Su, Cong Sun

Main category: cs.CL

TL;DR: BioSelectTune: A data-centric framework using Hybrid Superfiltering to create high-quality training data for biomedical NER, achieving SOTA performance with only 50% of curated data.

Details

Motivation: General-domain LLMs struggle with biomedical NER due to lack of domain knowledge and performance degradation from low-quality training data. Need for efficient fine-tuning that prioritizes data quality over quantity.

Method: Reformulates BioNER as structured JSON generation task. Uses Hybrid Superfiltering strategy - a weak-to-strong data curation method where a homologous weak model distills a compact, high-impact training dataset.

Result: Achieves state-of-the-art performance across multiple BioNER benchmarks. Model trained on only 50% of curated positive data surpasses fully-trained baseline and outperforms domain-specialized models like BioMedBERT.

Conclusion: BioSelectTune demonstrates that prioritizing data quality through intelligent curation enables highly efficient fine-tuning of LLMs for biomedical NER, achieving superior performance with significantly less training data.

Abstract: Background and Objective: Biomedical Named Entity Recognition (BioNER) is a foundational task in medical informatics, crucial for downstream applications like drug discovery and clinical trial matching. However, adapting general-domain Large Language Models (LLMs) to this task is often hampered by their lack of domain-specific knowledge and the performance degradation caused by low-quality training data. To address these challenges, we introduce BioSelectTune, a highly efficient, data-centric framework for fine-tuning LLMs that prioritizes data quality over quantity. Methods and Results: BioSelectTune reformulates BioNER as a structured JSON generation task and leverages our novel Hybrid Superfiltering strategy, a weak-to-strong data curation method that uses a homologous weak model to distill a compact, high-impact training dataset. Conclusions: Through extensive experiments, we demonstrate that BioSelectTune achieves state-of-the-art (SOTA) performance across multiple BioNER benchmarks. Notably, our model, trained on only 50% of the curated positive data, not only surpasses the fully-trained baseline but also outperforms powerful domain-specialized models like BioMedBERT.

Dongning Rao, Yunbiao Zeng, Zhihua Jiang, Jujian Lv

Main category: cs.CL

TL;DR: TEXT model combines MLLM-generated explanations with temporal alignment using Mamba and cross-attention, achieving state-of-the-art MSA performance across multiple datasets.

Details

Motivation: Existing MSA approaches underutilize explanations and temporal alignments between modalities, despite their importance for understanding subtle emotions in human-interaction applications.

Method: 1) Augments explanations via Multi-modal Large Language Models; 2) Aligns audio/video representations through temporality-oriented neural network block combining Mamba and temporal cross-attention; 3) Uses text-routed sparse mixture-of-experts with gate fusion.

Result: Achieves best performance across four datasets, outperforming three recent approaches and three MLLMs. Wins on at least 4 out of 6 metrics. Reduces MAE to 0.353 on CH-SIMS (13.5% improvement).

Conclusion: TEXT demonstrates the effectiveness of combining MLLM-generated explanations with temporal alignment for superior multi-modal sentiment analysis performance.

Abstract: Human-interaction-involved applications underscore the need for Multi-modal Sentiment Analysis (MSA). Although many approaches have been proposed to address the subtle emotions in different modalities, the power of explanations and temporal alignments is still underexplored. Thus, this paper proposes the Text-routed sparse mixture-of-Experts model with eXplanation and Temporal alignment for MSA (TEXT). TEXT first augments explanations for MSA via Multi-modal Large Language Models (MLLM), and then novelly aligns the epresentations of audio and video through a temporality-oriented neural network block. TEXT aligns different modalities with explanations and facilitates a new text-routed sparse mixture-of-experts with gate fusion. Our temporal alignment block merges the benefits of Mamba and temporal cross-attention. As a result, TEXT achieves the best performance cross four datasets among all tested models, including three recently proposed approaches and three MLLMs. TEXT wins on at least four metrics out of all six metrics. For example, TEXT decreases the mean absolute error to 0.353 on the CH-SIMS dataset, which signifies a 13.5% decrement compared with recently proposed approaches.

[32] Fake News Classification in Urdu: A Domain Adaptation Approach for a Low-Resource Language

Muhammad Zain Ali, Bernhard Pfahringer, Tony Smith

Main category: cs.CL

TL;DR: Domain adaptation before fine-tuning improves fake news detection in Urdu for XLM-R but shows mixed results for mBERT.

Details

Motivation: Urdu, as a low-resource language, has received limited attention in misinformation detection research, and multilingual models struggle with domain-specific terms in fake news classification.

Method: Used staged training approach with domain-adaptive pretraining on Urdu news corpus before fine-tuning for classification; evaluated XLM-RoBERTa and mBERT models on four Urdu fake news datasets.

Result: Domain-adapted XLM-R consistently outperformed its vanilla counterpart across all datasets, while domain-adapted mBERT showed mixed results.

Conclusion: Domain adaptation before fine-tuning can significantly improve fake news detection performance in low-resource languages like Urdu, particularly for XLM-R models, highlighting the importance of domain-specific pretraining.

Abstract: Misinformation on social media is a widely acknowledged issue, and researchers worldwide are actively engaged in its detection. However, low-resource languages such as Urdu have received limited attention in this domain. An obvious approach is to utilize a multilingual pretrained language model and fine-tune it for a downstream classification task, such as misinformation detection. However, these models struggle with domain-specific terms, leading to suboptimal performance. To address this, we investigate the effectiveness of domain adaptation before fine-tuning for fake news classification in Urdu, employing a staged training approach to optimize model generalization. We evaluate two widely used multilingual models, XLM-RoBERTa and mBERT, and apply domain-adaptive pretraining using a publicly available Urdu news corpus. Experiments on four publicly available Urdu fake news datasets show that domain-adapted XLM-R consistently outperforms its vanilla counterpart, while domain-adapted mBERT exhibits mixed results.

[33] CNSight: Evaluation of Clinical Note Segmentation Tools

Risha Surana, Adrian Law, Sunwoo Kim, Rishab Sridhar, Angxiao Han, Peiyu Hong

Main category: cs.CL

TL;DR: This paper evaluates different methods for segmenting clinical notes into sections, finding that large API-based models like GPT-5-mini perform best overall with 72.4 average F1 score.

Details

Motivation: Clinical notes are typically unstructured or semi-structured after extraction from EMR systems, making them difficult to use for secondary analysis and downstream applications. Reliable section boundary identification is crucial for structuring these notes since different sections (like history of present illness, medications, discharge instructions) provide distinct clinical contexts.

Method: The researchers evaluated three approaches: rule-based baselines, domain-specific transformer models, and large language models. They used a curated dataset of 1,000 notes from MIMIC-IV for clinical note segmentation, testing performance on both sentence-level and freetext segmentation tasks.

Result: Large API-based models achieved the best overall performance, with GPT-5-mini reaching a best average F1 score of 72.4 across both sentence-level and freetext segmentation. Lightweight baselines remained competitive on structured sentence-level tasks but performed poorly on unstructured freetext segmentation.

Conclusion: The results provide guidance for method selection in clinical note segmentation and lay the groundwork for downstream tasks such as information extraction, cohort identification, and automated summarization of clinical notes.

Abstract: Clinical notes are often stored in unstructured or semi-structured formats after extraction from electronic medical record (EMR) systems, which complicates their use for secondary analysis and downstream clinical applications. Reliable identification of section boundaries is a key step toward structuring these notes, as sections such as history of present illness, medications, and discharge instructions each provide distinct clinical contexts. In this work, we evaluate rule-based baselines, domain-specific transformer models, and large language models for clinical note segmentation using a curated dataset of 1,000 notes from MIMIC-IV. Our experiments show that large API-based models achieve the best overall performance, with GPT-5-mini reaching a best average F1 of 72.4 across sentence-level and freetext segmentation. Lightweight baselines remain competitive on structured sentence-level tasks but falter on unstructured freetext. Our results provide guidance for method selection and lay the groundwork for downstream tasks such as information extraction, cohort identification, and automated summarization.

[34] NepEMO: A Multi-Label Emotion and Sentiment Analysis on Nepali Reddit with Linguistic Insights and Temporal Trends

Sameer Sitoula, Tej Bahadur Shahi, Laxmi Prasad Bhatt, Anisha Pokhrel, Arjun Neupane

Main category: cs.CL

TL;DR: NepEMO: A novel multi-label emotion and sentiment classification dataset for Nepali Reddit posts with 4,462 annotated entries covering five emotions and three sentiment classes.

Details

Motivation: Social media platforms like Reddit provide unique spaces for anonymous expression of sensitive issues, but there's a lack of datasets for analyzing emotions and sentiment in Nepali social media content, particularly during challenging events and joyful occasions.

Method: Created NepEMO dataset with 4,462 manually annotated Reddit posts (Jan 2019-Jun 2025) in English, Romanized Nepali, and Devanagari script. Performed linguistic analysis including emotion trends, emotion co-occurrence, sentiment n-grams, and topic modeling using LDA and TF-IDF. Compared traditional ML, DL, and transformer models for classification tasks.

Result: Transformer models consistently outperformed both traditional machine learning and deep learning models for both multi-label emotion classification and sentiment classification tasks.

Conclusion: The NepEMO dataset enables emotion and sentiment analysis in Nepali social media, with transformer models proving most effective for these classification tasks, providing valuable insights into emotional expression patterns in online Nepali communities.

Abstract: Social media (SM) platforms (e.g. Facebook, Twitter, and Reddit) are increasingly leveraged to share opinions and emotions, specifically during challenging events, such as natural disasters, pandemics, and political elections, and joyful occasions like festivals and celebrations. Among the SM platforms, Reddit provides a unique space for its users to anonymously express their experiences and thoughts on sensitive issues such as health and daily life. In this work, we present a novel dataset, called NepEMO, for multi-label emotion (MLE) and sentiment classification (SC) on the Nepali subreddit post. We curate and build a manually annotated dataset of 4,462 posts (January 2019- June 2025) written in English, Romanised Nepali and Devanagari script for five emotions (fear, anger, sadness, joy, and depression) and three sentiment classes (positive, negative, and neutral). We perform a detailed analysis of posts to capture linguistic insights, including emotion trends, co-occurrence of emotions, sentiment-specific n-grams, and topic modelling using Latent Dirichlet Allocation and TF-IDF keyword extraction. Finally, we compare various traditional machine learning (ML), deep learning (DL), and transformer models for MLE and SC tasks. The result shows that transformer models consistently outperform the ML and DL models for both tasks.

[35] AutoForge: Automated Environment Synthesis for Agentic Reinforcement Learning

Shihao Cai, Runnan Fang, Jialong Wu, Baixuan Li, Xinyu Wang, Yong Jiang, Liangcai Su, Liwen Zhang, Wenbiao Yin, Zhen Zhang, Fuli Feng, Pengjun Xie, Xiaobin Wang

Main category: cs.CL

TL;DR: Automated pipeline for synthesizing challenging simulated environments and environment-level RL algorithm for training language agents more efficiently and stably.

Details

Motivation: Previous RL approaches for language agents used limited semi-automated environment synthesis or easy tasks, plus suffer from simulated user instability and environment heterogeneity, hindering effective agent training.

Method: Two key components: (1) unified automated pipeline for scalable synthesis of high-difficulty but easily verifiable task environments; (2) environment-level RL algorithm that mitigates user instability and performs advantage estimation at environment level.

Result: Comprehensive evaluations on agentic benchmarks (tau-bench, tau2-Bench, VitaBench) validate effectiveness, with in-depth analyses showing strong out-of-domain generalization.

Conclusion: The proposed approach enables more effective and stable reinforcement learning for language agents through automated environment synthesis and environment-level training algorithms.

Abstract: Conducting reinforcement learning (RL) in simulated environments offers a cost-effective and highly scalable way to enhance language-based agents. However, previous work has been limited to semi-automated environment synthesis or tasks lacking sufficient difficulty, offering little breadth or depth. In addition, the instability of simulated users integrated into these environments, along with the heterogeneity across simulated environments, poses further challenges for agentic RL. In this work, we propose: (1) a unified pipeline for automated and scalable synthesis of simulated environments associated with high-difficulty but easily verifiable tasks; and (2) an environment level RL algorithm that not only effectively mitigates user instability but also performs advantage estimation at the environment level, thereby improving training efficiency and stability. Comprehensive evaluations on agentic benchmarks, including tau-bench, tau2-Bench, and VitaBench, validate the effectiveness of our proposed method. Further in-depth analyses underscore its out-of-domain generalization.

[36] Diversity or Precision? A Deep Dive into Next Token Prediction

Haoyuan Wu, Hai Wang, Jiajia Wu, Jinxiang Ou, Keyao Wang, Weile Chen, Zihao Zheng, Bei Yu

Main category: cs.CL

TL;DR: The paper proposes a generalized pre-training objective that adapts RL principles to supervised learning to reshape LLM token distributions for better RL exploration, finding precision-oriented priors outperform high-entropy distributions.

Details

Motivation: The effectiveness of RL training for improving LLM reasoning depends critically on the exploration space defined by the pre-trained model's token-output distribution. Current cross-entropy loss may not provide optimal exploration potential for subsequent RL training.

Method: Proposes a generalized pre-training objective that frames next-token prediction as a stochastic decision process. Uses reward-shaping with positive reward scaling to control probability concentration on ground-truth tokens and rank-aware mechanism for asymmetric treatment of high/low-ranking negative tokens.

Result: Contrary to intuition that higher distribution entropy facilitates effective exploration, the study finds that imposing a precision-oriented prior yields a superior exploration space for RL, ultimately enhancing end-to-end reasoning performance.

Conclusion: The paper demonstrates that systematically reshaping pre-trained token-output distributions using RL-inspired objectives can provide more favorable exploration spaces for subsequent RL training, leading to improved reasoning capabilities in LLMs.

Abstract: Recent advancements have shown that reinforcement learning (RL) can substantially improve the reasoning abilities of large language models (LLMs). The effectiveness of such RL training, however, depends critically on the exploration space defined by the pre-trained model’s token-output distribution. In this paper, we revisit the standard cross-entropy loss, interpreting it as a specific instance of policy gradient optimization applied within a single-step episode. To systematically study how the pre-trained distribution shapes the exploration potential for subsequent RL, we propose a generalized pre-training objective that adapts on-policy RL principles to supervised learning. By framing next-token prediction as a stochastic decision process, we introduce a reward-shaping strategy that explicitly balances diversity and precision. Our method employs a positive reward scaling factor to control probability concentration on ground-truth tokens and a rank-aware mechanism that treats high-ranking and low-ranking negative tokens asymmetrically. This allows us to reshape the pre-trained token-output distribution and investigate how to provide a more favorable exploration space for RL, ultimately enhancing end-to-end reasoning performance. Contrary to the intuition that higher distribution entropy facilitates effective exploration, we find that imposing a precision-oriented prior yields a superior exploration space for RL.

[37] Prompt engineering does not universally improve Large Language Model performance across clinical decision-making tasks

Mengdi Chai, Ali R. Zomorrodi

Main category: cs.CL

TL;DR: LLMs show variable performance in clinical decision support tasks, with prompt engineering helping some tasks but hurting others, requiring tailored approaches for healthcare integration.

Details

Motivation: While LLMs show promise in medical knowledge assessments, their practical utility in real-world clinical decision-making remains underexplored, particularly across the full clinical reasoning workflow.

Method: Evaluated ChatGPT-4o, Gemini 1.5 Pro, and LIama 3.3 70B on 36 case studies across five clinical decision-making tasks under two temperature settings, then applied MedPrompt framework variations with targeted and random dynamic few-shot learning.

Result: Models showed high task variability: near-perfect accuracy in final diagnosis, poor performance in diagnostic testing, moderate in other tasks. Temperature effects varied by model. Prompt engineering improved lowest-performing tasks but was counterproductive for others. Targeted few-shot prompting didn’t consistently outperform random selection.

Conclusion: Prompt engineering impact is highly model and task-dependent, requiring tailored, context-aware strategies for integrating LLMs into healthcare rather than one-size-fits-all solutions.

Abstract: Large Language Models (LLMs) have demonstrated promise in medical knowledge assessments, yet their practical utility in real-world clinical decision-making remains underexplored. In this study, we evaluated the performance of three state-of-the-art LLMs-ChatGPT-4o, Gemini 1.5 Pro, and LIama 3.3 70B-in clinical decision support across the entire clinical reasoning workflow of a typical patient encounter. Using 36 case studies, we first assessed LLM’s out-of-the-box performance across five key sequential clinical decision-making tasks under two temperature settings (default vs. zero): differential diagnosis, essential immediate steps, relevant diagnostic testing, final diagnosis, and treatment recommendation. All models showed high variability by task, achieving near-perfect accuracy in final diagnosis, poor performance in relevant diagnostic testing, and moderate performance in remaining tasks. Furthermore, ChatGPT performed better under the zero temperature, whereas LIama showed stronger performance under the default temperature. Next, we assessed whether prompt engineering could enhance LLM performance by applying variations of the MedPrompt framework, incorporating targeted and random dynamic few-shot learning. The results demonstrate that prompt engineering is not a one-size-fit-all solution. While it significantly improved the performance on the task with lowest baseline accuracy (relevant diagnostic testing), it was counterproductive for others. Another key finding was that the targeted dynamic few-shot prompting did not consistently outperform random selection, indicating that the presumed benefits of closely matched examples may be counterbalanced by loss of broader contextual diversity. These findings suggest that the impact of prompt engineering is highly model and task-dependent, highlighting the need for tailored, context-aware strategies for integrating LLMs into healthcare.

[38] Improving Generalization in LLM Structured Pruning via Function-Aware Neuron Grouping

Tao Yu, Yongqi An, Kuan Zhu, Guibo Zhu, Ming Tang, Jinqiao Wang

Main category: cs.CL

TL;DR: FANG is a post-training pruning framework that groups neurons by function to reduce calibration bias, achieving SOTA results with 1.5-8.5% accuracy improvements over existing methods at 30-40% sparsity.

Details

Motivation: LLMs have high computational/storage costs, and existing post-training pruning methods suffer from limited generalization when calibration sets don't reflect pretraining data distribution, leading to calibration bias.

Method: Function-Aware Neuron Grouping (FANG) groups neurons by semantic context type, prunes each group independently with weighted importance estimation, preserves cross-context neurons, and adaptively allocates sparsity based on block complexity.

Result: FANG improves downstream accuracy while preserving language modeling performance, achieving SOTA results when combined with FLAP and OBC, outperforming them by 1.5-8.5% average accuracy at 30-40% sparsity.

Conclusion: FANG effectively addresses calibration bias in LLM pruning through function-aware neuron grouping, enabling better sparsity-performance trade-offs and improved generalization to downstream tasks.

Abstract: Large Language Models (LLMs) demonstrate impressive performance across natural language tasks but incur substantial computational and storage costs due to their scale. Post-training structured pruning offers an efficient solution. However, when few-shot calibration sets fail to adequately reflect the pretraining data distribution, existing methods exhibit limited generalization to downstream tasks. To address this issue, we propose Function-Aware Neuron Grouping (FANG), a post-training pruning framework that alleviates calibration bias by identifying and preserving neurons critical to specific function. FANG groups neurons with similar function based on the type of semantic context they process and prunes each group independently. During importance estimation within each group, tokens that strongly correlate with the functional role of the neuron group are given higher weighting. Additionally, FANG also preserves neurons that contribute across multiple context types. To achieve a better trade-off between sparsity and performance, it allocates sparsity to each block adaptively based on its functional complexity. Experiments show that FANG improves downstream accuracy while preserving language modeling performance. It achieves the state-of-the-art (SOTA) results when combined with FLAP and OBC, two representative pruning methods. Specifically, FANG outperforms FLAP and OBC by 1.5%–8.5% in average accuracy under 30% and 40% sparsity.

[39] LENS: LLM-Enabled Narrative Synthesis for Mental Health by Aligning Multimodal Sensing with Language Models

Wenxuan Xu, Arvind Pillai, Subigya Nepal, Amanda C Collins, Daniel M Mackin, Michael V Heinz, Tess Z Griffin, Nicholas C Jacobson, Andrew Campbell

Main category: cs.CL

TL;DR: LENS is a framework that aligns multimodal health sensor data with language models to generate clinically meaningful mental health narratives from time-series behavioral signals.

Details

Motivation: Current LLMs cannot natively process long-duration sensor streams, and there's a scarcity of paired sensor-text datasets for mental health applications, making it difficult to translate numerical time-series measurements into natural language for clinical assessment.

Method: LENS constructs a large-scale dataset by transforming Ecological Momentary Assessment (EMA) responses into natural-language descriptions (100k+ sensor-text QA pairs from 258 participants). It trains a patch-level encoder that projects raw sensor signals directly into an LLM’s representation space to enable native time-series integration.

Result: LENS outperforms strong baselines on standard NLP metrics and task-specific symptom-severity accuracy measures. A user study with 13 mental-health professionals indicates that LENS-produced narratives are comprehensive and clinically meaningful.

Conclusion: LENS advances LLMs as interfaces for health sensing, providing a scalable path toward models that can reason over raw behavioral signals and support downstream clinical decision-making for mental health assessment.

Abstract: Multimodal health sensing offers rich behavioral signals for assessing mental health, yet translating these numerical time-series measurements into natural language remains challenging. Current LLMs cannot natively ingest long-duration sensor streams, and paired sensor-text datasets are scarce. To address these challenges, we introduce LENS, a framework that aligns multimodal sensing data with language models to generate clinically grounded mental-health narratives. LENS first constructs a large-scale dataset by transforming Ecological Momentary Assessment (EMA) responses related to depression and anxiety symptoms into natural-language descriptions, yielding over 100,000 sensor-text QA pairs from 258 participants. To enable native time-series integration, we train a patch-level encoder that projects raw sensor signals directly into an LLM’s representation space. Our results show that LENS outperforms strong baselines on standard NLP metrics and task-specific measures of symptom-severity accuracy. A user study with 13 mental-health professionals further indicates that LENS-produced narratives are comprehensive and clinically meaningful. Ultimately, our approach advances LLMs as interfaces for health sensing, providing a scalable path toward models that can reason over raw behavioral signals and support downstream clinical decision-making.

[40] Is Chain-of-Thought Really Not Explainability? Chain-of-Thought Can Be Faithful without Hint Verbalization

Kerem Zaman, Shashank Srivastava

Main category: cs.CL

TL;DR: The paper critiques the Biasing Features metric for labeling CoTs as unfaithful when they omit prompt-injected hints, arguing this confuses unfaithfulness with necessary incompleteness in compressing transformer computation into linear narratives.

Details

Motivation: To challenge the validity of the Biasing Features metric for evaluating Chain-of-Thought (CoT) faithfulness, showing it mislabels incomplete CoTs as unfaithful when they fail to verbalize all prompt-injected hints.

Method: Used multi-hop reasoning tasks with Llama-3 and Gemma-3 models, introduced a new faithful@k metric, analyzed token budget effects on hint verbalization, and applied Causal Mediation Analysis to trace non-verbalized hint influence.

Result: Many CoTs flagged as unfaithful by Biasing Features were judged faithful by other metrics (exceeding 50% in some models). Larger token budgets increased hint verbalization up to 90%, and Causal Mediation Analysis showed non-verbalized hints still causally mediate predictions through CoTs.

Conclusion: Researchers should avoid relying solely on hint-based evaluations and instead use a broader interpretability toolkit including causal mediation and corruption-based metrics, as apparent unfaithfulness often stems from token limits rather than actual unfaithfulness.

Abstract: Recent work, using the Biasing Features metric, labels a CoT as unfaithful if it omits a prompt-injected hint that affected the prediction. We argue this metric confuses unfaithfulness with incompleteness, the lossy compression needed to turn distributed transformer computation into a linear natural language narrative. On multi-hop reasoning tasks with Llama-3 and Gemma-3, many CoTs flagged as unfaithful by Biasing Features are judged faithful by other metrics, exceeding 50% in some models. With a new faithful@k metric, we show that larger inference-time token budgets greatly increase hint verbalization (up to 90% in some settings), suggesting much apparent unfaithfulness is due to tight token limits. Using Causal Mediation Analysis, we further show that even non-verbalized hints can causally mediate prediction changes through the CoT. We therefore caution against relying solely on hint-based evaluations and advocate a broader interpretability toolkit, including causal mediation and corruption-based metrics.

[41] Accelerating Language Model Workflows with Prompt Choreography

TJ Bai, Jason Eisner

Main category: cs.CL

TL;DR: Prompt Choreography is a framework that speeds up multi-agent LLM workflows by using a global KV cache to avoid redundant computation, achieving 2-6× faster latency and >2.2× end-to-end speedups.

Details

Motivation: As LLMs are increasingly deployed in multi-agent workflows, there's a need to improve efficiency by reducing redundant computation across multiple LLM calls that often process similar or overlapping content.

Method: Introduces a dynamic global KV cache that stores encoded messages. Each LLM call can attend to arbitrary subsets of previously encoded messages, supporting parallel calls. Fine-tunes LLMs to work effectively with cached encodings rather than always re-encoding.

Result: Achieves 2.0-6.2× faster time-to-first-token per message and >2.2× end-to-end speedups in workflows dominated by redundant computation. Fine-tuning helps LLMs mimic original results despite potential differences from using cached encodings.

Conclusion: Prompt Choreography provides an effective framework for accelerating multi-agent LLM workflows through intelligent caching, significantly reducing latency and computational overhead while maintaining result quality.

Abstract: Large language models are increasingly deployed in multi-agent workflows. We introduce Prompt Choreography, a framework that efficiently executes LLM workflows by maintaining a dynamic, global KV cache. Each LLM call can attend to an arbitrary, reordered subset of previously encoded messages. Parallel calls are supported. Though caching messages’ encodings sometimes gives different results from re-encoding them in a new context, we show in diverse settings that fine-tuning the LLM to work with the cache can help it mimic the original results. Prompt Choreography significantly reduces per-message latency (2.0–6.2$\times$ faster time-to-first-token) and achieves substantial end-to-end speedups ($>$2.2$\times$) in some workflows dominated by redundant computation.

[42] TabiBERT: A Large-Scale ModernBERT Foundation Model and Unified Benchmarking Framework for Turkish

Melikşah Türker, A. Ebrar Kızıloğlu, Onur Güngör, Susan Üsküdarlı

Main category: cs.CL

TL;DR: TabiBERT is a new monolingual Turkish encoder based on ModernBERT architecture, trained from scratch on 1 trillion tokens from a multi-domain corpus, achieving state-of-the-art performance on Turkish NLP tasks.

Details

Motivation: Turkish NLP lacks a monolingual encoder trained from scratch with modern architectural improvements like RoPE, FlashAttention, and refined normalization that have evolved since BERT's inception.

Method: Developed TabiBERT using ModernBERT architecture with Rotary Positional Embeddings, FlashAttention, and refined normalization. Trained from scratch on 1 trillion tokens from a curated multi-domain corpus (73% web text, 20% scientific publications, 6% source code, 0.3% mathematical content). Created TabiBench with 28 datasets across 8 task categories for evaluation.

Result: TabiBERT achieves 77.58 on TabiBench, outperforming BERTurk by 1.62 points. Supports 8,192-token context length (16x original BERT), achieves up to 2.65x inference speedup, reduces GPU memory consumption. Establishes SOTA on 5 of 8 categories: question answering (+9.55), code retrieval (+2.41), document retrieval (+0.60). Achieves +1.47 average improvement over prior best results.

Conclusion: TabiBERT successfully addresses the gap in Turkish NLP by providing a modern monolingual encoder with improved efficiency, longer context, and superior performance across diverse domains, demonstrating robust cross-domain generalization.

Abstract: Since the inception of BERT, encoder-only Transformers have evolved significantly in computational efficiency, training stability, and long-context modeling. ModernBERT consolidates these advances by integrating Rotary Positional Embeddings (RoPE), FlashAttention, and refined normalization. Despite these developments, Turkish NLP lacks a monolingual encoder trained from scratch incorporating such modern architectural paradigms. This work introduces TabiBERT, a monolingual Turkish encoder based on ModernBERT architecture trained from scratch on a large, curated corpus. TabiBERT is pre-trained on one trillion tokens sampled from an 84.88B token multi-domain corpus: web text (73%), scientific publications (20%), source code (6%), and mathematical content (0.3%). The model supports 8,192-token context length (16x original BERT), achieves up to 2.65x inference speedup, and reduces GPU memory consumption, enabling larger batch sizes. We introduce TabiBench with 28 datasets across eight task categories with standardized splits and protocols, evaluated using GLUE-style macro-averaging. TabiBERT attains 77.58 on TabiBench, outperforming BERTurk by 1.62 points and establishing state-of-the-art on five of eight categories: question answering (+9.55), code retrieval (+2.41), and document retrieval (+0.60). Compared with task-specific prior best results, including specialized models like TurkishBERTweet, TabiBERT achieves +1.47 average improvement, indicating robust cross-domain generalization. We release model weights, training configurations, and evaluation code for transparent, reproducible Turkish encoder research.

[43] Reservoir Computing inspired Matrix Multiplication-free Language Model

Takumi Shiratsuchi, Yuichiro Tanaka, Hakaru Tamukoh

Main category: cs.CL

TL;DR: Proposes a matrix multiplication-free language model with reservoir computing layers to reduce computational costs while maintaining performance.

Details

Motivation: Large language models have high computational costs that limit their practical deployment, so the paper aims to improve computational efficiency through architecture modifications.

Method: Uses a MatMul-free language model with partially fixed/shared weights, inserts reservoir computing layers for rich dynamic representations, and combines operations to reduce memory accesses.

Result: Reduces parameters by up to 19%, training time by 9.9%, inference time by 8.0% while maintaining comparable performance to baseline.

Conclusion: The proposed architecture successfully reduces computational costs of LLMs through MatMul-free design and reservoir computing techniques without sacrificing performance.

Abstract: Large language models (LLMs) have achieved state-of-the-art performance in natural language processing; however, their high computational cost remains a major bottleneck. In this study, we target computational efficiency by focusing on a matrix multiplication free language model (MatMul-free LM) and further reducing the training cost through an architecture inspired by reservoir computing. Specifically, we partially fix and share the weights of selected layers in the MatMul-free LM and insert reservoir layers to obtain rich dynamic representations without additional training overhead. Additionally, several operations are combined to reduce memory accesses. Experimental results show that the proposed architecture reduces the number of parameters by up to 19%, training time by 9.9%, and inference time by 8.0%, while maintaining comparable performance to the baseline model.

[44] Not too long do read: Evaluating LLM-generated extreme scientific summaries

Zhuoqi Lyu, Qing Ke

Main category: cs.CL

TL;DR: Researchers create BiomedTLDR dataset of human-written scientific summaries to evaluate LLM summarization, finding LLMs tend to be more extractive than abstractive compared to humans.

Details

Motivation: There's a lack of comprehensive, high-quality scientific TLDR datasets to develop and evaluate LLMs' summarization abilities, and we need to understand how LLM-generated summaries differ from human expert summaries.

Method: Propose BiomedTLDR dataset containing researcher-authored summaries from scientific papers (using authors’ comments in bibliographies), then test popular open-weight LLMs for generating TLDRs based on abstracts.

Result: While some LLMs successfully produce humanoid summaries, they generally exhibit greater affinity for original text’s lexical choices and rhetorical structures, making them more extractive rather than abstractive compared to humans.

Conclusion: The BiomedTLDR dataset enables better evaluation of LLM summarization capabilities, revealing key differences between LLM and human summarization approaches that need addressing.

Abstract: High-quality scientific extreme summary (TLDR) facilitates effective science communication. How do large language models (LLMs) perform in generating them? How are LLM-generated summaries different from those written by human experts? However, the lack of a comprehensive, high-quality scientific TLDR dataset hinders both the development and evaluation of LLMs’ summarization ability. To address these, we propose a novel dataset, BiomedTLDR, containing a large sample of researcher-authored summaries from scientific papers, which leverages the common practice of including authors’ comments alongside bibliography items. We then test popular open-weight LLMs for generating TLDRs based on abstracts. Our analysis reveals that, although some of them successfully produce humanoid summaries, LLMs generally exhibit a greater affinity for the original text’s lexical choices and rhetorical structures, hence tend to be more extractive rather than abstractive in general, compared to humans. Our code and datasets are available at https://github.com/netknowledge/LLM_summarization (Lyu and Ke, 2025).

[45] Scoring, Reasoning, and Selecting the Best! Ensembling Large Language Models via a Peer-Review Process

Zhijun Chen, Zeyu Ji, Qianren Mao, Junhang Cheng, Bangjie Qin, Hao Wu, Zhuoran Li, Jingzheng Li, Kai Sun, Zizhe Wang, Yikun Ban, Zhu Sun, Xiangyang Ji, Hailong Sun

Main category: cs.CL

TL;DR: LLM-PeerReview is an unsupervised ensemble method that selects the best response from multiple LLM-generated candidates using a peer-review-inspired framework with LLM-as-a-Judge scoring and aggregation.

Details

Motivation: To harness the collective wisdom of multiple LLMs with diverse strengths by creating an interpretable, unsupervised ensemble method that can select the most ideal response from multiple candidates without requiring labeled training data.

Method: A three-stage peer-review-inspired framework: 1) Scoring: Use LLM-as-a-Judge technique where multiple LLMs evaluate each response; 2) Reasoning: Aggregate scores using either graphical model-based truth inference or simple averaging; 3) Selection: Choose the highest-scoring response as the ensemble output.

Result: Two variants of LLM-PeerReview achieve strong results across four datasets, outperforming the recent advanced model Smoothie-Global by 6.9% and 7.3% points respectively.

Conclusion: LLM-PeerReview provides a conceptually simple yet empirically powerful unsupervised ensemble method that leverages multiple LLMs’ collective judgment to select optimal responses, offering interpretability and strong performance gains over existing approaches.

Abstract: We propose LLM-PeerReview, an unsupervised LLM Ensemble method that selects the most ideal response from multiple LLM-generated candidates for each query, harnessing the collective wisdom of multiple models with diverse strengths. LLM-PeerReview is built on a novel, peer-review-inspired framework that offers a clear and interpretable mechanism, while remaining fully unsupervised for flexible adaptability and generalization. Specifically, it operates in three stages: For scoring, we use the emerging LLM-as-a-Judge technique to evaluate each response by reusing multiple LLMs at hand; For reasoning, we can apply a principled graphical model-based truth inference algorithm or a straightforward averaging strategy to aggregate multiple scores to produce a final score for each response; Finally, the highest-scoring response is selected as the best ensemble output. LLM-PeerReview is conceptually simple and empirically powerful. The two variants of the proposed approach obtain strong results across four datasets, including outperforming the recent advanced model Smoothie-Global by 6.9% and 7.3% points, respectively.

[46] Anka: A Domain-Specific Language for Reliable LLM Code Generation

Saif Khalfan Saif Al Mazrouei

Main category: cs.CL

TL;DR: LLMs struggle with complex programming tasks in general-purpose languages due to flexibility/ambiguity. A purpose-designed DSL (Anka) enables near-perfect code generation despite zero prior training, outperforming Python significantly on multi-step tasks.

Details

Motivation: LLMs show systematic errors on complex, multi-step programming tasks despite strong code generation capabilities. The authors hypothesize that the flexibility of general-purpose languages (multiple valid approaches, implicit state management) contributes to these errors.

Method: Introduced Anka, a domain-specific language for data transformation pipelines with explicit, constrained syntax to reduce ambiguity. Evaluated LLMs (Claude 3.5 Haiku, GPT-4o-mini) on 100 benchmark problems with zero prior training exposure to Anka, comparing performance against Python.

Result: Claude 3.5 Haiku achieved 99.9% parse success and 95.8% overall task accuracy on Anka. Anka showed 40 percentage point advantage over Python on multi-step pipeline tasks (100% vs. 60%). Cross-validation with GPT-4o-mini confirmed similar advantage (+26.7 percentage points).

Conclusion: LLMs can learn novel DSLs entirely from in-context prompts with near-native accuracy. Constrained syntax significantly reduces errors on complex tasks. Purpose-designed DSLs for LLM generation can outperform general-purpose languages despite extensive training on the latter.

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation, yet they exhibit systematic errors on complex, multi-step programming tasks. We hypothesize that these errors stem from the flexibility of general-purpose languages, which permits multiple valid approaches and requires implicit state management. To test this hypothesis, we introduce Anka, a domain-specific language (DSL) for data transformation pipelines designed with explicit, constrained syntax that reduces ambiguity in code generation. Despite having zero prior training exposure to Anka, Claude 3.5 Haiku achieves 99.9% parse success and 95.8% overall task accuracy across 100 benchmark problems. Critically, Anka demonstrates a 40 percentage point accuracy advantage over Python on multi-step pipeline tasks (100% vs. 60%), where Python’s flexible syntax leads to frequent errors in operation sequencing and variable management. Cross-model validation with GPT-4o-mini confirms this advantage (+26.7 percentage points on multi-step tasks). Our results demonstrate that: (1) LLMs can learn novel DSLs entirely from in-context prompts, achieving near-native accuracy; (2) constrained syntax significantly reduces errors on complex tasks; and (3) domain-specific languages purposefully designed for LLM generation can outperform general-purpose languages on which the LLM has extensive training. We release the complete language implementation, benchmark suite, and evaluation framework to facilitate further research.

[47] Interpretable Safety Alignment via SAE-Constructed Low-Rank Subspace Adaptation

Dianyun Wang, Qingsen Ma, Yuhu Shang, Zhifeng Lu, Lechen Ning, Zhenbo Xu, Huijia Wu, Zhaofeng He

Main category: cs.CL

TL;DR: SAE-guided LoRA initialization uses sparse autoencoders to identify task-relevant features in disentangled space, enabling interpretable and high-performance parameter-efficient fine-tuning with theoretical guarantees.

Details

Motivation: Current LoRA methods learn low-rank subspaces implicitly without interpretability or control. The authors hypothesize this stems from polysemanticity (entangled concepts) in neural representations, and aim to create explicit, interpretable subspaces using mechanistic interpretability.

Method: Leverage pre-trained Sparse Autoencoders (SAEs) to identify task-relevant features in a disentangled feature space, then construct explicit low-rank subspaces to guide adapter initialization. Provide theoretical analysis showing SAE-based identification achieves arbitrarily small error under monosemanticity assumptions.

Result: Achieves up to 99.6% safety rate on safety alignment tasks, exceeding full fine-tuning by 7.4 percentage points and approaching RLHF-based methods, while updating only 0.19-0.24% of parameters. Provides interpretable insights into learned alignment subspaces through SAE feature semantics.

Conclusion: Incorporating mechanistic interpretability into fine-tuning can simultaneously improve both performance and transparency, demonstrating that explicit feature identification via SAEs enables better control and understanding of adaptation processes.

Abstract: Parameter-efficient fine-tuning has become the dominant paradigm for adapting large language models to downstream tasks. Low-rank adaptation methods such as LoRA operate under the assumption that task-relevant weight updates reside in a low-rank subspace, yet this subspace is learned implicitly from data in a black-box manner, offering no interpretability or direct control. We hypothesize that this difficulty stems from polysemanticity–individual dimensions encoding multiple entangled concepts. To address this, we leverage pre-trained Sparse Autoencoders (SAEs) to identify task-relevant features in a disentangled feature space, then construct an explicit, interpretable low-rank subspace to guide adapter initialization. We provide theoretical analysis proving that under monosemanticity assumptions, SAE-based subspace identification achieves arbitrarily small recovery error, while direct identification in polysemantic space suffers an irreducible error floor. On safety alignment, our method achieves up to 99.6% safety rate–exceeding full fine-tuning by 7.4 percentage points and approaching RLHF-based methods–while updating only 0.19-0.24% of parameters. Crucially, our method provides interpretable insights into the learned alignment subspace through the semantic grounding of SAE features. Our work demonstrates that incorporating mechanistic interpretability into the fine-tuning process can simultaneously improve both performance and transparency.

[48] Chinese Morph Resolution in E-commerce Live Streaming Scenarios

Jiahao Zhu, Jipeng Qiang, Ran Bai, Chenyu Liu, Xiaoye Ouyang

Main category: cs.CL

TL;DR: Researchers introduce LiveAMR task to detect pronunciation-based evasion in Chinese e-commerce live streams, create first dataset of 86,790 samples, and use LLM-augmented text-to-text generation for morph resolution.

Details

Motivation: E-commerce live streaming hosts in China use pronunciation morphs to evade scrutiny and engage in false advertising, particularly in health/medical streams, creating a need for automated detection systems.

Method: Transform LiveAMR task into text-to-text generation problem, construct first dataset with 86,790 samples, leverage LLMs to generate additional training data for improved performance.

Result: Developed effective morph resolution system that significantly enhances live streaming regulation capabilities, demonstrating the value of LLM-augmented approaches for this novel task.

Conclusion: LiveAMR addresses a critical gap in e-commerce regulation by detecting pronunciation-based evasion, with the proposed method showing promise for improving oversight of deceptive live streaming practices.

Abstract: E-commerce live streaming in China, particularly on platforms like Douyin, has become a major sales channel, but hosts often use morphs to evade scrutiny and engage in false advertising. This study introduces the Live Auditory Morph Resolution (LiveAMR) task to detect such violations. Unlike previous morph research focused on text-based evasion in social media and underground industries, LiveAMR targets pronunciation-based evasion in health and medical live streams. We constructed the first LiveAMR dataset with 86,790 samples and developed a method to transform the task into a text-to-text generation problem. By leveraging large language models (LLMs) to generate additional training data, we improved performance and demonstrated that morph resolution significantly enhances live streaming regulation.

[49] AI4Reading: Chinese Audiobook Interpretation System Based on Multi-Agent Collaboration

Minjiang Huang, Jipeng Qiang, Yi Zhu, Chaowei Zhang, Xiangyu Zhao, Kui Yu

Main category: cs.CL

TL;DR: AI4Reading is a multi-agent LLM system that automatically generates podcast-style audiobook interpretations using 11 specialized agents to analyze topics, extract cases, refine content, and synthesize speech.

Details

Motivation: Manual creation of audiobook interpretations is time-consuming and resource-intensive, despite their growing popularity for providing accessible, in-depth book analyses with practical insights.

Method: Multi-agent collaboration system with 11 specialized agents (topic analysts, case analysts, editors, narrator, proofreaders) using LLMs and speech synthesis to ensure accurate content preservation, enhanced comprehensibility, and logical narrative structure.

Result: Compared to expert interpretations, AI4Reading generates simpler and more accurate interpretative scripts, though there’s still a gap in speech generation quality compared to human narration.

Conclusion: AI4Reading demonstrates promising automated generation of audiobook interpretations with accurate and simplified content, though speech synthesis quality needs further improvement to match human standards.

Abstract: Audiobook interpretations are attracting increasing attention, as they provide accessible and in-depth analyses of books that offer readers practical insights and intellectual inspiration. However, their manual creation process remains time-consuming and resource-intensive. To address this challenge, we propose AI4Reading, a multi-agent collaboration system leveraging large language models (LLMs) and speech synthesis technology to generate podcast, like audiobook interpretations. The system is designed to meet three key objectives: accurate content preservation, enhanced comprehensibility, and a logical narrative structure. To achieve these goals, we develop a framework composed of 11 specialized agents,including topic analysts, case analysts, editors, a narrator, and proofreaders that work in concert to explore themes, extract real world cases, refine content organization, and synthesize natural spoken language. By comparing expert interpretations with our system’s output, the results show that although AI4Reading still has a gap in speech generation quality, the generated interpretative scripts are simpler and more accurate.

[50] AI Meets Brain: Memory Systems from Cognitive Neuroscience to Autonomous Agents

Jiafeng Liang, Hao Li, Chang Li, Jiaqi Zhou, Shixin Jiang, Zekun Wang, Changkai Ji, Zhihao Zhu, Runxuan Liu, Tao Ren, Jinlan Fu, See-Kiong Ng, Xia Liang, Ming Liu, Bing Qin

Main category: cs.CL

TL;DR: This paper provides a comprehensive interdisciplinary review connecting cognitive neuroscience memory mechanisms with LLM-driven autonomous agents, covering memory definitions, taxonomies, storage, management lifecycle, benchmarks, security, and future directions.

Details

Motivation: The motivation is to bridge the interdisciplinary gap between cognitive neuroscience and AI agent research. Existing autonomous agent memory systems struggle to assimilate the essence of human memory mechanisms due to disciplinary barriers, limiting their effectiveness in complex tasks.

Method: The paper systematically synthesizes interdisciplinary knowledge by: 1) elucidating memory definitions and functions across cognitive neuroscience, LLMs, and agents; 2) providing comparative analysis of memory taxonomy, storage mechanisms, and management lifecycle from biological and artificial perspectives; 3) reviewing mainstream benchmarks for evaluating agent memory; 4) exploring memory security from attack and defense perspectives.

Result: The paper results in a comprehensive interdisciplinary framework that connects cognitive neuroscience insights with LLM-driven agents, providing comparative analyses of memory systems, evaluation benchmarks, and security considerations.

Conclusion: The conclusion emphasizes the importance of bridging cognitive neuroscience and AI agent research for developing more effective memory systems, and envisions future research directions focusing on multimodal memory systems and skill acquisition.

Abstract: Memory serves as the pivotal nexus bridging past and future, providing both humans and AI systems with invaluable concepts and experience to navigate complex tasks. Recent research on autonomous agents has increasingly focused on designing efficient memory workflows by drawing on cognitive neuroscience. However, constrained by interdisciplinary barriers, existing works struggle to assimilate the essence of human memory mechanisms. To bridge this gap, we systematically synthesizes interdisciplinary knowledge of memory, connecting insights from cognitive neuroscience with LLM-driven agents. Specifically, we first elucidate the definition and function of memory along a progressive trajectory from cognitive neuroscience through LLMs to agents. We then provide a comparative analysis of memory taxonomy, storage mechanisms, and the complete management lifecycle from both biological and artificial perspectives. Subsequently, we review the mainstream benchmarks for evaluating agent memory. Additionally, we explore memory security from dual perspectives of attack and defense. Finally, we envision future research directions, with a focus on multimodal memory systems and skill acquisition.

[51] A Stepwise-Enhanced Reasoning Framework for Large Language Models Based on External Subgraph Generation

Xin Zhang, Yang Cao, Baoxing Wu, Xinyi Chen, Kai Song, Siying Li

Main category: cs.CL

TL;DR: SGR: A stepwise reasoning enhancement framework for LLMs that uses external subgraph generation to improve logical inference by reducing noise and guiding structured reasoning.

Details

Motivation: LLMs struggle with deep reasoning tasks due to incorporation of noisy/irrelevant information from training data, leading to incorrect predictions and factual inconsistencies.

Method: Proposes SGR framework that: 1) dynamically constructs query-relevant subgraphs from external knowledge bases, 2) guides step-by-step reasoning over structured subgraphs, 3) integrates multiple reasoning paths for final answer.

Result: Experimental results on multiple benchmark datasets show SGR consistently outperforms strong baselines, demonstrating effectiveness in enhancing LLM reasoning capabilities.

Conclusion: SGR successfully addresses LLM reasoning limitations by leveraging external structured knowledge through subgraph-guided stepwise reasoning, improving accuracy and reducing noise influence.

Abstract: Large Language Models (LLMs) have achieved strong performance across a wide range of natural language processing tasks in recent years, including machine translation, text generation, and question answering. As their applications extend to increasingly complex scenarios, however, LLMs continue to face challenges in tasks that require deep reasoning and logical inference. In particular, models trained on large scale textual corpora may incorporate noisy or irrelevant information during generation, which can lead to incorrect predictions or outputs that are inconsistent with factual knowledge. To address this limitation, we propose a stepwise reasoning enhancement framework for LLMs based on external subgraph generation, termed SGR. The proposed framework dynamically constructs query relevant subgraphs from external knowledge bases and leverages their semantic structure to guide the reasoning process. By performing reasoning in a step by step manner over structured subgraphs, SGR reduces the influence of noisy information and improves reasoning accuracy. Specifically, the framework first generates an external subgraph tailored to the input query, then guides the model to conduct multi step reasoning grounded in the subgraph, and finally integrates multiple reasoning paths to produce the final answer. Experimental results on multiple benchmark datasets demonstrate that SGR consistently outperforms strong baselines, indicating its effectiveness in enhancing the reasoning capabilities of LLMs.

[52] Entropy-Guided Token Dropout: Training Autoregressive Language Models with Limited Domain Data

Jiapeng Wang, Yiwen Hu, Yanzipeng Gao, Haoyu Wang, Shuo Wang, Hongyu Lu, Jiaxin Mao, Wayne Xin Zhao, Junyi Li, Xiao Zhang

Main category: cs.CL

TL;DR: EntroDrop is an entropy-guided token dropout method that prevents performance degradation in LLMs during multi-epoch training by selectively masking low-entropy tokens and using a curriculum schedule to adjust regularization strength.

Details

Motivation: As high-quality domain-specific data becomes scarce, multi-epoch training is necessary for adapting LLMs, but autoregressive models suffer performance degradation from overfitting when repeatedly exposed to the same data. This degradation stems from an imbalance where low-entropy tokens dominate learning while high-entropy token generalization deteriorates.

Method: EntroDrop is an entropy-guided token dropout method that functions as structured data regularization. It selectively masks low-entropy tokens during training and employs a curriculum schedule to adjust regularization strength based on training progress.

Result: Experiments across model scales from 0.6B to 8B parameters show that EntroDrop consistently outperforms standard regularization baselines and maintains robust performance throughout extended multi-epoch training.

Conclusion: The findings highlight the importance of aligning regularization with token-level learning dynamics when training on limited data. EntroDrop offers a promising pathway for more effective adaptation of LLMs in data-constrained domains.

Abstract: As access to high-quality, domain-specific data grows increasingly scarce, multi-epoch training has become a practical strategy for adapting large language models (LLMs). However, autoregressive models often suffer from performance degradation under repeated data exposure, where overfitting leads to a marked decline in model capability. Through empirical analysis, we trace this degradation to an imbalance in learning dynamics: predictable, low-entropy tokens are learned quickly and come to dominate optimization, while the model’s ability to generalize on high-entropy tokens deteriorates with continued training. To address this, we introduce EntroDrop, an entropy-guided token dropout method that functions as structured data regularization. EntroDrop selectively masks low-entropy tokens during training and employs a curriculum schedule to adjust regularization strength in alignment with training progress. Experiments across model scales from 0.6B to 8B parameters show that EntroDrop consistently outperforms standard regularization baselines and maintains robust performance throughout extended multi-epoch training. These findings underscore the importance of aligning regularization with token-level learning dynamics when training on limited data. Our approach offers a promising pathway toward more effective adaptation of LLMs in data-constrained domains.

[53] The Effect of Gender Diversity on Scientific Team Impact: A Team Roles Perspective

Yi Zhao, Yongjun Zhu, Donghun Kim, Yuzhuo Wang, Heng Zhang, Chao Lu, Chengzhi Zhang

Main category: cs.CL

TL;DR: Gender diversity in scientific teams shows complex effects: inverted U-shaped relationships for both leadership and support roles, with all-female leadership plus all-male support performing best, and team size moderating leadership diversity effects.

Details

Motivation: Prior research on gender diversity in scientific teams has inconsistent findings and overlooks internal role differentiation, limiting understanding of how gender diversity across different team roles affects team impact.

Method: Analyzed 130,000+ PLOS papers using author contribution statements to classify members into leadership and support roles. Used multivariable regression to examine gender diversity’s association with team impact (5-year citations), plus threshold regression to study team size moderation.

Result: 1) Gender diversity shows inverted U-shape relationship with impact for both leadership and support groups. 2) Teams with all-female leadership and all-male support achieve highest impact. 3) Leadership diversity has negative effect for small teams but becomes positive/insignificant for large teams, while support diversity remains positive regardless of size.

Conclusion: Gender diversity effects are nuanced and role-dependent. Optimal team composition varies by role, with all-female leadership combined with all-male support performing best, and team size significantly moderates leadership diversity effects.

Abstract: The influence of gender diversity on the success of scientific teams is of great interest to academia. However, prior findings remain inconsistent, and most studies operationalize diversity in aggregate terms, overlooking internal role differentiation. This limitation obscures a more nuanced understanding of how gender diversity shapes team impact. In particular, the effect of gender diversity across different team roles remains poorly understood. To this end, we define a scientific team as all coauthors of a paper and measure team impact through five-year citation counts. Using author contribution statements, we classified members into leadership and support roles. Drawing on more than 130,000 papers from PLOS journals, most of which are in biomedical-related disciplines, we employed multivariable regression to examine the association between gender diversity in these roles and team impact. Furthermore, we apply a threshold regression model to investigate how team size moderates this relationship. The results show that (1) the relationship between gender diversity and team impact follows an inverted U-shape for both leadership and support groups; (2) teams with an all-female leadership group and an all-male support group achieve higher impact than other team types. Interestingly, (3) the effect of leadership-group gender diversity is significantly negative for small teams but becomes positive and statistically insignificant in large teams. In contrast, the estimates for support-group gender diversity remain significant and positive, regardless of team size.

[54] C2PO: Diagnosing and Disentangling Bias Shortcuts in LLMs

Xuan Feng, Bo An, Tianlong Gu, Liang Chang, Fengrui Hao, Peipeng Yu, Shuai Zhao

Main category: cs.CL

TL;DR: C2PO is a unified alignment framework that simultaneously mitigates both stereotypical and structural biases in LLMs by discovering and suppressing spurious feature correlations through causal counterfactual signals and fairness-sensitive preference optimization.

Details

Motivation: Current bias mitigation approaches typically address stereotypical biases (gender/racial stereotypes) and structural biases (lexical overlap/position preferences) in isolation, often exacerbating one while fixing the other. There's a need for a unified framework that tackles both simultaneously.

Method: Causal-Contrastive Preference Optimization (C2PO) uses causal counterfactual signals to isolate bias-inducing features from valid reasoning paths, and employs a fairness-sensitive preference update mechanism to dynamically evaluate logit-level contributions and suppress shortcut features.

Result: Extensive experiments across multiple benchmarks (BBQ, Unqover, MNLI, HANS, Chatbot, MT-Bench, StereoSet, WinoBias, MMLU, GSM8K) show C2PO effectively mitigates both stereotypical and structural biases while preserving general reasoning capabilities.

Conclusion: C2PO provides a unified solution to address both stereotypical and structural biases in LLMs by targeting the root cause - spurious feature correlations - through causal reasoning and preference optimization, maintaining model utility while improving fairness.

Abstract: Bias in Large Language Models (LLMs) poses significant risks to trustworthiness, manifesting primarily as stereotypical biases (e.g., gender or racial stereotypes) and structural biases (e.g., lexical overlap or position preferences). However, prior paradigms typically address these in isolation, often mitigating one at the expense of exacerbating the other. To address this, we conduct a systematic exploration of these reasoning failures and identify a primary inducement: the latent spurious feature correlations within the input that drive these erroneous reasoning shortcuts. Driven by these findings, we introduce Causal-Contrastive Preference Optimization (C2PO), a unified alignment framework designed to tackle these specific failures by simultaneously discovering and suppressing these correlations directly within the optimization process. Specifically, C2PO leverages causal counterfactual signals to isolate bias-inducing features from valid reasoning paths, and employs a fairness-sensitive preference update mechanism to dynamically evaluate logit-level contributions and suppress shortcut features. Extensive experiments across multiple benchmarks covering stereotypical bias (BBQ, Unqover), structural bias (MNLI, HANS, Chatbot, MT-Bench), out-of-domain fairness (StereoSet, WinoBias), and general utility (MMLU, GSM8K) demonstrate that C2PO effectively mitigates stereotypical and structural biases while preserving robust general reasoning capabilities.

[55] ClinDEF: A Dynamic Evaluation Framework for Large Language Models in Clinical Reasoning

Yuqi Tang, Jing Yu, Zichang Su, Kehua Feng, Zhihui Zhu, Libin Wang, Lei Liang, Qiang Zhang, Keyan Ding, Huajun Chen

Main category: cs.CL

TL;DR: ClinDEF is a dynamic framework for evaluating LLMs’ clinical reasoning through simulated diagnostic dialogues, going beyond static QA to assess diagnostic quality and efficiency.

Details

Motivation: Existing LLM benchmarks focus on static question-answering and poorly represent the dynamic clinical reasoning process of real doctor-patient interactions. Current dynamic medical frameworks rely on limited, contamination-prone datasets and lack granular, multi-level evaluation.

Method: ClinDEF uses a disease knowledge graph to dynamically generate patient cases and facilitates multi-turn interactions between an LLM-based doctor and an automated patient agent. It includes fine-grained efficiency analysis and rubric-based assessment of diagnostic quality.

Result: Experiments show that ClinDEF effectively exposes critical clinical reasoning gaps in state-of-the-art LLMs, offering a more nuanced and clinically meaningful evaluation paradigm.

Conclusion: ClinDEF provides a comprehensive framework for assessing clinical reasoning in LLMs through dynamic diagnostic dialogues, addressing limitations of existing benchmarks and offering more clinically relevant evaluation.

Abstract: Clinical diagnosis begins with doctor-patient interaction, during which physicians iteratively gather information, determine examination and refine differential diagnosis through patients’ response. This dynamic clinical-reasoning process is poorly represented by existing LLM benchmarks that focus on static question-answering. To mitigate these gaps, recent methods explore dynamic medical frameworks involving interactive clinical dialogues. Although effective, they often rely on limited, contamination-prone datasets and lack granular, multi-level evaluation. In this work, we propose ClinDEF, a dynamic framework for assessing clinical reasoning in LLMs through simulated diagnostic dialogues. Grounded in a disease knowledge graph, our method dynamically generates patient cases and facilitates multi-turn interactions between an LLM-based doctor and an automated patient agent. Our evaluation protocol goes beyond diagnostic accuracy by incorporating fine-grained efficiency analysis and rubric-based assessment of diagnostic quality. Experiments show that ClinDEF effectively exposes critical clinical reasoning gaps in state-of-the-art LLMs, offering a more nuanced and clinically meaningful evaluation paradigm.

[56] Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss

Ang Lv, Jin Ma, Yiyuan Ma, Siyuan Qiao

Main category: cs.CL

TL;DR: Proposes ERC loss to align router decisions with expert capabilities in MoE models through lightweight constraints on proxy token activations.

Details

Motivation: Current MoE models lack explicit constraints to ensure router decisions align with expert capabilities, limiting model performance.

Method: Introduces expert-router coupling (ERC) loss that treats router embeddings as proxy tokens, feeds perturbed embeddings through experts, and enforces two constraints: 1) each expert shows higher activation for its own proxy token than others’, 2) each proxy token elicits stronger activation from its corresponding expert than others.

Result: Demonstrates effectiveness through pre-training MoE-LLMs from 3B to 15B parameters on trillions of tokens; offers flexible control and quantitative tracking of expert specialization.

Conclusion: ERC loss provides computationally efficient (O(n²) cost) solution to align router decisions with expert capabilities, enabling better MoE performance and insights into expert specialization.

Abstract: Mixture-of-Experts (MoE) models lack explicit constraints to ensure the router’s decisions align well with the experts’ capabilities, which ultimately limits model performance. To address this, we propose expert-router coupling (ERC) loss, a lightweight auxiliary loss that tightly couples the router’s decisions with expert capabilities. Our approach treats each expert’s router embedding as a proxy token for the tokens assigned to that expert, and feeds perturbed router embeddings through the experts to obtain internal activations. The ERC loss enforces two constraints on these activations: (1) Each expert must exhibit higher activation for its own proxy token than for the proxy tokens of any other expert. (2) Each proxy token must elicit stronger activation from its corresponding expert than from any other expert. These constraints jointly ensure that each router embedding faithfully represents its corresponding expert’s capability, while each expert specializes in processing the tokens actually routed to it. The ERC loss is computationally efficient, operating only on n^2 activations, where n is the number of experts. This represents a fixed cost independent of batch size, unlike prior coupling methods that scale with the number of tokens (often millions per batch). Through pre-training MoE-LLMs ranging from 3B to 15B parameters and extensive analysis on trillions of tokens, we demonstrate the effectiveness of the ERC loss. Moreover, the ERC loss offers flexible control and quantitative tracking of expert specialization levels during training, providing valuable insights into MoEs.

[57] Semantic Tree Inference on Text Corpa using a Nested Density Approach together with Large Language Model Embeddings

Thomas Haschka, Joseph Bakarji

Main category: cs.CL

TL;DR: The paper proposes a nested density clustering method to build hierarchical trees from LLM embeddings, revealing semantic relationships in text corpora without predefined categories.

Details

Motivation: While LLM embeddings are widely used for semantic similarity search, the global hierarchical structure and semantic relationships in text corpora often remain unclear and opaque.

Method: A nested density clustering approach that starts by identifying dense clusters in LLM embedding space, then gradually relaxes the density criterion to merge clusters hierarchically until forming a single root cluster, constructing a tree structure.

Result: The method successfully classifies scientific abstracts, discovers research areas and subfields without predefined categories, and demonstrates robustness on benchmark datasets like 20 Newsgroups and IMDB 50k Movie Reviews.

Conclusion: Nested density trees can effectively reveal semantic structure and evolution in textual datasets, with applications in scientometrics, topic evolution analysis, and data-driven discovery of hierarchical relationships.

Abstract: Semantic text classification has undergone significant advances in recent years due to the rise of large language models (LLMs) and their high dimensional embeddings. While LLM-embeddings are frequently used to store and retrieve text by semantic similarity in vector databases, the global structure semantic relationships in text corpora often remains opaque. Herein we propose a nested density clustering approach, to infer hierarchical trees of semantically related texts. The method starts by identifying texts of strong semantic similarity as it searches for dense clusters in LLM embedding space. As the density criterion is gradually relaxed, these dense clusters merge into more diffuse clusters, until the whole dataset is represented by a single cluster – the root of the tree. By embedding dense clusters into increasingly diffuse ones, we construct a tree structure that captures hierarchical semantic relationships among texts. We outline how this approach can be used to classify textual data for abstracts of scientific abstracts as a case study. This enables the data-driven discovery research areas and their subfields without predefined categories. To evaluate the general applicability of the method, we further apply it to established benchmark datasets such as the 20 Newsgroups and IMDB 50k Movie Reviews, demonstrating its robustness across domains. Finally we discuss possible applications on scientometrics, topic evolution, highlighting how nested density trees can reveal semantic structure and evolution in textual datasets.

[58] Automatic Detection of Complex Quotation Patterns in Aggadic Literature

Hadar Miller, Tsvi Kuflik, Moshe Lavee

Main category: cs.CL

TL;DR: ACT is a three-stage algorithm for detecting biblical quotations in Rabbinic literature that outperforms existing systems by handling short, paraphrased, and structurally embedded citations through morphology-aware alignment and context-sensitive enrichment.

Details

Motivation: Existing text reuse frameworks struggle with detecting short, paraphrased, or structurally embedded biblical quotations in Rabbinic literature, creating a methodological gap between machine-based detection and human editorial judgment.

Method: ACT combines a morphology-aware alignment algorithm with a context-sensitive enrichment stage to identify complex citation patterns like “Wave” and “Echo” quotations. The three-stage approach includes different configurations (ACT-QE, ACT-2, ACT-3) to balance recall and precision.

Result: The full ACT pipeline (ACT-QE) achieves F1 score of 0.91 with superior recall (0.89) and precision (0.94), outperforming leading systems like Dicta, Passim, and Text-Matcher. Different configurations offer tradeoffs: ACT-2 has higher recall (0.90) but lower precision, while ACT-3 balances coverage and specificity.

Conclusion: ACT addresses the methodological gap in quotation detection for digital humanities and computational philology, with applications extending to genre classification and intertextual analysis, particularly valuable for morphologically rich and citation-dense traditions like Aggadic literature.

Abstract: This paper presents ACT (Allocate Connections between Texts), a novel three-stage algorithm for the automatic detection of biblical quotations in Rabbinic literature. Unlike existing text reuse frameworks that struggle with short, paraphrased, or structurally embedded quotations, ACT combines a morphology-aware alignment algorithm with a context-sensitive enrichment stage that identifies complex citation patterns such as “Wave” and “Echo” quotations. Our approach was evaluated against leading systems, including Dicta, Passim, Text-Matcher, as well as human-annotated critical editions. We further assessed three ACT configurations to isolate the contribution of each component. Results demonstrate that the full ACT pipeline (ACT-QE) outperforms all baselines, achieving an F1 score of 0.91, with superior Recall (0.89) and Precision (0.94). Notably, ACT-2, which lacks stylistic enrichment, achieves higher Recall (0.90) but suffers in Precision, while ACT-3, using longer n-grams, offers a tradeoff between coverage and specificity. In addition to improving quotation detection, ACT’s ability to classify stylistic patterns across corpora opens new avenues for genre classification and intertextual analysis. This work contributes to digital humanities and computational philology by addressing the methodological gap between exhaustive machine-based detection and human editorial judgment. ACT lays a foundation for broader applications in historical textual analysis, especially in morphologically rich and citation-dense traditions like Aggadic literature.

[59] UniHetero: Could Generation Enhance Understanding for Vision-Language-Model at Large Data Scale?

Fengjiao Chen, Minhao Jing, Weitao Lu, Yan Feng, Xiaoyu Li, Xuezhi Cao

Main category: cs.CL

TL;DR: UniHetero shows that semantic generation (not pixel generation) improves vision-language understanding, reveals superior data scaling trends, and uses autoregression on input embeddings to capture visual details.

Details

Motivation: To explore whether visual generation tasks can enhance visual understanding in unified vision-language models at large-scale pretraining (>200M samples), addressing the under-explored relationship between generation and understanding.

Method: Proposes UniHetero, a concise unified model structure, analyzed under large-scale pretraining. Uses autoregression on input embeddings to capture visual details, focusing on semantic generation rather than pixel generation.

Result: Three key findings: (1) Generation improves understanding only when generating semantics, not pixels; (2) Generation shows superior data scaling trends and higher data utilization; (3) Autoregression on input embeddings effectively captures visual details.

Conclusion: Semantic generation can enhance visual understanding in unified vision-language models, with better data scaling efficiency and effective detail capture through autoregressive input embedding modeling.

Abstract: Vision-language large models are moving toward the unification of visual understanding and visual generation tasks. However, whether generation can enhance understanding is still under-explored on large data scale. In this work, we analysis the unified model with a concise structure, UniHetero, under large-scale pretraining (>200M samples). Our key observations are: (1) Generation can improve understanding, but Only if you generate Semantics, Not Pixels. (2) Generation reveals a superior Data Scaling trend and higher Data Utilization. (3) Autoregression on Input Embedding is effective to capture visual details.

[60] Single LLM Debate, MoLaCE: Mixture of Latent Concept Experts Against Confirmation Bias

Hazel Kim, Philip Torr

Main category: cs.CL

TL;DR: MoLaCE is a lightweight inference-time framework that reduces LLM confirmation bias by mixing experts as different activation strengths over latent concepts, enabling single models to emulate multi-agent debate benefits internally.

Details

Motivation: LLMs suffer from input confirmation bias where they reinforce preferred answers in prompts rather than exploring alternatives. This is especially problematic in multi-agent debates where echo chambers amplify rather than correct bias.

Method: MoLaCE (Mixture of Latent Concept Experts) uses different activation strengths over latent concepts that shape model responses. It recognizes that differently phrased prompts reweight latent concepts in prompt-specific ways affecting factual correctness, so no single fixed intervention works universally.

Result: MoLaCE consistently reduces confirmation bias, improves robustness, and matches or surpasses multi-agent debate performance while requiring only a fraction of the computation. It can also be integrated into multi-agent frameworks to diversify perspectives.

Conclusion: MoLaCE provides an efficient, scalable solution to LLM confirmation bias by enabling single models to internally emulate debate benefits through latent concept manipulation, addressing a critical vulnerability in current LLM systems.

Abstract: Large language models (LLMs) are highly vulnerable to input confirmation bias. When a prompt implies a preferred answer, models often reinforce that bias rather than explore alternatives. This phenomenon remains underexplored, yet it is already harmful in base models and poses an even greater risk in multi-agent debate, where echo chambers reinforce bias instead of correction. We introduce Mixture of Latent Concept Experts (MoLaCE), a lightweight inference-time framework that addresses confirmation bias by mixing experts instantiated as different activation strengths over latent concepts that shape model responses. Our key insight is that, due to the compositional nature of language, differently phrased prompts reweight latent concepts in prompt-specific ways that affect factual correctness, so no single fixed intervention can be applied universally across inputs. This design enables a single LLM to emulate the benefits of debate internally while remaining computationally efficient and scalable. It can also be integrated into multi-agent debate frameworks to diversify perspectives and reduce correlated errors. We empirically show that it consistently reduces confirmation bias, improves robustness, and matches or surpasses multi-agent debate while requiring only a fraction of the computation.

[61] Lie to Me: Knowledge Graphs for Robust Hallucination Self-Detection in LLMs

Sahil Kale, Antonio Luca Alfeo

Main category: cs.CL

TL;DR: A simple yet effective method that converts LLM responses into knowledge graphs to improve hallucination self-detection, achieving up to 16% accuracy and 20% F1-score improvements over existing methods.

Details

Motivation: Hallucinations in LLMs remain a major barrier to safe deployment. While self-detection methods show promise, there's room for improvement by leveraging structured knowledge representations to better identify false statements.

Method: Proposes a two-step approach: (1) convert LLM responses into knowledge graphs of entities and relations, (2) use these graphs to estimate hallucination likelihood. The method is low-cost and model-agnostic.

Result: Evaluated on GPT-4o and Gemini-2.5-Flash across two hallucination datasets. Achieves up to 16% relative improvement in accuracy and 20% in F1-score compared to standard self-detection methods and state-of-the-art SelfCheckGPT.

Conclusion: LLMs can better analyze atomic facts when structured as knowledge graphs, even when initial outputs contain inaccuracies. This approach paves the way toward safer, more trustworthy language models and includes a manually curated enhanced dataset for future benchmarking.

Abstract: Hallucinations, the generation of apparently convincing yet false statements, remain a major barrier to the safe deployment of LLMs. Building on the strong performance of self-detection methods, we examine the use of structured knowledge representations, namely knowledge graphs, to improve hallucination self-detection. Specifically, we propose a simple yet powerful approach that enriches hallucination self-detection by (i) converting LLM responses into knowledge graphs of entities and relations, and (ii) using these graphs to estimate the likelihood that a response contains hallucinations. We evaluate the proposed approach using two widely used LLMs, GPT-4o and Gemini-2.5-Flash, across two hallucination detection datasets. To support more reliable future benchmarking, one of these datasets has been manually curated and enhanced and is released as a secondary outcome of this work. Compared to standard self-detection methods and SelfCheckGPT, a state-of-the-art approach, our method achieves up to 16% relative improvement in accuracy and 20% in F1-score. Our results show that LLMs can better analyse atomic facts when they are structured as knowledge graphs, even when initial outputs contain inaccuracies. This low-cost, model-agnostic approach paves the way toward safer and more trustworthy language models.

[62] Instruction-Following Evaluation of Large Vision-Language Models

Daiki Shiono, Shumpei Miyawaki, Ryota Tanaka, Jun Suzuki

Main category: cs.CL

TL;DR: LVLMs lose instruction-following ability after visual instruction tuning; including output format specifications in training data helps mitigate this decline.

Details

Motivation: Large Vision-Language Models (LVLMs) often fail to maintain the instruction-following capabilities of their underlying LLMs after visual instruction tuning, leading to poor task performance despite visual understanding.

Method: Constructed new training datasets highlighting output format specifications, then investigated how explicitly indicating output format during fine-tuning affects LVLMs’ instruction-following ability through quantitative evaluation.

Result: Quantitative evaluation confirmed that LVLMs’ instruction-following ability declines after fine-tuning with commonly used datasets. Models trained with datasets including output format instructions follow instructions more accurately than those without.

Conclusion: Including samples with instructions on output format during visual instruction tuning may help mitigate the decline in instruction-following abilities in LVLMs.

Abstract: Following the initial flourishing of large language models (LLMs), there has been a surge in proposed large vision-language models (LVLMs) that integrate LLMs with vision capabilities. However, it has been observed that LVLMs, after tuning to visual instruction using commonly used training datasets, often fail to exhibit the instruction-following ability that was present in the LLM before integration, leading to results in which they do not follow task instructions as expected. This study quantitatively demonstrates that LVLMs’ instruction-following ability declines after fine-tuning and analyzes its underlying causes. In particular, we constructed new training datasets highlighting whether the output format is specified. Then, we investigated how explicitly indicating the output format during fine-tuning affects LVLMs’ instruction-following ability. Our quantitative evaluation confirmed that LVLMs’ instruction-following ability declines after fine-tuning with commonly used datasets. Furthermore, we found that LVLMs trained with datasets, including instructions on output format, tend to follow instructions more accurately than models that do not. These findings suggest that including samples with instructions on output format during (visual) instruction tuning may help mitigate the decline in instruction-following abilities.

[63] Close the Loop: Synthesizing Infinite Tool-Use Data via Multi-Agent Role-Playing

Yuwen Li, Wei Zhang, Zelong Huang, Mason Yang, Jiajun Wu, Shawn Guo, Huahao Hu, Lingyi Sun, Jian Yang, Mingjie Tang, Byran Dai

Main category: cs.CL

TL;DR: InfTool is an autonomous multi-agent framework that generates synthetic tool-calling trajectories from API specs, enabling LLMs to learn tool invocation without human annotation through iterative self-improvement cycles.

Details

Motivation: Current approaches for enabling LLMs to invoke external tools face three key challenges: 1) expensive human annotation for high-quality training data, 2) poor generalization to unseen tools, and 3) quality ceilings from single-model synthesis that perpetuate biases and coverage gaps.

Method: InfTool uses a self-evolving multi-agent synthesis framework with three collaborative agents: User Simulator, Tool-Calling Assistant, and MCP Server. Given raw API specifications, it generates diverse, verified trajectories from single-turn calls to complex multi-step workflows. The framework establishes a closed loop where synthesized data trains the model via Group Relative Policy Optimization (GRPO) with gated rewards, and the improved model generates higher-quality data targeting capability gaps, iterating without human intervention.

Result: On the Berkeley Function-Calling Leaderboard (BFCL), InfTool transforms a base 32B model from 19.8% to 70.9% accuracy (+258% improvement). This performance surpasses models 10x larger and rivals Claude-Opus, achieved entirely from synthetic data without any human annotation.

Conclusion: InfTool demonstrates that fully autonomous, self-evolving multi-agent synthesis can overcome the fundamental challenges in tool-calling LLMs, enabling dramatic performance improvements without human annotation and achieving state-of-the-art results through iterative self-improvement cycles.

Abstract: Enabling Large Language Models (LLMs) to reliably invoke external tools remains a critical bottleneck for autonomous agents. Existing approaches suffer from three fundamental challenges: expensive human annotation for high-quality trajectories, poor generalization to unseen tools, and quality ceilings inherent in single-model synthesis that perpetuate biases and coverage gaps. We introduce InfTool, a fully autonomous framework that breaks these barriers through self-evolving multi-agent synthesis. Given only raw API specifications, InfTool orchestrates three collaborative agents (User Simulator, Tool-Calling Assistant, and MCP Server) to generate diverse, verified trajectories spanning single-turn calls to complex multi-step workflows. The framework establishes a closed loop: synthesized data trains the model via Group Relative Policy Optimization (GRPO) with gated rewards, the improved model generates higher-quality data targeting capability gaps, and this cycle iterates without human intervention. Experiments on the Berkeley Function-Calling Leaderboard (BFCL) demonstrate that InfTool transforms a base 32B model from 19.8% to 70.9% accuracy (+258%), surpassing models 10x larger and rivaling Claude-Opus, and entirely from synthetic data without human annotation.

[64] A Dataset and Benchmark for Consumer Healthcare Question Summarization

Abhishek Basu, Deepak Gupta, Dina Demner-Fushman, Shweta Yadav

Main category: cs.CL

TL;DR: CHQ-Sum dataset: 1507 domain-expert annotated consumer health questions with summaries to advance healthcare question summarization research.

Details

Motivation: Consumer health questions on the web are often overly descriptive and contain peripheral information, making natural language understanding challenging. There's a lack of domain-expert annotated datasets for healthcare question summarization, inhibiting development of effective summarization systems.

Method: Created CHQ-Sum dataset containing 1507 domain-expert annotated consumer health questions and corresponding summaries derived from community question answering forums. Benchmarked the dataset on multiple state-of-the-art summarization models.

Result: Introduced a new valuable resource for understanding consumer health-related posts on social media. Provided benchmarks showing the dataset’s effectiveness for training and evaluating summarization models.

Conclusion: CHQ-Sum addresses the critical gap in domain-expert annotated healthcare question summarization datasets, enabling development of more efficient summarization systems for consumer health information.

Abstract: The quest for seeking health information has swamped the web with consumers health-related questions. Generally, consumers use overly descriptive and peripheral information to express their medical condition or other healthcare needs, contributing to the challenges of natural language understanding. One way to address this challenge is to summarize the questions and distill the key information of the original question. Recently, large-scale datasets have significantly propelled the development of several summarization tasks, such as multi-document summarization and dialogue summarization. However, a lack of a domain-expert annotated dataset for the consumer healthcare questions summarization task inhibits the development of an efficient summarization system. To address this issue, we introduce a new dataset, CHQ-Sum,m that contains 1507 domain-expert annotated consumer health questions and corresponding summaries. The dataset is derived from the community question answering forum and therefore provides a valuable resource for understanding consumer health-related posts on social media. We benchmark the dataset on multiple state-of-the-art summarization models to show the effectiveness of the dataset

[65] Less is more: Probabilistic reduction is best explained by small-scale predictability measures

Cassandra L. Jacobs, Andrés Buxó-Lugo, Anna K. Taylor, Marie Leopold-Hooke

Main category: cs.CL

TL;DR: The paper examines how much linguistic context is needed to study relationships between language model probabilities and cognitive phenomena, finding that n-gram representations work as well as whole utterances for observing probabilistic reduction effects.

Details

Motivation: To determine the appropriate level of linguistic context needed when investigating connections between language model probability patterns and cognitive processing phenomena like planning and reduction effects.

Method: Investigates whether whole utterances are necessary by comparing them with n-gram representations as cognitive units of planning, testing different context sizes for observing probabilistic reduction effects.

Result: Demonstrates that n-gram representations suffice as cognitive units of planning, meaning whole utterances are not necessary to observe probabilistic reduction effects in language model probabilities.

Conclusion: Simpler n-gram representations can effectively serve as cognitive planning units when studying relationships between language model probabilities and cognitive phenomena, reducing the context requirements for such investigations.

Abstract: The primary research questions of this paper center on defining the amount of context that is necessary and/or appropriate when investigating the relationship between language model probabilities and cognitive phenomena. We investigate whether whole utterances are necessary to observe probabilistic reduction and demonstrate that n-gram representations suffice as cognitive units of planning.

[66] Multilingual Hidden Prompt Injection Attacks on LLM-Based Academic Reviewing

Panagiotis Theocharopoulos, Ajinkya Kulkarni, Mathew Magimai. -Doss

Main category: cs.CL

TL;DR: LLM-based academic peer review systems are vulnerable to document-level hidden prompt injection attacks, with varying susceptibility across different languages.

Details

Motivation: As LLMs are increasingly used in high-impact workflows like academic peer review, there's a need to understand their vulnerability to hidden prompt injection attacks that could manipulate review outcomes.

Method: Constructed dataset of ~500 real ICML-accepted papers, injected each with semantically equivalent adversarial prompts in four languages (English, Japanese, Chinese, Arabic), then reviewed using an LLM to measure impact on scores and decisions.

Result: Prompt injection caused substantial changes in review scores and accept/reject decisions for English, Japanese, and Chinese injections, but little to no effect for Arabic injections, revealing language-dependent vulnerabilities.

Conclusion: LLM-based reviewing systems are susceptible to document-level prompt injection attacks, with significant differences in vulnerability across languages, highlighting security risks in high-stakes applications.

Abstract: Large language models (LLMs) are increasingly considered for use in high-impact workflows, including academic peer review. However, LLMs are vulnerable to document-level hidden prompt injection attacks. In this work, we construct a dataset of approximately 500 real academic papers accepted to ICML and evaluate the effect of embedding hidden adversarial prompts within these documents. Each paper is injected with semantically equivalent instructions in four different languages and reviewed using an LLM. We find that prompt injection induces substantial changes in review scores and accept/reject decisions for English, Japanese, and Chinese injections, while Arabic injections produce little to no effect. These results highlight the susceptibility of LLM-based reviewing systems to document-level prompt injection and reveal notable differences in vulnerability across languages.

[67] Fine-Tuning LLMs with Fine-Grained Human Feedback on Text Spans

Sky CH-Wang, Justin Svegliato, Helen Appel, Jason Eisner

Main category: cs.CL

TL;DR: A method for fine-tuning language models using feedback-driven improvement chains where annotators mark liked/disliked spans, models rewrite disliked spans incrementally, and preference pairs are created from adjacent steps for more effective alignment.

Details

Motivation: To improve preference tuning of language models by moving beyond standard A/B preference ranking or full contrastive rewrites, enabling more efficient and effective learning from localized, targeted edits through structured revision-based supervision.

Method: 1) Annotators provide fine-grained feedback by marking “liked” and “disliked” spans in model responses and specifying what they liked/disliked. 2) The base model rewrites disliked spans from left to right, creating incremental improvement chains. 3) Preference pairs are constructed from each adjacent step in the chain for direct alignment training.

Result: The approach outperforms direct alignment methods based on standard A/B preference ranking or full contrastive rewrites, demonstrating that structured, revision-based supervision leads to more efficient and effective preference tuning.

Conclusion: Feedback-driven improvement chains with fine-grained span-level feedback and incremental rewriting provide superior preference tuning compared to traditional methods, enabling models to learn from localized edits in a structured revision process.

Abstract: We present a method and dataset for fine-tuning language models with preference supervision using feedback-driven improvement chains. Given a model response, an annotator provides fine-grained feedback by marking liked'' and disliked’’ spans and specifying what they liked or disliked about them. The base model then rewrites the disliked spans accordingly, proceeding from left to right, forming a sequence of incremental improvements. We construct preference pairs for direct alignment from each adjacent step in the chain, enabling the model to learn from localized, targeted edits. We find that our approach outperforms direct alignment methods based on standard A/B preference ranking or full contrastive rewrites, demonstrating that structured, revision-based supervision leads to more efficient and effective preference tuning.

[68] Eliciting Behaviors in Multi-Turn Conversations

Jing Huang, Shujian Zhang, Lun Wang, Andrew Hard, Rajiv Mathews, John Lambert

Main category: cs.CL

TL;DR: Online behavior elicitation methods outperform static approaches for discovering failure cases in multi-turn LLM conversations, achieving up to 77% success rate with few thousand queries.

Details

Motivation: Existing behavior elicitation methods for LLMs focus mainly on single-turn settings, but real-world conversational AI operates in multi-turn contexts. There's a need to evaluate LLMs in dynamic, multi-turn conversations where complex behaviors emerge over time.

Method: 1) Developed analytical framework categorizing methods into three families: prior knowledge only, offline interactions, and online interactions. 2) Introduced generalized multi-turn formulation of online methods unifying single-turn and multi-turn elicitation. 3) Evaluated all three method families on automatically generating multi-turn test cases, analyzing trade-off between query budget and success rate.

Result: Online methods achieved average success rates of 45%, 19%, and 77% across three tasks with just a few thousand queries. Static methods from existing multi-turn conversation benchmarks found few or no failure cases, demonstrating online methods’ superior efficiency in discovering behavior-eliciting inputs.

Conclusion: Behavior elicitation methods are valuable for multi-turn conversation evaluation, and the community should move towards dynamic benchmarks rather than relying on static test cases. Online interaction-based methods are particularly effective for discovering failure cases in conversational LLMs.

Abstract: Identifying specific and often complex behaviors from large language models (LLMs) in conversational settings is crucial for their evaluation. Recent work proposes novel techniques to find natural language prompts that induce specific behaviors from a target model, yet they are mainly studied in single-turn settings. In this work, we study behavior elicitation in the context of multi-turn conversations. We first offer an analytical framework that categorizes existing methods into three families based on their interactions with the target model: those that use only prior knowledge, those that use offline interactions, and those that learn from online interactions. We then introduce a generalized multi-turn formulation of the online method, unifying single-turn and multi-turn elicitation. We evaluate all three families of methods on automatically generating multi-turn test cases. We investigate the efficiency of these approaches by analyzing the trade-off between the query budget, i.e., the number of interactions with the target model, and the success rate, i.e., the discovery rate of behavior-eliciting inputs. We find that online methods can achieve an average success rate of 45/19/77% with just a few thousand queries over three tasks where static methods from existing multi-turn conversation benchmarks find few or even no failure cases. Our work highlights a novel application of behavior elicitation methods in multi-turn conversation evaluation and the need for the community to move towards dynamic benchmarks.

Yunxin Li, Zhenyu Liu, Baotian Hu, Wei Wang, Yuxin Ding, Xiaochun Cao, Min Zhang

Main category: cs.CL

TL;DR: MKS2 enhances LLMs by storing visual knowledge in modular memory and using multimodal experts for knowledge sharing during text generation.

Details

Motivation: Current MLLMs use LLMs for vision tasks but fail to leverage visual knowledge to enhance LLMs' overall capabilities. The paper aims to enhance LLMs by incorporating multimodal knowledge storage and sharing.

Method: Proposes MKS2 with two components: 1) Modular Visual Memory (MVM) integrated into LLM blocks to store open-world visual information, and 2) soft Mixture of Multimodal Experts (MoMEs) architecture to invoke multimodal knowledge collaboration during text generation.

Result: MKS2 substantially enhances LLMs’ reasoning capabilities in contexts requiring physical or commonsense knowledge, and achieves competitive results on image-text understanding multimodal benchmarks.

Conclusion: The proposed approach successfully enhances LLMs by enabling multimodal knowledge storage and sharing, demonstrating improved reasoning and competitive multimodal understanding performance.

Abstract: Recent advancements in multimodal large language models (MLLMs) have achieved significant multimodal generation capabilities, akin to GPT-4. These models predominantly map visual information into language representation space, leveraging the vast knowledge and powerful text generation abilities of LLMs to produce multimodal instruction-following responses. We could term this method as LLMs for Vision because of its employing LLMs for visual understanding and reasoning, yet observe that these MLLMs neglect the potential of harnessing visual knowledge to enhance the overall capabilities of LLMs, which could be regarded as Vision Enhancing LLMs. In this paper, we propose an approach called MKS2, aimed at enhancing LLMs through empowering Multimodal Knowledge Storage and Sharing in LLMs. Specifically, we introduce Modular Visual Memory (MVM), a component integrated into the internal blocks of LLMs, designed to store open-world visual information efficiently. Additionally, we present a soft Mixture of Multimodal Experts (MoMEs) architecture in LLMs to invoke multimodal knowledge collaboration during text generation. Our comprehensive experiments demonstrate that MKS2 substantially augments the reasoning capabilities of LLMs in contexts necessitating physical or commonsense knowledge. It also delivers competitive results on image-text understanding multimodal benchmarks. The codes will be available at: https://github.com/HITsz-TMG/MKS2-Multimodal-Knowledge-Storage-and-Sharing

[70] Patience Is The Key to Large Language Model Reasoning

Yijiong Yu

Main category: cs.CL

TL;DR: A method to improve LLM reasoning by training models to be more patient and thorough using preference optimization on lightweight datasets, achieving 2.1% gains on GSM8k.

Details

Motivation: Current LLMs either sacrifice detailed reasoning for brevity (due to user preferences) or require extensive expensive training to learn complex reasoning, limiting their ability to solve complex tasks.

Method: Propose a simple test-time scaling method that encourages models to adopt more patient reasoning without new knowledge/skills. Use preference optimization: generate detailed reasoning as positive examples and simple answers as negative examples, training models to favor thorough responses.

Result: Achieves performance increase of up to 2.1% on GSM8k benchmark with training on just a lightweight dataset.

Conclusion: The approach effectively bridges the gap between brevity and thorough reasoning by training models to be more patient, achieving significant performance gains with minimal training data requirements.

Abstract: Recent advancements in the field of large language models, particularly through the Chain of Thought (CoT) approach, have demonstrated significant improvements in solving complex problems. However, existing models either tend to sacrifice detailed reasoning for brevity due to user preferences, or require extensive and expensive training data to learn complicated reasoning ability, limiting their potential in solving complex tasks. To bridge this gap, following the concept of scaling test-time, we propose a simple method by encouraging models to adopt a more patient reasoning style without the need of introducing new knowledge or skills. To employ a preference optimization approach, we generate detailed reasoning processes as positive examples and simple answers as negative examples, thereby training the model to favor thoroughness in its responses. Our results demonstrate a performance increase of up to 2.1% on GSM8k with training just on a lightweight dataset.

[71] The Heap: A Contamination-Free Multilingual Code Dataset for Evaluating Large Language Models

Jonathan Katzy, Razvan Mihai Popescu, Arie van Deursen, Maliheh Izadi

Main category: cs.CL

TL;DR: The Heap is a large multilingual code dataset (57 languages) deduplicated against other open datasets to enable fair LLM evaluations without data contamination concerns.

Details

Motivation: The popularity of LLMs has led to extensive code datasets being used for training, leaving limited clean code available for downstream evaluation without data contamination issues.

Method: Created The Heap dataset covering 57 programming languages with deduplication against other open code datasets to ensure uniqueness.

Result: Released a large multilingual code dataset that enables researchers to conduct fair evaluations of LLMs without significant data cleaning overhead.

Conclusion: The Heap addresses the data contamination problem in LLM evaluation by providing a deduplicated code dataset that supports fair assessment of model performance.

Abstract: The recent rise in the popularity of large language models has spurred the development of extensive code datasets needed to train them. This has left limited code available for collection and use in the downstream investigation of specific behaviors, or evaluation of large language models without suffering from data contamination. To address this problem, we release The Heap, a large multilingual dataset covering 57 programming languages that has been deduplicated with respect to other open datasets of code, enabling researchers to conduct fair evaluations of large language models without significant data cleaning overhead.

[72] Topic-FlipRAG: Topic-Orientated Adversarial Opinion Manipulation Attacks to Retrieval-Augmented Generation Models

Yuyang Gong, Zhuo Chen, Jiawei Liu, Miaokun Chen, Fengchang Yu, Wei Lu, Xiaofeng Wang, Xiaozhong Liu

Main category: cs.CL

TL;DR: Topic-FlipRAG is a two-stage attack pipeline that manipulates RAG systems’ opinions on specific topics through semantic-level perturbations, bypassing current defenses.

Details

Motivation: RAG systems based on LLMs are increasingly influential in shaping public opinion and information dissemination, yet current security research focuses mainly on factual or single-query attacks, leaving more practical topic-oriented opinion manipulation vulnerabilities unaddressed.

Method: Topic-FlipRAG uses a two-stage manipulation pipeline combining traditional adversarial ranking attacks with LLMs’ internal knowledge and reasoning capabilities to craft semantic-level perturbations that influence opinions across related queries.

Result: The attacks effectively shift model outputs’ opinions on specific topics, significantly impacting user information perception, and current mitigation methods cannot defend against such attacks.

Conclusion: The research highlights critical vulnerabilities in RAG systems, demonstrates the need for enhanced safeguards, and provides crucial insights for LLM security research against sophisticated opinion manipulation attacks.

Abstract: Retrieval-Augmented Generation (RAG) systems based on Large Language Models (LLMs) have become essential for tasks such as question answering and content generation. However, their increasing impact on public opinion and information dissemination has made them a critical focus for security research due to inherent vulnerabilities. Previous studies have predominantly addressed attacks targeting factual or single-query manipulations. In this paper, we address a more practical scenario: topic-oriented adversarial opinion manipulation attacks on RAG models, where LLMs are required to reason and synthesize multiple perspectives, rendering them particularly susceptible to systematic knowledge poisoning. Specifically, we propose Topic-FlipRAG, a two-stage manipulation attack pipeline that strategically crafts adversarial perturbations to influence opinions across related queries. This approach combines traditional adversarial ranking attack techniques and leverages the extensive internal relevant knowledge and reasoning capabilities of LLMs to execute semantic-level perturbations. Experiments show that the proposed attacks effectively shift the opinion of the model’s outputs on specific topics, significantly impacting user information perception. Current mitigation methods cannot effectively defend against such attacks, highlighting the necessity for enhanced safeguards for RAG systems, and offering crucial insights for LLM security research.

[73] SelfCheck-Eval: A Multi-Module Framework for Zero-Resource Hallucination Detection in Large Language Models

Diyana Muhammed, Giusy Giulia Tuccari, Gollam Rabby, Sören Auer, Sahar Vahdati

Main category: cs.CL

TL;DR: The paper introduces AIME Math Hallucination dataset and SelfCheck-Eval framework to address LLM hallucinations in mathematical reasoning, showing current methods fail in specialized domains.

Details

Motivation: LLMs generate hallucinations (incorrect/fabricated content) that hinder reliable deployment in high-stakes domains. Current hallucination detection benchmarks are limited to general-knowledge domains and neglect specialized fields like mathematics where accuracy is critical.

Method: 1) Introduce AIME Math Hallucination dataset - first comprehensive benchmark for mathematical reasoning hallucinations. 2) Propose SelfCheck-Eval - LLM-agnostic, black-box hallucination detection framework with multi-module architecture: Semantic module, Specialised Detection module, and Contextual Consistency module.

Result: Evaluation shows systematic performance disparities: existing methods work well on biographical content but struggle significantly with mathematical reasoning. This challenge persists across NLI fine-tuning, preference learning, and process supervision approaches.

Conclusion: Current hallucination detection methods have fundamental limitations in mathematical domains. There’s critical need for specialized, black-box compatible approaches to ensure reliable LLM deployment in high-stakes applications.

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse applications, from open-domain question answering to scientific writing, medical decision support, and legal analysis. However, their tendency to generate incorrect or fabricated content, commonly known as hallucinations, represents a critical barrier to reliable deployment in high-stakes domains. Current hallucination detection benchmarks are limited in scope, focusing primarily on general-knowledge domains while neglecting specialised fields where accuracy is paramount. To address this gap, we introduce the AIME Math Hallucination dataset, the first comprehensive benchmark specifically designed for evaluating mathematical reasoning hallucinations. Additionally, we propose SelfCheck-Eval, a LLM-agnostic, black-box hallucination detection framework applicable to both open and closed-source LLMs. Our approach leverages a novel multi-module architecture that integrates three independent detection strategies: the Semantic module, the Specialised Detection module, and the Contextual Consistency module. Our evaluation reveals systematic performance disparities across domains: existing methods perform well on biographical content but struggle significantly with mathematical reasoning, a challenge that persists across NLI fine-tuning, preference learning, and process supervision approaches. These findings highlight the fundamental limitations of current detection methods in mathematical domains and underscore the critical need for specialised, black-box compatible approaches to ensure reliable LLM deployment.

[74] Atom of Thoughts for Markov LLM Test-Time Scaling

Fengwei Teng, Quan Shi, Zhaoyang Yu, Jiayi Zhang, Yuyu Luo, Chenglin Wu, Zhijiang Guo

Main category: cs.CL

TL;DR: Atom of Thoughts (AoT) is a novel reasoning framework that decomposes complex reasoning into atomic units using Markov chains, improving computational efficiency while maintaining performance.

Details

Motivation: Existing LLM test-time scaling methods suffer from redundant computations due to accumulating historical dependency information during inference, creating inefficiencies in reasoning processes.

Method: Leverages Markov process memoryless property to minimize historical context reliance, creates Markovian reasoning chain, integrates with tree search and reflective refinement, decomposes reasoning into self-contained atomic units.

Result: Extensive experiments show AoT consistently outperforms existing baselines as computational budgets increase, works with various LLMs (both reasoning and non-reasoning), and integrates seamlessly with existing reasoning frameworks.

Conclusion: AoT enables scalable, high-performance inference through atomic reasoning decomposition, offering computational efficiency improvements while maintaining or enhancing performance across different LLMs and reasoning frameworks.

Abstract: Large Language Models (LLMs) have achieved significant performance gains through test-time scaling methods. However, existing approaches often incur redundant computations due to the accumulation of historical dependency information during inference. To address this challenge, we leverage the memoryless property of Markov processes to minimize reliance on historical context and propose a Markovian reasoning process. This foundational Markov chain structure enables seamless integration with various test-time scaling methods, thereby improving their scaling efficiency. By further scaling up the Markovian reasoning chain through integration with techniques such as tree search and reflective refinement, we uncover an emergent atomic reasoning structure, where reasoning trajectories are decomposed into a series of self-contained, low-complexity atomic units. We name this design Atom of Thoughts (\our). Extensive experiments demonstrate that \our consistently outperforms existing baselines as computational budgets increase. Importantly, \our integrates seamlessly with existing reasoning frameworks and different LLMs (both reasoning and non-reasoning), facilitating scalable, high-performance inference.We submit our code alongside this paper and will make it publicly available to facilitate reproducibility and future research.

[75] Who Writes What: Unveiling the Impact of Author Roles on AI-generated Text Detection

Jiatao Li, Xiaojun Wan

Main category: cs.CL

TL;DR: AI text detectors show significant biases based on author characteristics like language proficiency and environment, requiring more socially aware detection systems.

Details

Motivation: Current AI text detection approaches overlook how author sociolinguistic attributes (gender, language proficiency, academic field, language environment) affect detection accuracy, potentially leading to unfair penalization of specific demographic groups.

Method: Used ICNALE corpus of human-authored texts and parallel AI-generated texts from diverse LLMs, evaluated with multi-factor ANOVA and weighted least squares (WLS) statistical framework to analyze detector biases.

Result: CEFR proficiency and language environment consistently affected detector accuracy, while gender and academic field showed detector-dependent effects, revealing significant biases in current AI text detection systems.

Conclusion: The study highlights the need for socially aware AI text detection to avoid unfair penalization, provides empirical evidence and statistical framework for developing more equitable detection systems, and paves way for bias mitigation research.

Abstract: The rise of Large Language Models (LLMs) necessitates accurate AI-generated text detection. However, current approaches largely overlook the influence of author characteristics. We investigate how sociolinguistic attributes-gender, CEFR proficiency, academic field, and language environment-impact state-of-the-art AI text detectors. Using the ICNALE corpus of human-authored texts and parallel AI-generated texts from diverse LLMs, we conduct a rigorous evaluation employing multi-factor ANOVA and weighted least squares (WLS). Our results reveal significant biases: CEFR proficiency and language environment consistently affected detector accuracy, while gender and academic field showed detector-dependent effects. These findings highlight the crucial need for socially aware AI text detection to avoid unfairly penalizing specific demographic groups. We offer novel empirical evidence, a robust statistical framework, and actionable insights for developing more equitable and reliable detection systems in real-world, out-of-domain contexts. This work paves the way for future research on bias mitigation, inclusive evaluation benchmarks, and socially responsible LLM detectors.

[76] Forecasting Clinical Risk from Textual Time Series: Structuring Narratives for Temporal AI in Healthcare

Shahriar Noroozizadeh, Sayantan Kumar, Jeremy C. Weiss

Main category: cs.CL

TL;DR: The paper introduces forecasting from textual time series using LLM-extracted clinical findings, showing encoder-based models excel at event prediction while decoder models perform better at survival analysis, highlighting the importance of time ordering over text ordering.

Details

Motivation: Clinical case reports contain valuable temporal patient trajectories that are underutilized by traditional ML methods relying on structured data. There's a need to better leverage temporal information from clinical text for forecasting tasks.

Method: Used LLM-assisted annotation pipeline to extract timestamped clinical findings from text. Evaluated diverse models including fine-tuned decoder-based LLMs and encoder-based transformers on event occurrence prediction, temporal ordering, and survival analysis tasks.

Result: Encoder-based models achieved higher F1 scores and better temporal concordance for short- and long-horizon event forecasting. Fine-tuned masking approaches improved ranking performance. Instruction-tuned decoder models performed better at survival analysis, especially for early prognosis. Time ordering proved more important than text ordering.

Conclusion: Time-ordered clinical corpora provide additional benefits beyond text ordering for temporal tasks, with encoder models excelling at event forecasting and decoder models at survival analysis, highlighting the value of temporal structure in the LLM era.

Abstract: Clinical case reports encode temporal patient trajectories that are often underexploited by traditional machine learning methods relying on structured data. In this work, we introduce the forecasting problem from textual time series, where timestamped clinical findings – extracted via an LLM-assisted annotation pipeline – serve as the primary input for prediction. We systematically evaluate a diverse suite of models, including fine-tuned decoder-based large language models and encoder-based transformers, on tasks of event occurrence prediction, temporal ordering, and survival analysis. Our experiments reveal that encoder-based models consistently achieve higher F1 scores and superior temporal concordance for short- and long-horizon event forecasting, while fine-tuned masking approaches enhance ranking performance. In contrast, instruction-tuned decoder models demonstrate a relative advantage in survival analysis, especially in early prognosis settings. Our sensitivity analyses further demonstrate the importance of time ordering, which requires clinical time series construction, as compared to text ordering, the format of the text inputs that LLMs are classically trained on. This highlights the additional benefit that can be ascertained from time-ordered corpora, with implications for temporal tasks in the era of widespread LLM use.

Jiatao Li, Yanheng Li, Xiaojun Wan

Main category: cs.CL

TL;DR: This paper introduces the Social Worldview Taxonomy (SWT) to evaluate LLMs’ implicit socio-cognitive attitudes (worldviews) toward authority, equality, autonomy, and fate, showing these attitudes are adaptable to social cues rather than fixed biases.

Details

Motivation: LLMs significantly influence social interactions and information dissemination, but previous studies focus on fixed demographic/ethical biases rather than deeper cognitive orientations that adapt to dynamic social contexts.

Method: Developed Social Worldview Taxonomy (SWT) based on Cultural Theory to operationalize four canonical worldviews (Hierarchy, Egalitarianism, Individualism, Fatalism) into quantifiable dimensions. Analyzed 28 diverse LLMs and used Social Referencing Theory principles to test how explicit social cues modulate worldview profiles.

Result: Identified distinct cognitive profiles reflecting intrinsic model-specific socio-cognitive structures. Found that explicit social cues systematically modulate these profiles, revealing robust patterns of cognitive adaptability in LLMs.

Conclusion: The study provides insights into LLMs’ latent cognitive flexibility and offers practical pathways for developing more transparent, interpretable, and socially responsible AI systems by understanding their adaptable worldview structures.

Abstract: Large Language Models significantly influence social interactions, decision-making, and information dissemination, underscoring the need to understand the implicit socio-cognitive attitudes, referred to as “worldviews”, encoded within these systems. Unlike previous studies predominantly addressing demographic and ethical biases as fixed attributes, our study explores deeper cognitive orientations toward authority, equality, autonomy, and fate, emphasizing their adaptability in dynamic social contexts. We introduce the Social Worldview Taxonomy (SWT), an evaluation framework grounded in Cultural Theory, operationalizing four canonical worldviews, namely Hierarchy, Egalitarianism, Individualism, and Fatalism, into quantifiable sub-dimensions. Through extensive analysis of 28 diverse LLMs, we identify distinct cognitive profiles reflecting intrinsic model-specific socio-cognitive structures. Leveraging principles from Social Referencing Theory, our experiments demonstrate that explicit social cues systematically modulate these profiles, revealing robust patterns of cognitive adaptability. Our findings provide insights into the latent cognitive flexibility of LLMs and offer computational scientists practical pathways toward developing more transparent, interpretable, and socially responsible AI systems

[78] DIF: A Framework for Benchmarking and Verifying Implicit Bias in LLMs

Lake Yin, Fan Huang

Main category: cs.CL

TL;DR: The paper introduces DIF (Demographic Implicit Fairness), a benchmark method to measure implicit bias in LLMs by evaluating their performance on logic/math problems with sociodemographic personas, revealing an inverse relationship between accuracy and bias.

Details

Motivation: There's growing concern about biases in LLMs inherited from training data, and while previous studies have shown implicit bias in LLM responses to different social contexts, there are no standard methods to benchmark this specific type of bias. The authors argue that implicit bias is both an ethical and technical issue, revealing LLMs' inability to properly handle extraneous information.

Method: Developed DIF (Demographic Implicit Fairness) benchmark by evaluating preexisting LLM logic and math problem datasets with sociodemographic personas, combined with a statistical robustness check using a null model. This creates an easily interpretable benchmark for measuring implicit bias.

Result: The method successfully validates the presence of implicit bias in LLM behavior and discovers a novel inverse trend between question answering accuracy and implicit bias - as accuracy increases, implicit bias decreases, supporting the authors’ argument about the technical nature of this bias.

Conclusion: The DIF benchmark provides a standardized way to measure implicit bias in LLMs, revealing it as both an ethical concern and a technical limitation. The inverse relationship between accuracy and bias suggests that improving LLM reasoning capabilities may help reduce implicit bias.

Abstract: As Large Language Models (LLMs) have risen in prominence over the past few years, there has been concern over the potential biases in LLMs inherited from the training data. Previous studies have examined how LLMs exhibit implicit bias, such as when response generation changes when different social contexts are introduced. We argue that this implicit bias is not only an ethical, but also a technical issue, as it reveals an inability of LLMs to accommodate extraneous information. However, unlike other measures of LLM intelligence, there are no standard methods to benchmark this specific subset of LLM bias. To bridge this gap, we developed a method for calculating an easily interpretable benchmark, DIF (Demographic Implicit Fairness), by evaluating preexisting LLM logic and math problem datasets with sociodemographic personas, which is combined with a statistical robustness check using a null model. We demonstrate that this method can validate the presence of implicit bias in LLM behavior and find an novel inverse trend between question answering accuracy and implicit bias, supporting our argument.

[79] To Bias or Not to Bias: Detecting bias in News with bias-detector

Himel Ghosh, Ahmed Mosharafa, Georg Groh

Main category: cs.CL

TL;DR: Fine-tuning RoBERTa on BABE dataset achieves statistically significant improvements in sentence-level media bias detection over baseline models, with attention analysis showing meaningful contextual understanding.

Details

Motivation: Media bias detection is crucial for fair information dissemination but remains challenging due to subjectivity of bias and scarcity of high-quality annotated data.

Method: Fine-tuned RoBERTa-based model on expert-annotated BABE dataset, used McNemar’s test and 5x2 cross-validation paired t-test for statistical validation, combined with attention-based analysis and pipeline integration with existing bias-type classifier.

Result: Statistically significant performance improvements over DA-RoBERTa baseline, model avoids oversensitivity to politically charged terms and attends to contextually relevant tokens, exhibits good generalization and interpretability despite dataset limitations.

Conclusion: The approach contributes to building robust, explainable, and socially responsible NLP systems for media bias detection, with future directions including context-aware modeling, bias neutralization, and advanced bias type classification.

Abstract: Media bias detection is a critical task in ensuring fair and balanced information dissemination, yet it remains challenging due to the subjectivity of bias and the scarcity of high-quality annotated data. In this work, we perform sentence-level bias classification by fine-tuning a RoBERTa-based model on the expert-annotated BABE dataset. Using McNemar’s test and the 5x2 cross-validation paired t-test, we show statistically significant improvements in performance when comparing our model to a domain-adaptively pre-trained DA-RoBERTa baseline. Furthermore, attention-based analysis shows that our model avoids common pitfalls like oversensitivity to politically charged terms and instead attends more meaningfully to contextually relevant tokens. For a comprehensive examination of media bias, we present a pipeline that combines our model with an already-existing bias-type classifier. Our method exhibits good generalization and interpretability, despite being constrained by sentence-level analysis and dataset size because of a lack of larger and more advanced bias corpora. We talk about context-aware modeling, bias neutralization, and advanced bias type classification as potential future directions. Our findings contribute to building more robust, explainable, and socially responsible NLP systems for media bias detection.

[80] A Large Language Model Based Pipeline for Review of Systems Entity Recognition from Clinical Notes

Hieu Nghiem, Zhuqi Miao, Hemanth Reddy Singareddy, Jivan Lamichhane, Abdulaziz Ahmed, Johnson Thomas, Dursun Delen, William Paiva

Main category: cs.CL

TL;DR: Developed a cost-effective LLM pipeline for extracting Review of Systems entities from clinical notes using open-source models and a novel attribution algorithm for text alignment.

Details

Motivation: To create a scalable, locally deployable solution that reduces ROS documentation burden in healthcare, particularly for resource-limited settings, using open-source LLMs instead of expensive proprietary models.

Method: Pipeline extracts ROS sections using SecTag header terminology, then applies few-shot LLMs (llama3.1:8b, gemma3:27b, mistral3.1:24b, gpt-oss:20b) to identify ROS entities, their status (positive/negative), and body systems. Introduced novel attribution algorithm for aligning LLM-identified entities with source text.

Result: Open-source LLMs delivered promising performance with highest F1 score = 0.952. Larger models (Gemma, Mistral, Gpt-oss) performed robustly across all tasks. Attribution algorithm improved all models’ performance metrics (higher F1, accuracy; lower error). Smaller Llama model achieved good results with only one-third VRAM.

Conclusion: The pipeline provides a scalable, locally deployable solution for ROS documentation. Open-source LLMs offer practical AI for resource-limited healthcare. The attribution algorithm improves zero- and few-shot LLM performance in named entity recognition tasks.

Abstract: Objective: Develop a cost-effective, large language model (LLM)-based pipeline for automatically extracting Review of Systems (ROS) entities from clinical notes. Materials and Methods: The pipeline extracts ROS section from the clinical note using SecTag header terminology, followed by few-shot LLMs to identify ROS entities such as diseases or symptoms, their positive/negative status and associated body systems. We implemented the pipeline using 4 open-source LLM models: llama3.1:8b, gemma3:27b, mistral3.1:24b and gpt-oss:20b. Additionally, we introduced a novel attribution algorithm that aligns LLM-identified ROS entities with their source text, addressing non-exact and synonymous matches. The evaluation was conducted on 24 general medicine notes containing 340 annotated ROS entities. Results: Open-source LLMs enable a local, cost-efficient pipeline while delivering promising performance. Larger models like Gemma, Mistral, and Gpt-oss demonstrate robust performance across three entity recognition tasks of the pipeline: ROS entity extraction, negation detection and body system classification (highest F1 score = 0.952). With the attribution algorithm, all models show improvements across key performance metrics, including higher F1 score and accuracy, along with lower error rate. Notably, the smaller Llama model also achieved promising results despite using only one-third the VRAM of larger models. Discussion and Conclusion: From an application perspective, our pipeline provides a scalable, locally deployable solution to easing the ROS documentation burden. Open-source LLMs offer a practical AI option for resource-limited healthcare settings. Methodologically, our newly developed algorithm facilitates accuracy improvements for zero- and few-shot LLMs in named entity recognition.

[81] Iterative Multilingual Spectral Attribute Erasure

Shun Shao, Yftah Ziser, Zheng Zhao, Yifu Qiu, Shay B. Cohen, Anna Korhonen

Main category: cs.CL

TL;DR: IMSAE is a multilingual debiasing method that identifies and removes joint bias subspaces across multiple languages using iterative SVD truncation, enabling effective debiasing even in zero-shot scenarios where target language data is unavailable.

Details

Motivation: Multilingual representations create opportunities to transfer debiasing effects between languages, but existing methods can't exploit this because they operate on individual languages. There's a need for methods that can leverage cross-lingual semantic alignment for bias mitigation.

Method: Iterative Multilingual Spectral Attribute Erasure (IMSAE) identifies and mitigates joint bias subspaces across multiple languages through iterative SVD-based truncation. It works by finding common bias directions that span multiple languages and removing them.

Result: IMSAE outperforms traditional monolingual and cross-lingual approaches across eight languages and five demographic dimensions. It shows effectiveness in both standard and zero-shot settings, maintaining model utility while reducing bias across diverse language models (BERT, LLaMA, Mistral).

Conclusion: IMSAE successfully leverages multilingual representations to transfer debiasing effects between languages, offering an effective solution for multilingual bias mitigation that works even when target language data is unavailable by using linguistically similar languages.

Abstract: Multilingual representations embed words with similar meanings to share a common semantic space across languages, creating opportunities to transfer debiasing effects between languages. However, existing methods for debiasing are unable to exploit this opportunity because they operate on individual languages. We present Iterative Multilingual Spectral Attribute Erasure (IMSAE), which identifies and mitigates joint bias subspaces across multiple languages through iterative SVD-based truncation. Evaluating IMSAE across eight languages and five demographic dimensions, we demonstrate its effectiveness in both standard and zero-shot settings, where target language data is unavailable, but linguistically similar languages can be used for debiasing. Our comprehensive experiments across diverse language models (BERT, LLaMA, Mistral) show that IMSAE outperforms traditional monolingual and cross-lingual approaches while maintaining model utility.

[82] Improving Large Language Model Safety with Contrastive Representation Learning

Samuel Simko, Mrinmaya Sachan, Bernhard Schölkopf, Zhijing Jin

Main category: cs.CL

TL;DR: Proposes a contrastive representation learning framework for LLM defense using triplet loss with adversarial hard negative mining to separate benign and harmful representations.

Details

Motivation: LLMs are vulnerable to adversarial attacks, and existing defenses struggle to generalize across different attack types. Representation engineering offers promising alternatives for improving model robustness.

Method: Formulates model defense as a contrastive representation learning problem. Uses triplet-based loss combined with adversarial hard negative mining to encourage separation between benign and harmful representations during fine-tuning.

Result: Outperforms prior representation engineering-based defenses, improving robustness against both input-level and embedding-space attacks without compromising standard performance across multiple models.

Conclusion: The contrastive representation learning framework provides an effective defense mechanism that generalizes well across different attack types while maintaining model performance on standard tasks.

Abstract: Large Language Models (LLMs) are powerful tools with profound societal impacts, yet their ability to generate responses to diverse and uncontrolled inputs leaves them vulnerable to adversarial attacks. While existing defenses often struggle to generalize across varying attack types, recent advancements in representation engineering offer promising alternatives. In this work, we propose a defense framework that formulates model defense as a contrastive representation learning (CRL) problem. Our method finetunes a model using a triplet-based loss combined with adversarial hard negative mining to encourage separation between benign and harmful representations. Our experimental results across multiple models demonstrate that our approach outperforms prior representation engineering-based defenses, improving robustness against both input-level and embedding-space attacks without compromising standard performance. Our code is available at https://github.com/samuelsimko/crl-llm-defense

[83] LLMEval-Fair: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models

Ming Zhang, Yujiong Shen, Jingyi Deng, Yuhui Wang, Huayu Sha, Kexin Tan, Qiyuan Peng, Yue Zhang, Junzhe Wang, Shichun Liu, Yueyuan Huang, Jingqi Tong, Changhao Jiang, Yilong Wu, Zhihao Zhang, Mingqi Wu, Mingxu Chai, Zhiheng Xi, Shihan Dou, Tao Gui, Qi Zhang, Xuanjing Huang

Main category: cs.CL

TL;DR: LLMEval-Fair is a dynamic evaluation framework for LLMs that uses a proprietary question bank to sample unseen test sets, addressing data contamination and leaderboard overfitting issues in static benchmarks.

Details

Motivation: Static benchmarks for LLM evaluation suffer from data contamination (models being trained on test data) and leaderboard overfitting, which obscure true model capabilities and create misleading performance metrics.

Method: Built on 220k graduate-level questions, LLMEval-Fair dynamically samples unseen test sets for each evaluation. It features contamination-resistant data curation, anti-cheating architecture, and LLM-as-a-judge with 90% human agreement, plus relative ranking for fair comparison.

Result: 30-month study of nearly 60 leading models reveals performance ceiling on knowledge memorization and exposes data contamination vulnerabilities. The framework shows exceptional robustness in ranking stability and consistency.

Conclusion: LLMEval-Fair provides a robust methodology for assessing true LLM capabilities beyond leaderboard scores, promoting more trustworthy evaluation standards through dynamic evaluation paradigm.

Abstract: Existing evaluation of Large Language Models (LLMs) on static benchmarks is vulnerable to data contamination and leaderboard overfitting, critical issues that obscure true model capabilities. To address this, we introduce LLMEval-Fair, a framework for dynamic evaluation of LLMs. LLMEval-Fair is built on a proprietary bank of 220k graduate-level questions, from which it dynamically samples unseen test sets for each evaluation run. Its automated pipeline ensures integrity via contamination-resistant data curation, a novel anti-cheating architecture, and a calibrated LLM-as-a-judge process achieving 90% agreement with human experts, complemented by a relative ranking system for fair comparison. A 30-month longitudinal study of nearly 60 leading models reveals a performance ceiling on knowledge memorization and exposes data contamination vulnerabilities undetectable by static benchmarks. The framework demonstrates exceptional robustness in ranking stability and consistency, providing strong empirical validation for the dynamic evaluation paradigm. LLMEval-Fair offers a robust and credible methodology for assessing the true capabilities of LLMs beyond leaderboard scores, promoting the development of more trustworthy evaluation standards.

[84] Learning the Topic, Not the Language: How LLMs Classify Online Immigration Discourse Across Languages

Andrea Nasuto, Stefano Maria Iacus, Francisco Rowe, Devika Jain

Main category: cs.CL

TL;DR: Researchers developed a lightweight, open-source LLM framework using fine-tuned LLaMA 3.2-3B models to classify immigration-related tweets across 13 languages, achieving massive speed/cost improvements over commercial LLMs while enabling multilingual social science research.

Details

Motivation: Current use of LLMs in multilingual social science research is limited by model size, cost, and linguistic bias. There's a need for scalable, inclusive analysis of online discourse across different languages and cultural contexts.

Method: Fine-tuned LLaMA 3.2-3B models for topic classification and stance detection on immigration-related tweets across 13 languages. Used minimal data from under-represented languages to correct pretraining biases, avoiding translation pipelines and proprietary systems.

Result: The framework achieved 26-168x faster inference and over 1000x cost savings compared to commercial LLMs. Models fine-tuned in just 1-2 languages could generalize topic understanding to unseen languages, though multilingual fine-tuning improved ideological nuance capture.

Conclusion: This scale-first, open-source framework enables inclusive, reproducible research on public attitudes across linguistic and cultural contexts, supporting real-time analysis of billions of tweets while addressing cost and bias limitations of existing approaches.

Abstract: Large language models (LLMs) offer new opportunities for scalable analysis of online discourse. Yet their use in multilingual social science research remains constrained by model size, cost and linguistic bias. We develop a lightweight, open-source LLM framework using fine-tuned LLaMA 3.2-3B models to classify immigration-related tweets across 13 languages. Unlike prior work relying on BERT style models or translation pipelines, we combine topic classification with stance detection and demonstrate that LLMs fine-tuned in just one or two languages can generalize topic understanding to unseen languages. Capturing ideological nuance, however, benefits from multilingual fine-tuning. Our approach corrects pretraining biases with minimal data from under-represented languages and avoids reliance on proprietary systems. With 26-168x faster inference and over 1000x cost savings compared to commercial LLMs, our method supports real-time analysis of billions of tweets. This scale-first framework enables inclusive, reproducible research on public attitudes across linguistic and cultural contexts.

[85] DySK-Attn: A Framework for Efficient, Real-Time Knowledge Updating in Large Language Models via Dynamic Sparse Knowledge Attention

Kabir Khan, Priya Sharma, Arjun Mehta, Neha Gupta, Ravi Narayanan

Main category: cs.CL

TL;DR: DySK-Attn is a framework that enables LLMs to efficiently integrate real-time knowledge from dynamic external sources using sparse knowledge attention over a knowledge graph.

Details

Motivation: LLMs have static knowledge that quickly becomes outdated, and retraining them is computationally prohibitive. Existing knowledge editing techniques are slow and can introduce side effects.

Method: Synergizes LLM with a dynamic Knowledge Graph that can be updated instantaneously. Uses sparse knowledge attention mechanism for coarse-to-fine grained search to identify relevant facts from KG, avoiding dense attention over entire knowledge base.

Result: Significantly outperforms strong baselines (RAG and model editing techniques) on time-sensitive QA tasks in both factual accuracy for updated knowledge and computational efficiency.

Conclusion: DySK-Attn offers a scalable and effective solution for building LLMs that can stay current with the ever-changing world.

Abstract: Large Language Models (LLMs) suffer from a critical limitation: their knowledge is static and quickly becomes outdated. Retraining these massive models is computationally prohibitive, while existing knowledge editing techniques can be slow and may introduce unforeseen side effects. To address this, we propose DySK-Attn, a novel framework that enables LLMs to efficiently integrate real-time knowledge from a dynamic external source. Our approach synergizes an LLM with a dynamic Knowledge Graph (KG) that can be updated instantaneously. The core of our framework is a sparse knowledge attention mechanism, which allows the LLM to perform a coarse-to-fine grained search, efficiently identifying and focusing on a small, highly relevant subset of facts from the vast KG. This mechanism avoids the high computational cost of dense attention over the entire knowledge base and mitigates noise from irrelevant information. We demonstrate through extensive experiments on time-sensitive question-answering tasks that DySK-Attn significantly outperforms strong baselines, including standard Retrieval-Augmented Generation (RAG) and model editing techniques, in both factual accuracy for updated knowledge and computational efficiency. Our framework offers a scalable and effective solution for building LLMs that can stay current with the ever-changing world.

[86] Leveraging Large Language Models for Rare Disease Named Entity Recognition

Nan Miles Xi, Yu Deng, Lin Wang

Main category: cs.CL

TL;DR: GPT-4o achieves competitive rare disease NER performance using prompt-based strategies, with task-level fine-tuning outperforming BioClinicalBERT baseline, showing LLMs as scalable alternatives in low-resource biomedical settings.

Details

Motivation: Rare disease NER faces challenges from limited labeled data, semantic ambiguity between entity types, and long-tail distributions, requiring effective solutions for low-resource biomedical applications.

Method: Evaluated GPT-4o using multiple prompt-based strategies: zero-shot prompting, few-shot in-context learning, retrieval-augmented generation (RAG), and task-level fine-tuning. Introduced structured prompting framework with domain knowledge encoding and two semantically guided few-shot example selection methods.

Result: GPT-4o achieved competitive/superior performance vs BioClinicalBERT on RareDis Corpus. Task-level fine-tuning performed best. Few-shot prompting offered high returns at low token budgets. RAG improved recall for challenging entities like signs/symptoms. Error analysis revealed boundary drift and type confusion issues.

Conclusion: Prompt-optimized LLMs serve as effective, scalable alternatives to traditional supervised models for biomedical NER, especially in rare disease applications with scarce annotated data.

Abstract: Named Entity Recognition (NER) in the rare disease domain poses unique challenges due to limited labeled data, semantic ambiguity between entity types, and long-tail distributions. In this study, we evaluate the capabilities of GPT-4o for rare disease NER under low-resource settings, using a range of prompt-based strategies including zero-shot prompting, few-shot in-context learning, retrieval-augmented generation (RAG), and task-level fine-tuning. We design a structured prompting framework that encodes domain-specific knowledge and disambiguation rules for four entity types. We further introduce two semantically guided few-shot example selection methods to improve in-context performance while reducing labeling effort. Experiments on the RareDis Corpus show that GPT-4o achieves competitive or superior performance compared to BioClinicalBERT, with task-level fine-tuning yielding the strongest performance among the evaluated approaches and improving upon the previously reported BioClinicalBERT baseline. Cost-performance analysis reveals that few-shot prompting delivers high returns at low token budgets. RAG provides limited overall gains but can improve recall for challenging entity types, especially signs and symptoms. An error taxonomy highlights common failure modes such as boundary drift and type confusion, suggesting opportunities for post-processing and hybrid refinement. Our results demonstrate that prompt-optimized LLMs can serve as effective, scalable alternatives to traditional supervised models in biomedical NER, particularly in rare disease applications where annotated data is scarce.

[87] Computational Economics in Large Language Models: Exploring Model Behavior and Incentive Design under Resource Constraints

Sandeep Reddy, Kabir Khan, Rohit Patil, Ananya Chakraborty, Faizan A. Khan, Swati Kulkarni, Arjun Verma, Neha Singh

Main category: cs.CL

TL;DR: A computational economics framework treats LLMs as internal economies of resource-constrained agents (attention heads/neurons) that allocate scarce computation to maximize task utility, enabling 40% FLOPS reduction while preserving accuracy.

Details

Motivation: Large language models have substantial computational costs that limit their practical deployment. The paper aims to address this by developing a principled approach to make LLMs more computationally efficient under resource constraints.

Method: Proposes a computational economics framework treating LLMs as internal economies of resource-constrained agents. Uses incentive-driven training that augments task loss with differentiable computation cost term to encourage sparse and efficient activations.

Result: On GLUE (MNLI, STS-B, CoLA) and WikiText-103, the method produces models tracing a Pareto frontier that consistently dominate post-hoc pruning. Achieves ~40% reduction in FLOPS with similar accuracy, lower latency, and more interpretable attention patterns.

Conclusion: Economic principles offer a principled route to designing efficient, adaptive, and more transparent LLMs under strict resource constraints, moving beyond heuristic approaches to computational efficiency.

Abstract: Large language models (LLMs) are limited by substantial computational cost. We introduce a “computational economics” framework that treats an LLM as an internal economy of resource-constrained agents (attention heads and neuron blocks) that must allocate scarce computation to maximize task utility. First, we show empirically that when computation is scarce, standard LLMs reallocate attention toward high-value tokens while preserving accuracy. Building on this observation, we propose an incentive-driven training paradigm that augments the task loss with a differentiable computation cost term, encouraging sparse and efficient activations. On GLUE (MNLI, STS-B, CoLA) and WikiText-103, the method yields a family of models that trace a Pareto frontier and consistently dominate post-hoc pruning; for a similar accuracy we obtain roughly a forty percent reduction in FLOPS and lower latency, together with more interpretable attention patterns. These results indicate that economic principles offer a principled route to designing efficient, adaptive, and more transparent LLMs under strict resource constraints.

[88] The Cultural Gene of Large Language Models: A Study on the Impact of Cross-Corpus Training on Model Values and Biases

Emanuel Z. Fenech-Borg, Tilen P. Meznaric-Kos, Milica D. Lekovic-Bojovic, Arni J. Hentze-Djurhuus

Main category: cs.CL

TL;DR: LLMs exhibit distinct cultural biases reflecting their training data, with Western models showing individualistic/low-power-distance tendencies and Eastern models showing collectivistic/high-power-distance tendencies.

Details

Motivation: To investigate the cultural assumptions embedded in LLMs and quantify their value orientations, addressing concerns about algorithmic cultural hegemony in globally deployed models.

Method: Created Cultural Probe Dataset (CPD) of 200 prompts targeting Individualism-Collectivism and Power Distance dimensions. Compared GPT-4 (Western) and ERNIE Bot (Eastern) using standardized zero-shot prompts with human annotation. Computed Cultural Alignment Index against Hofstede’s national scores.

Result: Significant divergence: GPT-4 shows individualistic/low-power-distance (IDV≈1.21, PDI≈-1.05), ERNIE Bot shows collectivistic/high-power-distance (IDV≈-0.89, PDI≈0.76). GPT-4 aligns with USA cultural scores, ERNIE Bot aligns with China. Differences statistically significant (p<0.001).

Conclusion: LLMs function as statistical mirrors of their cultural training corpora, necessitating culturally aware evaluation and deployment to prevent algorithmic cultural hegemony in global AI systems.

Abstract: Large language models (LLMs) are deployed globally, yet their underlying cultural and ethical assumptions remain underexplored. We propose the notion of a “cultural gene” – a systematic value orientation that LLMs inherit from their training corpora – and introduce a Cultural Probe Dataset (CPD) of 200 prompts targeting two classic cross-cultural dimensions: Individualism-Collectivism (IDV) and Power Distance (PDI). Using standardized zero-shot prompts, we compare a Western-centric model (GPT-4) and an Eastern-centric model (ERNIE Bot). Human annotation shows significant and consistent divergence across both dimensions. GPT-4 exhibits individualistic and low-power-distance tendencies (IDV score approx 1.21; PDI score approx -1.05), while ERNIE Bot shows collectivistic and higher-power-distance tendencies (IDV approx -0.89; PDI approx 0.76); differences are statistically significant (p < 0.001). We further compute a Cultural Alignment Index (CAI) against Hofstede’s national scores and find GPT-4 aligns more closely with the USA (e.g., IDV CAI approx 0.91; PDI CAI approx 0.88) whereas ERNIE Bot aligns more closely with China (IDV CAI approx 0.85; PDI CAI approx 0.81). Qualitative analyses of dilemma resolution and authority-related judgments illustrate how these orientations surface in reasoning. Our results support the view that LLMs function as statistical mirrors of their cultural corpora and motivate culturally aware evaluation and deployment to avoid algorithmic cultural hegemony.

[89] Vis-CoT: A Human-in-the-Loop Framework for Interactive Visualization and Intervention in LLM Chain-of-Thought Reasoning

Kaviraj Pather, Elena Hadjigeorgiou, Arben Krasniqi, Claire Schmit, Irina Rusu, Marc Pons, Kabir Khan

Main category: cs.CL

TL;DR: Vis-CoT is an interactive framework that converts linear chain-of-thought reasoning into visual graphs, allowing human intervention to correct flawed steps and improve LLM reasoning accuracy.

Details

Motivation: Chain-of-thought reasoning in LLMs is opaque, making verification, debugging, and control difficult in high-stakes settings where reliability and trustworthiness are critical.

Method: Vis-CoT converts linear CoT text into interactive reasoning graphs, enabling users to visualize logical flow, identify flawed steps, and intervene by pruning incorrect paths and grafting new user-defined premises.

Result: Across GSM8K and StrategyQA benchmarks, Vis-CoT improves final-answer accuracy by up to 24 percentage points over non-interactive baselines. User studies show significant gains in perceived usability and trust.

Conclusion: Vis-CoT demonstrates a practical path for more reliable, understandable, and collaborative reasoning by combining LLMs with targeted human oversight, shifting interaction from passive observation to active collaboration.

Abstract: Large language models (LLMs) show strong reasoning via chain-of-thought (CoT) prompting, but the process is opaque, which makes verification, debugging, and control difficult in high-stakes settings. We present Vis-CoT, a human-in-the-loop framework that converts linear CoT text into an interactive reasoning graph. Users can visualize the logical flow, identify flawed steps, and intervene by pruning incorrect paths and grafting new, user-defined premises. This shifts interaction from passive observation to active collaboration, steering models toward more accurate and trustworthy conclusions. Across GSM8K and StrategyQA, Vis-CoT improves final-answer accuracy by up to 24 percentage points over non-interactive baselines. A user study also shows large gains in perceived usability and trust. Vis-CoT points to a practical path for more reliable, understandable, and collaborative reasoning by combining LLMs with targeted human oversight.

[90] Trusted Uncertainty in Large Language Models: A Unified Framework for Confidence Calibration and Risk-Controlled Refusal

Markus Oehri, Giulia Conti, Kaviraj Pather, Alexandre Rossi, Laia Serra, Adrian Parody, Rogvi Johannesen, Aviaja Petersen, Arben Krasniqi

Main category: cs.CL

TL;DR: UniCR is a unified framework that converts various uncertainty evidence into calibrated correctness probabilities and enforces user-specified error budgets through principled refusal mechanisms.

Details

Motivation: Language models need to know not only what to answer but also when not to answer, requiring reliable uncertainty estimation and calibrated refusal mechanisms to improve trustworthiness.

Method: UniCR learns a lightweight calibration head with temperature scaling and proper scoring, supports API-only models via black-box features, uses conformal risk control for distribution-free guarantees, and aligns confidence with semantic fidelity for long-form generation using atomic factuality scores from retrieved evidence.

Result: Experiments on short-form QA, code generation with execution tests, and retrieval-augmented long-form QA show consistent improvements in calibration metrics, lower area under risk-coverage curve, and higher coverage at fixed risk compared to baselines.

Conclusion: UniCR provides a portable recipe for evidence fusion to calibrated probability to risk-controlled decision that improves trustworthiness without fine-tuning base models and remains valid under distribution shift, with evidence contradiction, semantic dispersion, and tool inconsistency being key abstention drivers.

Abstract: Deployed language models must decide not only what to answer but also when not to answer. We present UniCR, a unified framework that turns heterogeneous uncertainty evidence including sequence likelihoods, self-consistency dispersion, retrieval compatibility, and tool or verifier feedback into a calibrated probability of correctness and then enforces a user-specified error budget via principled refusal. UniCR learns a lightweight calibration head with temperature scaling and proper scoring, supports API-only models through black-box features, and offers distribution-free guarantees using conformal risk control. For long-form generation, we align confidence with semantic fidelity by supervising on atomic factuality scores derived from retrieved evidence, reducing confident hallucinations while preserving coverage. Experiments on short-form QA, code generation with execution tests, and retrieval-augmented long-form QA show consistent improvements in calibration metrics, lower area under the risk-coverage curve, and higher coverage at fixed risk compared to entropy or logit thresholds, post-hoc calibrators, and end-to-end selective baselines. Analyses reveal that evidence contradiction, semantic dispersion, and tool inconsistency are the dominant drivers of abstention, yielding informative user-facing refusal messages. The result is a portable recipe of evidence fusion to calibrated probability to risk-controlled decision that improves trustworthiness without fine-tuning the base model and remains valid under distribution shift.

[91] No Prompt Left Behind: Exploiting Zero-Variance Prompts in LLM Reinforcement Learning via Entropy-Guided Advantage Shaping

Thanh-Long V. Le, Myeongho Jeon, Kim Vu, Viet Lai, Eunho Yang

Main category: cs.CL

TL;DR: RL-ZVP improves RLVR by learning from zero-variance prompts where all responses get same reward, achieving significant gains over methods that ignore such prompts.

Details

Motivation: Current RLVR methods like GRPO ignore zero-variance prompts where all model responses receive the same reward, missing valuable learning opportunities.

Method: RL-ZVP extracts learning signals from zero-variance prompts by directly rewarding correctness and penalizing errors without contrasting responses, using token-level characteristics to modulate feedback.

Result: Across six math reasoning benchmarks, RL-ZVP achieves up to 8.61 points accuracy and 7.77 points pass rate improvements over GRPO, consistently outperforming baselines that filter zero-variance prompts.

Conclusion: Zero-variance prompts contain valuable learning signals, and RL-ZVP demonstrates their untapped potential in RLVR for improving LLM reasoning abilities.

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) is a powerful framework for improving the reasoning abilities of Large Language Models (LLMs). However, current methods such as GRPO rely only on problems where the model responses to the same input differ in correctness, while ignoring those where all responses receive the same reward – so-called zero-variance prompts. In this work, we argue that such prompts are not useless but can, in fact, provide meaningful feedback for policy optimization. To this end, we introduce RL with Zero-Variance Prompts (RL-ZVP), a novel algorithm that extract learning signals from zero-variance prompts. RL-ZVP directly rewards correctness and penalizes errors even without contrasting responses, modulating feedback with token-level characteristics to preserve informative, nuanced signals. Across six math reasoning benchmarks, RL-ZVP achieves significant improvements of up to 8.61 points in accuracy and 7.77 points in pass rate over GRPO, while consistently outperforming other baselines that filter out zero-variance prompts. These results highlight the untapped potential of learning from zero-variance prompts in RLVR.

[92] Breadcrumbs Reasoning: Memory-Efficient Reasoning with Compression Beacons

Giovanni Monea, Yair Feldman, Shankar Padmanabhan, Kianté Brantley, Yoav Artzi

Main category: cs.CL

TL;DR: A method to compress Transformer KV cache during long-context reasoning by learning to compress past tokens into special-purpose tokens, reducing memory/computational costs while maintaining accuracy.

Details

Motivation: Large language models face scalability issues for long-context reasoning due to linear growth of Transformer key-value cache, which causes significant memory and computational overhead. The authors observe that as models generate reasoning tokens, the informational value of past tokens diminishes, creating compression opportunities.

Method: Periodically compress generation KV cache using learned, special-purpose tokens and evict compressed entries. Train the model via modified joint distillation and reinforcement learning framework that leverages RL outputs for distillation, minimizing training overhead.

Result: Achieves superior memory-accuracy Pareto frontier compared to both baseline model without cache compression and training-free compression techniques.

Conclusion: The proposed learned compression method effectively reduces KV cache memory/computational costs for long-context reasoning while maintaining model accuracy, offering a practical solution to Transformer scalability limitations.

Abstract: The scalability of large language models for long-context reasoning is severely constrained by the linear growth of their Transformer key-value cache, which incurs significant memory and computational costs. We posit that as a model generates reasoning tokens, the informational value of past generated tokens diminishes, creating an opportunity for compression. In this work, we propose to periodically compress the generation KV cache with a learned, special-purpose token and evict compressed entries. We train the model to perform this compression via a modified joint distillation and reinforcement learning (RL) framework. Our training method minimizes overhead over the conventional RL process, as it leverages RL outputs for distillation. Empirically, our method achieves a superior memory-accuracy Pareto frontier compared to both the model without cache compression and training-free compression techniques.

[93] Attention Is All You Need for KV Cache in Diffusion LLMs

Quan Nguyen-Tri, Mukul Ranjan, Zhiqiang Shen

Main category: cs.CL

TL;DR: Elastic-Cache is a training-free method that adaptively recomputes KV caches for diffusion LLMs to reduce redundant computation and accelerate decoding while maintaining generation quality.

Details

Motivation: Current diffusion LLMs recompute QKV for all tokens at every denoising step and layer, despite KV states changing little across most steps, especially in shallow layers, leading to substantial computational redundancy and latency.

Method: Elastic-Cache makes three key observations about KV cache behavior and uses them to create an adaptive strategy: (1) distant MASK tokens act as length-bias and can be cached block-wise, (2) KV dynamics increase with depth, allowing selective refresh from deeper layers, and (3) the most-attended token shows smallest KV drift, providing a conservative bound. The method jointly decides when to refresh (via attention-aware drift test) and where to refresh (via depth-aware schedule).

Result: Experiments on LLaDA-Instruct, LLaDA-1.5, and LLaDA-V across mathematical reasoning and code generation show consistent speedups: 8.7× on GSM8K (256 tokens), 45.1× on longer sequences, with higher accuracy than baseline. Achieves 6.8× higher throughput on GSM8K than confidence-based approaches while preserving generation quality.

Conclusion: Elastic-Cache enables practical deployment of diffusion LLMs by significantly reducing computational redundancy through adaptive, layer-aware cache updates, achieving substantial speedups with negligible quality loss.

Abstract: This work studies how to adaptively recompute key-value (KV) caches for diffusion large language models (DLMs) to maximize prediction accuracy while minimizing decoding latency. Prior methods’ decoders recompute QKV for all tokens at every denoising step and layer, despite KV states changing little across most steps, especially in shallow layers, leading to substantial redundancy. We make three observations: (1) distant ${\bf MASK}$ tokens primarily act as a length-bias and can be cached block-wise beyond the active prediction window; (2) KV dynamics increase with depth, suggesting that selective refresh starting from deeper layers is sufficient; and (3) the most-attended token exhibits the smallest KV drift, providing a conservative lower bound on cache change for other tokens. Building on these, we propose ${\bf Elastic-Cache}$, a training-free, architecture-agnostic strategy that jointly decides ${when}$ to refresh (via an attention-aware drift test on the most-attended token) and ${where}$ to refresh (via a depth-aware schedule that recomputes from a chosen layer onward while reusing shallow-layer caches and off-window MASK caches). Unlike fixed-period schemes, Elastic-Cache performs adaptive, layer-aware cache updates for diffusion LLMs, reducing redundant computation and accelerating decoding with negligible loss in generation quality. Experiments on LLaDA-Instruct, LLaDA-1.5, and LLaDA-V across mathematical reasoning and code generation tasks demonstrate consistent speedups: $8.7\times$ on GSM8K (256 tokens), and $45.1\times$ on longer sequences, while consistently maintaining higher accuracy than the baseline. Our method achieves significantly higher throughput ($6.8\times$ on GSM8K) than existing confidence-based approaches while preserving generation quality, enabling practical deployment of diffusion LLMs.

[94] TokenTiming: A Dynamic Alignment Method for Universal Speculative Decoding Model Pairs

Sibo Xiao, Jinyuan Fu, Zhongle Xie, Lidan Shou

Main category: cs.CL

TL;DR: TokenTiming enables universal speculative decoding by using Dynamic Time Warping to align draft and target model tokens with mismatched vocabularies, achieving 1.57x speedup without retraining.

Details

Motivation: Current speculative decoding is limited because draft and target models must share the same vocabulary, restricting draft model selection and often requiring training new models from scratch.

Method: Proposes TokenTiming algorithm that re-encodes draft token sequences and uses Dynamic Time Warping (DTW) to build mappings between mismatched vocabularies for probability distribution transfer in speculative sampling.

Result: Achieves 1.57x speedup across various tasks, enabling universal draft model selection without retraining or model modification.

Conclusion: TokenTiming makes speculative decoding a more versatile and practical tool for LLM acceleration by accommodating mismatched vocabularies and working with any off-the-shelf models.

Abstract: Accelerating the inference of large language models (LLMs) has been a critical challenge in generative AI. Speculative decoding (SD) substantially improves LLM inference efficiency. However, its utility is limited by a fundamental constraint: the draft and target models must share the same vocabulary, thus limiting the herd of available draft models and often necessitating the training of a new model from scratch. Inspired by Dynamic Time Warping (DTW), a classic algorithm for aligning time series, we propose the algorithm TokenTiming for universal speculative decoding. It operates by re-encoding the draft token sequence to get a new target token sequence, and then uses DTW to build a mapping to transfer the probability distributions for speculative sampling. Benefiting from this, our method accommodates mismatched vocabularies and works with any off-the-shelf models without retraining and modification. We conduct comprehensive experiments on various tasks, demonstrating 1.57x speedup. This work enables a universal approach for draft model selection, making SD a more versatile and practical tool for LLM acceleration.

[95] Think Parallax: Solving Multi-Hop Problems via Multi-View Knowledge-Graph-Based Retrieval-Augmented Generation

Jinliang Liu, Jiale Bai, Shaoning Zeng

Main category: cs.CL

TL;DR: ParallaxRAG is a multi-view KG-RAG framework that decouples queries and graph triples into specialized attention head spaces to improve multi-hop reasoning and reduce hallucination in LLMs.

Details

Motivation: LLMs struggle with hallucination and multi-hop reasoning, while existing KG-RAG methods rely on flat embeddings and noisy path exploration, lacking principled approaches for multi-stage reasoning.

Method: Symmetrically decouples queries and graph triples into multi-view spaces using attention head specialization, enabling robust retrieval with head diversity constraints and weakly related path filtering to construct cleaner subgraphs.

Result: Competitive retrieval and QA performance on WebQSP and CWQ benchmarks with reduced hallucination and good generalization, using BGE-M3 + Llama3.1-8B setup.

Conclusion: Multi-view head specialization provides a principled direction for knowledge-grounded multi-hop reasoning, offering improved grounding and step-wise reasoning capabilities for LLMs.

Abstract: Large language models (LLMs) excel at language understanding but often hallucinate and struggle with multi-hop reasoning. Knowledge-graph-based retrieval-augmented generation (KG-RAG) offers grounding, yet most methods rely on flat embeddings and noisy path exploration. We propose ParallaxRAG, a framework that symmetrically decouples queries and graph triples into multi-view spaces, enabling a robust retrieval architecture that explicitly enforces head diversity while constraining weakly related paths. Central to our approach is the observation that different attention heads specialize in semantic relations at distinct reasoning stages, contributing to different hops of the reasoning chain. This specialization allows ParallaxRAG to construct cleaner subgraphs and guide LLMs through grounded, step-wise reasoning. Experiments on WebQSP and CWQ, under our unified, reproducible setup (BGE-M3 + Llama3.1-8B), demonstrate competitive retrieval and QA performance, alongside reduced hallucination and good generalization. Our results highlight multi-view head specialization as a principled direction for knowledge-grounded multi-hop reasoning. Our implementation will be released as soon as the paper is accepted.

[96] The Gray Zone of Faithfulness: Taming Ambiguity in Unfaithfulness Detection

Qiang Ding, Lvzhou Luo, Yixuan Cao, Ping Luo

Main category: cs.CL

TL;DR: The paper introduces VeriGray, a new faithfulness benchmark for LLM summarization that addresses annotation ambiguity by adding an “Out-Dependent” category for cases requiring external knowledge verification.

Details

Motivation: Existing faithfulness benchmarks suffer from annotation ambiguity due to ill-defined boundaries of permissible external knowledge. Common sense is often labeled as "faithful" without clear guidelines on acceptable knowledge incorporation, leading to inconsistent annotations.

Method: Proposes a novel faithfulness annotation framework with an intermediate “Out-Dependent” category for cases requiring external knowledge verification. Uses this framework to construct VeriGray, a new unfaithfulness detection benchmark for summarization tasks.

Result: Even state-of-the-art LLMs like GPT-5 exhibit hallucinations (~6% of sentences) in summarization. A substantial proportion (~9% on average) of generated sentences fall into the Out-Dependent category, highlighting the importance of resolving annotation ambiguity. The benchmark poses significant challenges to baseline methods.

Conclusion: The VeriGray benchmark addresses critical annotation ambiguity in faithfulness evaluation and reveals significant room for improvement in LLM summarization faithfulness, particularly in handling cases requiring external knowledge verification.

Abstract: Ensuring that Large Language Models (LLMs) generate summaries faithful to a given source document is essential for real-world applications. While prior research has explored LLM faithfulness, existing benchmarks suffer from annotation ambiguity, primarily due to the ill-defined boundary of permissible external knowledge in generated outputs. For instance, common sense is often incorporated into responses and labeled as “faithful”, yet the acceptable extent of such knowledge remains unspecified, leading to inconsistent annotations. To address this issue, we propose a novel faithfulness annotation framework, which introduces an intermediate category, Out-Dependent, to classify cases where external knowledge is required for verification. Using this framework, we construct VeriGray (Verification with the Gray Zone) – a new unfaithfulness detection benchmark in summarization. Statistics reveal that even SOTA LLMs, such as GPT-5, exhibit hallucinations ($\sim 6%$ of sentences) in summarization tasks. Moreover, a substantial proportion ($\sim 9%$ on average of models) of generated sentences fall into the Out-Dependent category, underscoring the importance of resolving annotation ambiguity in unfaithfulness detection benchmarks. Experiments demonstrate that our benchmark poses significant challenges to multiple baseline methods, indicating considerable room for future improvement.

[97] Cognitive Alignment in Personality Reasoning: Leveraging Prototype Theory for MBTI Inference

Haoyuan Li, Yuanbo Tong, Yuchen Li, Zirui Wang, Chunhou Liu, Jiamou Liu

Main category: cs.CL

TL;DR: ProtoMBTI: A prototype-based LLM framework for MBTI personality recognition that improves accuracy and interpretability by aligning with psychological prototype theory.

Details

Motivation: Traditional hard-label classification for personality recognition obscures the graded, prototype-like nature of human personality judgments. Current approaches don't align well with how humans actually make personality judgments.

Method: 1) Construct balanced corpus via LLM-guided multi-dimensional augmentation (semantic, linguistic, sentiment). 2) LoRA-fine-tune lightweight encoder to learn embeddings and standardize personality prototypes. 3) Inference uses retrieve-reuse-revise-retain cycle: retrieve top-k prototypes, aggregate evidence via prompt-based voting, revise inconsistencies, and retain correct samples to enrich prototype library.

Result: ProtoMBTI improves over baselines on both four MBTI dichotomies and full 16-type task across Kaggle and Pandora benchmarks. Shows robust cross-dataset generalization.

Conclusion: Aligning inference process with psychological prototype reasoning yields gains in accuracy, interpretability, and transfer for text-based personality modeling.

Abstract: Personality recognition from text is typically cast as hard-label classification, which obscures the graded, prototype-like nature of human personality judgments. We present ProtoMBTI, a cognitively aligned framework for MBTI inference that operationalizes prototype theory within an LLM-based pipeline. First, we construct a balanced, quality-controlled corpus via LLM-guided multi-dimensional augmentation (semantic, linguistic, sentiment). Next, we LoRA-fine-tune a lightweight (<=2B) encoder to learn discriminative embeddings and to standardize a bank of personality prototypes. At inference, we retrieve top-k prototypes for a query post and perform a retrieve–reuse–revise–retain cycle: the model aggregates prototype evidence via prompt-based voting, revises when inconsistencies arise, and, upon correct prediction, retains the sample to continually enrich the prototype library. Across Kaggle and Pandora benchmarks, ProtoMBTI improves over baselines on both the four MBTI dichotomies and the full 16-type task, and exhibits robust cross-dataset generalization. Our results indicate that aligning the inference process with psychological prototype reasoning yields gains in accuracy, interpretability, and transfer for text-based personality modeling.

[98] Prompt-R1: Collaborative Automatic Prompting Framework via End-to-end Reinforcement Learning

Wenjin Liu, Haoran Luo, Xueyuan Lin, Haoming Liu, Tiesunlong Shen, Jiapu Wang, Rui Mao, Erik Cambria

Main category: cs.CL

TL;DR: Prompt-R1 is an RL framework using small LLMs to generate prompts for large LLMs, improving performance on complex tasks without needing expert prompting from users.

Details

Motivation: Users struggle to provide accurate prompts for complex problems with LLMs, limiting model performance despite rapid advancements in large language models.

Method: End-to-end reinforcement learning framework where small LLMs generate prompts for large LLMs in multi-turn interactions, with dual-constrained rewards optimizing correctness, quality, and reasoning accuracy.

Result: Significantly outperforms baseline models across multiple public datasets, providing plug-and-play support for both inference and training with various large-scale LLMs.

Conclusion: Prompt-R1 effectively addresses the prompt engineering challenge by enabling small LLMs to collaborate with large LLMs, improving performance on complex reasoning tasks.

Abstract: Recently, advanced large language models (LLMs) have emerged at an increasingly rapid pace. However, when faced with complex problems, most users are often unable to provide accurate and effective prompts to interact with LLMs, thus limiting the performance of LLMs. To address this challenge, we propose Prompt-R1, an end-to-end reinforcement learning framework that uses a small-scale LLM to collaborate with large-scale LLMs, replacing user interaction to solve problems better. This collaboration is cast as a multi-turn prompt interaction, where the small-scale LLM thinks and generates prompts, and the large-scale LLM performs complex reasoning. A dual-constrained reward is designed to optimize for correctness, generation quality, and reasoning accuracy. Prompt-R1 provides a plug-and-play framework that supports both inference and training with various large-scale LLMs. Experiments on multiple public datasets show that Prompt-R1 significantly outperforms baseline models across tasks. Our code is publicly available at https://github.com/QwenQKing/Prompt-R1.

Kaiyuan Zhang, Chenghao Yang, Zhoufutu Wen, Sihang Yuan, Qiuyue Wang, Chaoyi Huang, Guosheng Zhu, He Wang, Huawenyu Lu, Jianing Wen, Jianpeng Jiao, Lishu Luo, Longxiang Liu, Sijin Wu, Xiaolei Zhu, Xuanliang Zhang, Yu Liu, Ge Zhang, Yi Lin, Guang Shi, Chaoyou Fu, Wenhao Huang

Main category: cs.CL

TL;DR: MME-CC is a new multimodal benchmark focusing on vision-centric cognitive behaviors (spatial, geometric, knowledge reasoning) to better assess MLLMs’ cognitive capacity beyond text-heavy evaluations.

Details

Motivation: Existing multimodal benchmarks either overemphasize textual reasoning or fail to systematically capture vision-centric cognitive behaviors, leaving MLLMs' cognitive capacity insufficiently assessed despite rapid scaling of reasoning models.

Method: Introduces MME-CC benchmark organizing 11 reasoning tasks into three categories (spatial, geometric, knowledge-based reasoning), providing fine-grained analyses of MLLMs’ cognitive capacity across these dimensions.

Result: Closed-source models lead overall (Gemini-2.5-Pro: 42.66 vs GLM-4.5V: 30.45), spatial/geometric reasoning remain weak (≤30%), common error patterns identified (orientation mistakes, fragile cross-view identity persistence, poor counterfactual adherence), Chain-of-Thought follows three-stage process with heavy visual extraction reliance.

Conclusion: This work aims to catalyze a shift toward treating cognitive capacity of MLLMs as central to both evaluation and model design, highlighting the need for better vision-centric cognitive assessment.

Abstract: As reasoning models scale rapidly, the essential role of multimodality in human cognition has come into sharp relief, driving a growing need to probe vision-centric cognitive behaviors. Yet, existing multimodal benchmarks either overemphasize textual reasoning or fall short of systematically capturing vision-centric cognitive behaviors, leaving the cognitive capacity of MLLMs insufficiently assessed. To address this limitation, we introduce MME-CC (Multi-Modal Evaluation benchmark of Cognitive Capacity), a vision-grounded benchmark that organizes 11 representative reasoning tasks into three fundamental categories of visual information: spatial, geometric, and knowledge-based reasoning, and provides fine-grained analyses of MLLMs’ cognitive capacity across these dimensions. Based on MME-CC, we conduct extensive experiments over 16 representative MLLMs. Our study reveals that closed-source models currently lead overall (e.g., 42.66 for Gemini-2.5-Pro vs. 30.45 for GLM-4.5V), while spatial and geometric reasoning remain broadly weak (less than or equal to 30%). We further identify common error patterns, including orientation mistakes, fragile cross-view identity persistence, and poor adherence to counterfactual instructions, and observe that Chain-of-Thought typically follows a three-stage process (extract -> reason -> verify) with heavy reliance on visual extraction. We hope this work catalyzes a shift toward treating the cognitive capacity of MLLMs as central to both evaluation and model design.

[100] Can Finetuing LLMs on Small Human Samples Increase Heterogeneity, Alignment, and Belief-Action Coherence?

Steven Wang, Kyle Hunt, Shaojie Tang, Kenneth Joseph

Main category: cs.CL

TL;DR: Fine-tuning LLMs on small human samples improves simulated survey responses but still fails to reproduce regression coefficients needed for inferential analysis.

Details

Motivation: To determine if fine-tuning LLMs on small human survey data (like pilot studies) can address known limitations of LLM-based participant simulation, specifically issues with diversity, subgroup alignment, and belief-action coherence.

Method: Used a behavioral experiment on information disclosure to compare human vs. LLM-generated responses across multiple dimensions: distributional divergence, subgroup alignment, belief-action coherence, and recovery of regression coefficients. Fine-tuned LLMs on small subsets of human survey data.

Result: Fine-tuning on small human samples substantially improved heterogeneity, alignment, and belief-action coherence compared to base models. However, even the best fine-tuned models failed to reproduce the regression coefficients from the original human study.

Conclusion: While fine-tuning improves some aspects of LLM simulation, LLM-generated data remains unsuitable for replacing human participants in formal inferential analyses due to inability to reproduce regression coefficients.

Abstract: There is ongoing debate about whether large language models (LLMs) can serve as substitutes for human participants in survey and experimental research. While recent work in fields such as marketing and psychology has explored the potential of LLM-based simulation, a growing body of evidence cautions against this practice: LLMs often fail to align with real human behavior, exhibiting limited diversity, systematic misalignment for minority subgroups, insufficient within-group variance, and discrepancies between stated beliefs and actions. This study examines an important and distinct question in this domain: whether fine-tuning on a small subset of human survey data, such as that obtainable from a pilot study, can mitigate these issues and yield realistic simulated outcomes. Using a behavioral experiment on information disclosure, we compare human and LLM-generated responses across multiple dimensions, including distributional divergence, subgroup alignment, belief-action coherence, and the recovery of regression coefficients. We find that fine-tuning on small human samples substantially improves heterogeneity, alignment, and belief-action coherence relative to the base model. However, even the best-performing fine-tuned models fail to reproduce the regression coefficients of the original study, suggesting that LLM-generated data remain unsuitable for replacing human participants in formal inferential analyses.

[101] Dual LoRA: Enhancing LoRA with Magnitude and Direction Updates

Yixing Xu, Chao Li, Xuanwu Yin, Spandan Tiwari, Dong Li, Ashish Sirasao, Emad Barsoum

Main category: cs.CL

TL;DR: Dual LoRA improves LoRA performance by separating low-rank matrices into magnitude and direction groups with ReLU and sign functions to better simulate full fine-tuning parameter updates.

Details

Motivation: Standard LoRA often has unsatisfactory performance due to its low-rank assumption, which doesn't properly simulate the parameter updating process of full fine-tuning based on gradient-based optimization algorithms.

Method: Separates low-rank matrices into two groups: magnitude group (controls whether/how far to update parameters) and direction group (decides forward/backward movement). Uses ReLU function for magnitude group and sign function for direction group to better simulate full fine-tuning parameter updates.

Result: Consistently outperforms LoRA and its state-of-the-art variants with the same number of trainable parameters across various NLP tasks including natural language understanding and commonsense reasoning on RoBERTa, DeBERTa, and LLaMA-1/2/3 models.

Conclusion: Dual LoRA effectively improves LoRA performance by incorporating inductive bias that better simulates full fine-tuning parameter updates while maintaining parameter efficiency.

Abstract: Low-rank adaptation (LoRA) is one of the most popular methods among parameter-efficient fine-tuning (PEFT) methods to adapt pre-trained large language models (LLMs) to specific downstream tasks. However, the model trained based on LoRA often has an unsatisfactory performance due to its low-rank assumption. In this paper, we propose a novel method called Dual LoRA to improve the performance by incorporating an inductive bias into the original LoRA. Specifically, we separate low-rank matrices into two groups: the magnitude group to control whether or not and how far we should update a parameter and the direction group to decide whether this parameter should move forward or backward, to better simulate the parameter updating process of the full fine-tuning based on gradient-based optimization algorithms. We show that this can be simply achieved by adding a ReLU function to the magnitude group and a sign function to the direction group. We conduct several experiments over a wide range of NLP tasks, including natural language understanding (NLU) and commonsense reasoning datasets on RoBERTa, DeBERTa, and LLaMA-1/2/3 as baseline models. The results show that we consistently outperform LoRA and its state-of-the-art variants with the same number of trainable parameters.

[102] Do You Feel Comfortable? Detecting Hidden Conversational Escalation in AI Chatbots

Jihyung Park, Saleh Afroogh, Junfeng Jiao

Main category: cs.CL

TL;DR: GAUGE is a logit-based framework for real-time detection of implicit conversational harm in LLMs, focusing on affective state escalation that traditional toxicity filters miss.

Details

Motivation: LLMs are increasingly used as emotional companions, but they can cause implicit harm through emotional reinforcement or affective drift that gradually escalates distress. Traditional toxicity filters fail to detect this subtle harm, and existing guardrails using external classifiers or clinical rubrics lag behind real-time conversational dynamics.

Method: GAUGE (Guarding Affective Utterance Generation Escalation) is a logit-based framework that measures how an LLM’s output probabilistically shifts the affective state of a dialogue in real-time.

Result: The paper proposes a novel framework for detecting hidden conversational escalation, addressing the gap in current safety mechanisms that fail to capture nuanced, real-time affective dynamics.

Conclusion: GAUGE provides a real-time solution for detecting implicit harm in LLM conversations by monitoring probabilistic affective state shifts, offering a more nuanced approach than traditional toxicity filters.

Abstract: Large Language Models (LLM) are increasingly integrated into everyday interactions, serving not only as information assistants but also as emotional companions. Even in the absence of explicit toxicity, repeated emotional reinforcement or affective drift can gradually escalate distress in a form of \textit{implicit harm} that traditional toxicity filters fail to detect. Existing guardrail mechanisms often rely on external classifiers or clinical rubrics that may lag behind the nuanced, real-time dynamics of a developing conversation. To address this gap, we propose GAUGE (Guarding Affective Utterance Generation Escalation), logit-based framework for the real-time detection of hidden conversational escalation. GAUGE measures how an LLM’s output probabilistically shifts the affective state of a dialogue.

[103] Complementary Learning Approach for Text Classification using Large Language Models

Navid Asgari, Benjamin M. Cole

Main category: cs.CL

TL;DR: Proposes a cost-efficient LLM methodology for human-machine teams in quantitative research, using chain-of-thought and few-shot learning to leverage human abductive reasoning and manage LLM weaknesses.

Details

Motivation: To develop a structured approach that integrates human and machine strengths while offsetting their weaknesses in quantitative research, enabling cost-effective use of LLMs while maintaining scholarly rigor.

Method: Uses chain-of-thought and few-shot learning prompting techniques from computer science, extending qualitative co-author team practices to human-machine teams in quantitative research. Allows humans to use abductive reasoning and natural language to interrogate both machine and human outputs.

Result: Demonstrated the methodology by interrogating human-machine rating discrepancies in a sample of 1,934 pharmaceutical alliance press releases (1990-2017), showing how scholars can manage LLM weaknesses with careful, low-cost techniques.

Conclusion: The proposed methodology enables effective human-machine collaboration in quantitative research, allowing scholars to leverage LLMs cost-efficiently while maintaining critical oversight through abductive reasoning and systematic interrogation of discrepancies.

Abstract: In this study, we propose a structured methodology that utilizes large language models (LLMs) in a cost-efficient and parsimonious manner, integrating the strengths of scholars and machines while offsetting their respective weaknesses. Our methodology, facilitated through a chain of thought and few-shot learning prompting from computer science, extends best practices for co-author teams in qualitative research to human-machine teams in quantitative research. This allows humans to utilize abductive reasoning and natural language to interrogate not just what the machine has done but also what the human has done. Our method highlights how scholars can manage inherent weaknesses OF LLMs using careful, low-cost techniques. We demonstrate how to use the methodology to interrogate human-machine rating discrepancies for a sample of 1,934 press releases announcing pharmaceutical alliances (1990-2017).

[104] Understanding Syllogistic Reasoning in LLMs from Formal and Natural Language Perspectives

Aheli Poddar, Saptarshi Sahoo, Sujata Ghosh

Main category: cs.CL

TL;DR: LLMs show varying syllogistic reasoning capabilities across 14 models, with some achieving perfect symbolic performance, raising questions about whether LLMs are developing formal reasoning mechanisms rather than capturing human reasoning nuances.

Details

Motivation: To investigate fundamental reasoning capabilities of LLMs through syllogistic reasoning from both logical and natural language perspectives, and to understand the direction of LLM reasoning research.

Method: Used 14 large language models to evaluate syllogistic reasoning capabilities through both symbolic inferences and natural language understanding tasks.

Result: Syllogistic reasoning is not uniformly emergent across LLMs, but some models achieve perfect symbolic performance, suggesting LLMs may be developing formal reasoning mechanisms rather than capturing human reasoning nuances.

Conclusion: The research raises important questions about whether LLMs are evolving toward formal reasoning systems rather than replicating the nuanced aspects of human reasoning, highlighting the need for deeper investigation into the nature of LLM reasoning capabilities.

Abstract: We study syllogistic reasoning in LLMs from the logical and natural language perspectives. In process, we explore fundamental reasoning capabilities of the LLMs and the direction this research is moving forward. To aid in our studies, we use 14 large language models and investigate their syllogistic reasoning capabilities in terms of symbolic inferences as well as natural language understanding. Even though this reasoning mechanism is not a uniform emergent property across LLMs, the perfect symbolic performances in certain models make us wonder whether LLMs are becoming more and more formal reasoning mechanisms, rather than making explicit the nuances of human reasoning.

[105] Authors Should Label Their Own Documents

Marcus Ma, Cole Johnson, Nolan Bridges, Jackson Trager, Georgios Chochlakis, Shrikanth Narayanan

Main category: cs.CL

TL;DR: Author labeling: writers annotate their own text at creation, outperforming third-party annotation for subjective tasks like sentiment analysis.

Details

Motivation: Third-party annotation is insufficient for capturing egocentric information like sentiment and belief, which can only be approximated by external annotators. There's a need for more accurate labeling of subjective content.

Method: Collaborated with commercial chatbot (20,000+ users) to deploy author labeling system that identifies task-relevant queries, generates on-the-fly labeling questions, and records authors’ answers in real time. Trained online-learning model for product recommendation using author-labeled data.

Result: Model achieved 537% improvement in click-through rate vs industry advertising baseline. Author labeling outperformed three traditional annotation approaches in quality, speed, and cost for sentiment analysis.

Conclusion: Author labeling produces higher quality annotations for egocentric/subjective beliefs than third-party annotation. The approach is faster, cheaper, and more accurate. Authors released a service for research community adoption.

Abstract: Third-party annotation is the status quo for labeling text, but egocentric information such as sentiment and belief can at best only be approximated by a third-person proxy. We introduce author labeling, an annotation technique where the writer of the document itself annotates the data at the moment of creation. We collaborate with a commercial chatbot with over 20,000 users to deploy an author labeling annotation system. This system identifies task-relevant queries, generates on-the-fly labeling questions, and records authors’ answers in real time. We train and deploy an online-learning model architecture for product recommendation with author-labeled data to improve performance. We train our model to minimize the prediction error on questions generated for a set of predetermined subjective beliefs using author-labeled responses. Our model achieves a 537% improvement in click-through rate compared to an industry advertising baseline running concurrently. We then compare the quality and practicality of author labeling to three traditional annotation approaches for sentiment analysis and find author labeling to be higher quality, faster to acquire, and cheaper. These findings reinforce existing literature that annotations, especially for egocentric and subjective beliefs, are significantly higher quality when labeled by the author rather than a third party. To facilitate broader scientific adoption, we release an author labeling service for the research community at https://academic.echollm.io.

[106] From Context to EDUs: Faithful and Structured Context Compression via Elementary Discourse Unit Decomposition

Yiqing Zhou, Yu Lei, Shuzheng Si, Qingyan Sun, Wei Wang, Yifei Wu, Hao Wen, Gang Chen, Fanchao Qi, Maosong Sun

Main category: cs.CL

TL;DR: EDU-based Context Compressor: A novel explicit compression framework that transforms text into structural relation trees of Elementary Discourse Units (EDUs) to preserve both global structure and fine-grained details while reducing computational costs.

Details

Motivation: Managing extensive context is a critical bottleneck for LLMs in applications like long-document QA and autonomous agents, where lengthy inputs cause high computational costs and introduce noise. Existing compression techniques either disrupt local coherence through discrete token removal or rely on implicit latent encoding with positional bias and API incompatibility issues.

Method: 1. LingoEDU transforms linear text into a structural relation tree of Elementary Discourse Units (EDUs) anchored to source indices to eliminate hallucination. 2. A lightweight ranking module selects query-relevant sub-trees for linearization. The approach reformulates context compression as a structure-then-select process.

Result: The method achieves state-of-the-art structural prediction accuracy and significantly outperforms frontier LLMs while reducing costs. It also releases StructBench, a manually annotated dataset of 248 diverse documents for rigorous evaluation. Structure-aware compression substantially enhances performance across downstream tasks from long-context tasks to complex Deep Search scenarios.

Conclusion: The EDU-based Context Compressor provides an effective explicit compression framework that preserves both global structure and fine-grained details, addressing limitations of existing methods while improving performance and reducing computational costs for LLM applications.

Abstract: Managing extensive context remains a critical bottleneck for Large Language Models (LLMs), particularly in applications like long-document question answering and autonomous agents where lengthy inputs incur high computational costs and introduce noise. Existing compression techniques often disrupt local coherence through discrete token removal or rely on implicit latent encoding that suffers from positional bias and incompatibility with closed-source APIs. To address these limitations, we introduce the EDU-based Context Compressor, a novel explicit compression framework designed to preserve both global structure and fine-grained details. Our approach reformulates context compression as a structure-then-select process. First, our LingoEDU transforms linear text into a structural relation tree of Elementary Discourse Units (EDUs) which are anchored strictly to source indices to eliminate hallucination. Second, a lightweight ranking module selects query-relevant sub-trees for linearization. To rigorously evaluate structural understanding, we release StructBench, a manually annotated dataset of 248 diverse documents. Empirical results demonstrate that our method achieves state-of-the-art structural prediction accuracy and significantly outperforms frontier LLMs while reducing costs. Furthermore, our structure-aware compression substantially enhances performance across downstream tasks ranging from long-context tasks to complex Deep Search scenarios.

[107] Rakuten Data Release: A Large-Scale and Long-Term Reviews Corpus for Hotel Domain

Yuki Nakayama, Koki Hikichi, Yun Ching Liu, Yu Hirate

Main category: cs.CL

TL;DR: Large-scale Rakuten Travel Reviews corpus with 7.29M reviews from 2009-2024, featuring rich metadata and aspect ratings, with analysis of data drift patterns.

Details

Motivation: To create a comprehensive, large-scale dataset of travel reviews for research purposes, enabling analysis of customer feedback trends, accommodation performance, and temporal data patterns over a 16-year period.

Method: Collection of 7.29 million customer reviews from Rakuten Travel spanning 2009-2024, with detailed metadata including review text, responses, reviewer IDs, accommodation details, and six aspect ratings plus overall scores. Statistical analysis used to examine data drift patterns between 2019-2024.

Result: Created a massive corpus with rich structured data including: review text, accommodation responses, reviewer IDs, dates, accommodation/plan IDs, room details, purpose, group information, and multi-aspect ratings. Statistical insights reveal data drift patterns between 2019-2024.

Conclusion: The Rakuten Travel Reviews corpus provides a valuable resource for travel industry research, sentiment analysis, recommendation systems, and temporal trend analysis, with identified data drift patterns offering insights into changing customer behavior and platform dynamics.

Abstract: This paper presents a large-scale corpus of Rakuten Travel Reviews. Our collection contains 7.29 million customer reviews for 16 years, ranging from 2009 to 2024. Each record in the dataset contains the review text, its response from an accommodation, an anonymized reviewer ID, review date, accommodation ID, plan ID, plan title, room type, room name, purpose, accompanying group, and user ratings from six aspect categories, as well as an overall score. We present statistical information about our corpus and provide insights into factors driving data drift between 2019 and 2024 using statistical approaches.

[108] MDToC: Metacognitive Dynamic Tree of Concepts for Boosting Mathematical Problem-Solving of Large Language Models

Tung Duong Ta, Tim Oates, Thien Van Luong, Huan Vu, Tien Cuong Nguyen

Main category: cs.CL

TL;DR: MDToC (Metacognitive Dynamic Tree of Concepts) is a three-phase approach that improves mathematical reasoning in LLMs by constructing concept trees, verifying calculations, and using majority voting, outperforming existing methods like GoT and ToT across multiple benchmarks.

Details

Motivation: Despite advances in mathematical reasoning, LLMs still struggle with calculation verification using existing prompting techniques. There's a need for better methods to improve accuracy in mathematical problem-solving without relying on hand-engineered hints.

Method: Three-phase approach: 1) Constructs a concept tree to break down problems, 2) Develops accuracy-verified calculations for each concept, 3) Employs majority voting to evaluate competing solutions. This metacognitive framework enables dynamic verification of calculations throughout the reasoning process.

Result: MDToC achieves state-of-the-art performance: GPT-4-Turbo scores 58.1% on CHAMP, 86.6% on MATH, and 85% on Game-of-24. Outperforms GoT by 5%, 5.4%, and 4% respectively, and improves up to 7.6% over ToT and 6.2% over GoT across all backbone models.

Conclusion: MDToC establishes metacognitive calculation verification as a promising direction for enhanced mathematical reasoning in LLMs, consistently surpassing existing prompting methods without requiring hand-engineered hints.

Abstract: Despite advances in mathematical reasoning capabilities, Large Language Models (LLMs) still struggle with calculation verification when using established prompting techniques. We present MDToC (Metacognitive Dynamic Tree of Concepts), a three-phase approach that constructs a concept tree, develops accuracy-verified calculations for each concept, and employs majority voting to evaluate competing solutions. Evaluations across CHAMP, MATH, and Game-of-24 benchmarks demonstrate our MDToC’s effectiveness, with GPT-4-Turbo achieving 58.1% on CHAMP, 86.6% on MATH, and 85% on Game-of-24 - outperforming GoT by 5%, 5.4%, and 4% on all these tasks, respectively, without hand-engineered hints. MDToC consistently surpasses existing prompting methods across all backbone models, yielding improvements of up to 7.6% over ToT and 6.2% over GoT, establishing metacognitive calculation verification as a promising direction for enhanced mathematical reasoning.

[109] Step-DeepResearch Technical Report

Chen Hu, Haikuo Du, Heng Wang, Lin Lin, Mingrui Chen, Peng Liu, Ruihang Miao, Tianchi Yue, Wang You, Wei Ji, Wei Yuan, Wenjin Deng, Xiaojian Yuan, Xiaoyun Zhang, Xiangyu Liu, Xikai Liu, Yanming Xu, Yicheng Cao, Yifei Zhang, Yongyao Wang, Yubo Shu, Yurong Zhang, Yuxiang Zhang, Zheng Gong, Zhichao Chang, Binyan Li, Dan Ma, Furong Jia, Hongyuan Wang, Jiayu Liu, Jing Bai, Junlan Liu, Manjiao Liu, Na Wang, Qiuping Wu, Qinxin Du, Shiwei Li, Wen Sun, Yifeng Gong, Yonglin Chen, Yuling Zhao, Yuxuan Lin, Ziqi Ren, Zixuan Wang, Aihu Zhang, Brian Li, Buyun Ma, Kang An, Li Xie, Mingliang Li, Pan Li, Shidong Yang, Xi Chen, Xiaojia Liu, Yuchu Luo, Yuan Song, YuanHao Ding, Yuanwei Liang, Zexi Li, Zhaoning Zhang, Zixin Zhang, Binxing Jiao, Daxin Jiang, Jiansheng Chen, Jing Li, Xiangyu Zhang, Yibo Zhu

Main category: cs.CL

TL;DR: Step-DeepResearch is a cost-effective 32B parameter agent for deep research tasks that outperforms comparable models and rivals SOTA closed-source models through refined training techniques and a new Chinese evaluation benchmark.

Details

Motivation: Existing academic benchmarks like BrowseComp fail to meet real-world demands for open-ended research, which requires robust skills in intent recognition, long-horizon decision-making, and cross-source verification. There's also an evaluation gap in the Chinese domain for deep research scenarios.

Method: Introduces Step-DeepResearch agent with: 1) Data Synthesis Strategy Based on Atomic Capabilities to reinforce planning and report writing, 2) Progressive training path from agentic mid-training to SFT and RL, 3) Checklist-style Judger for improved robustness, and 4) ADR-Bench for Chinese domain evaluation.

Result: Step-DeepResearch (32B) scores 61.4% on Scale AI Research Rubrics. On ADR-Bench, it significantly outperforms comparable models and rivals SOTA closed-source models like OpenAI and Gemini DeepResearch.

Conclusion: Refined training enables medium-sized models to achieve expert-level capabilities at industry-leading cost-efficiency, proving that well-designed training approaches can make smaller models competitive with larger closed-source alternatives.

Abstract: As LLMs shift toward autonomous agents, Deep Research has emerged as a pivotal metric. However, existing academic benchmarks like BrowseComp often fail to meet real-world demands for open-ended research, which requires robust skills in intent recognition, long-horizon decision-making, and cross-source verification. To address this, we introduce Step-DeepResearch, a cost-effective, end-to-end agent. We propose a Data Synthesis Strategy Based on Atomic Capabilities to reinforce planning and report writing, combined with a progressive training path from agentic mid-training to SFT and RL. Enhanced by a Checklist-style Judger, this approach significantly improves robustness. Furthermore, to bridge the evaluation gap in the Chinese domain, we establish ADR-Bench for realistic deep research scenarios. Experimental results show that Step-DeepResearch (32B) scores 61.4% on Scale AI Research Rubrics. On ADR-Bench, it significantly outperforms comparable models and rivals SOTA closed-source models like OpenAI and Gemini DeepResearch. These findings prove that refined training enables medium-sized models to achieve expert-level capabilities at industry-leading cost-efficiency.

[110] Gamayun’s Path to Multilingual Mastery: Cost-Efficient Training of a 1.5B-Parameter LLM

Alexander Podolskiy, Semen Molokov, Timofey Gerasin, Maksim Titov, Alexey Rukhovich, Artem Khrapov, Kirill Morozov, Evgeny Tetin, Constantine Korikov, Pavel Efimov, Polina Lazukova, Yuliya Skripkar, Nikita Okhotnikov, Irina Piontkovskaya, Meng Xiaojun, Zou Xueyi, Zhang Zhenhe

Main category: cs.CL

TL;DR: Gamayun is a 1.5B-parameter multilingual LLM trained on 2.5T tokens with a novel two-stage pre-training strategy that achieves state-of-the-art results for small models, especially in Russian, outperforming larger models with much bigger training budgets.

Details

Motivation: Addresses the lack of research on small non-English-centric LLMs for resource-constrained environments, focusing on creating efficient multilingual models that perform well across languages despite limited training budgets.

Method: Two-stage pre-training strategy: 1) balanced multilingual training for cross-lingual alignment, 2) high-quality English enrichment to transfer performance gains across languages. Supports 12 languages with special focus on Russian.

Result: Outperforms LLaMA3.2-1B (9T tokens) on all benchmarks, surpasses Qwen2.5-1.5B (18T tokens) on English and multilingual tasks, matches/exceeds Qwen3 (36T tokens) on most non-STEM tasks, achieves SOTA in Russian including MERA benchmark among 1-2B parameter models.

Conclusion: Gamayun demonstrates that efficient multilingual LLMs can be created with smart training strategies and limited budgets, achieving competitive performance against models trained on significantly more tokens, particularly excelling in Russian language tasks.

Abstract: We present Gamayun, a 1.5B-parameter multilingual language model trained entirely from scratch on 2.5T tokens. Designed for efficiency and deployment in resource-constrained environments, Gamayun addresses the lack of research on small non-English-centric LLMs by adopting a novel two-stage pre-training strategy: balanced multilingual training for cross-lingual alignment, followed by high-quality English enrichment to transfer performance gains across languages. Our model supports 12 languages, with special focus on Russian. Despite a significantly smaller training budget than comparable models, Gamayun outperforms LLaMA3.2-1B (9T tokens) on all considered benchmarks, and surpasses Qwen2.5-1.5B (18T tokens) on a wide range of English and multilingual tasks. It matches or exceeds Qwen3 (36T tokens) on most tasks outside advanced STEM, achieving state-of-the-art results in Russian, including the MERA benchmark, among the models of comparable size (1-2B parameters).

cs.CV

[111] Characterizing Motion Encoding in Video Diffusion Timesteps

Vatsal Baherwani, Yixuan Ren, Abhinav Shrivastava

Main category: cs.CV

TL;DR: Researchers systematically characterize how motion is encoded across timesteps in text-to-video diffusion models, identifying early motion-dominant and later appearance-dominant regimes, and use this insight to simplify motion customization.

Details

Motivation: While practitioners use the heuristic that early timesteps shape motion/layout and later ones refine appearance, this behavior hasn't been systematically characterized. Understanding motion encoding across timesteps is crucial for better video synthesis control.

Method: Proxy motion encoding by measuring trade-off between appearance editing and motion preservation when injecting new conditions over specific timestep ranges. Conduct large-scale quantitative study to factor motion from appearance along denoising trajectory.

Result: Consistently identify early motion-dominant regime and later appearance-dominant regime across diverse architectures, establishing operational motion-appearance boundary in timestep space.

Conclusion: The analysis transforms heuristic into spatiotemporal disentanglement principle. Timestep-constrained approach simplifies motion customization by restricting training/inference to motion-dominant regime, achieving strong motion transfer without extra modules.

Abstract: Text-to-video diffusion models synthesize temporal motion and spatial appearance through iterative denoising, yet how motion is encoded across timesteps remains poorly understood. Practitioners often exploit the empirical heuristic that early timesteps mainly shape motion and layout while later ones refine appearance, but this behavior has not been systematically characterized. In this work, we proxy motion encoding in video diffusion timesteps by the trade-off between appearance editing and motion preservation induced when injecting new conditions over specified timestep ranges, and characterize this proxy through a large-scale quantitative study. This protocol allows us to factor motion from appearance by quantitatively mapping how they compete along the denoising trajectory. Across diverse architectures, we consistently identify an early, motion-dominant regime and a later, appearance-dominant regime, yielding an operational motion-appearance boundary in timestep space. Building on this characterization, we simplify current one-shot motion customization paradigm by restricting training and inference to the motion-dominant regime, achieving strong motion transfer without auxiliary debiasing modules or specialized objectives. Our analysis turns a widely used heuristic into a spatiotemporal disentanglement principle, and our timestep-constrained recipe can serve as ready integration into existing motion transfer and editing methods.

[112] Real-Time American Sign Language Recognition Using 3D Convolutional Neural Networks and LSTM: Architecture, Training, and Deployment

Dawnena Key

Main category: cs.CV

TL;DR: Real-time ASL recognition system using 3D CNN-LSTM hybrid architecture achieves high accuracy (F1: 0.71-0.99) on multiple datasets, deployed on AWS and edge devices for accessibility.

Details

Motivation: Address communication barriers for over 70 million deaf and hard-of-hearing individuals worldwide by creating a real-time ASL recognition system that can process webcam video streams.

Method: Hybrid deep learning architecture combining 3D CNNs to capture spatial-temporal features from video frames with LSTM layers to model sequential dependencies in sign language gestures. Trained on WLASL dataset (2,000 words), ASL-LEX lexical database (~2,700 signs), and 100 expert-annotated ASL signs.

Result: Achieves F1-scores ranging from 0.71 to 0.99 across sign classes. System deployed on AWS infrastructure with edge deployment capability on OAK-D cameras for real-time inference.

Conclusion: The hybrid 3D CNN-LSTM architecture effectively recognizes word-level ASL signs in real-time, demonstrating practical deployment potential for accessibility applications with high accuracy across diverse sign datasets.

Abstract: This paper presents a real-time American Sign Language (ASL) recognition system utilizing a hybrid deep learning architecture combining 3D Convolutional Neural Networks (3D CNN) with Long Short-Term Memory (LSTM) networks. The system processes webcam video streams to recognize word-level ASL signs, addressing communication barriers for over 70 million deaf and hard-of-hearing individuals worldwide. Our architecture leverages 3D convolutions to capture spatial-temporal features from video frames, followed by LSTM layers that model sequential dependencies inherent in sign language gestures. Trained on the WLASL dataset (2,000 common words), ASL-LEX lexical database (~2,700 signs), and a curated set of 100 expert-annotated ASL signs, the system achieves F1-scores ranging from 0.71 to 0.99 across sign classes. The model is deployed on AWS infrastructure with edge deployment capability on OAK-D cameras for real-time inference. We discuss the architecture design, training methodology, evaluation metrics, and deployment considerations for practical accessibility applications.

[113] Towards Signboard-Oriented Visual Question Answering: ViSignVQA Dataset, Method and Benchmark

Hieu Minh Nguyen, Tam Le-Thanh Dang, Kiet Van Nguyen

Main category: cs.CV

TL;DR: ViSignVQA: First large-scale Vietnamese signboard VQA dataset with 10,762 images and 25,573 QA pairs, showing OCR integration boosts performance by up to 209% F1-score.

Details

Motivation: Signboard text understanding in natural scenes is crucial for real-world VQA applications but remains underexplored, especially for low-resource languages like Vietnamese. There's a need for domain-specific resources that capture linguistic, cultural, and visual characteristics of Vietnamese signboards.

Method: Created ViSignVQA dataset with diverse Vietnamese signboard images and QA pairs. Adapted state-of-the-art VQA models (BLIP-2, LaTr, PreSTU, SaL) by integrating Vietnamese OCR (SwinTextSpotter) and Vietnamese pretrained language model (ViT5). Also proposed multi-agent VQA framework combining perception and reasoning agents with GPT-4 using majority voting.

Result: OCR-enhanced context significantly improves performance with F1-score improvements up to 209% when OCR text is appended to questions. Multi-agent framework achieved 75.98% accuracy via majority voting. First large-scale multimodal dataset for Vietnamese signboard understanding.

Conclusion: Domain-specific resources are crucial for enhancing text-based VQA in low-resource languages. ViSignVQA serves as benchmark capturing real-world scene text characteristics and supports development/evaluation of OCR-integrated VQA models in Vietnamese.

Abstract: Understanding signboard text in natural scenes is essential for real-world applications of Visual Question Answering (VQA), yet remains underexplored, particularly in low-resource languages. We introduce ViSignVQA, the first large-scale Vietnamese dataset designed for signboard-oriented VQA, which comprises 10,762 images and 25,573 question-answer pairs. The dataset captures the diverse linguistic, cultural, and visual characteristics of Vietnamese signboards, including bilingual text, informal phrasing, and visual elements such as color and layout. To benchmark this task, we adapted state-of-the-art VQA models (e.g., BLIP-2, LaTr, PreSTU, and SaL) by integrating a Vietnamese OCR model (SwinTextSpotter) and a Vietnamese pretrained language model (ViT5). The experimental results highlight the significant role of the OCR-enhanced context, with F1-score improvements of up to 209% when the OCR text is appended to questions. Additionally, we propose a multi-agent VQA framework combining perception and reasoning agents with GPT-4, achieving 75.98% accuracy via majority voting. Our study presents the first large-scale multimodal dataset for Vietnamese signboard understanding. This underscores the importance of domain-specific resources in enhancing text-based VQA for low-resource languages. ViSignVQA serves as a benchmark capturing real-world scene text characteristics and supporting the development and evaluation of OCR-integrated VQA models in Vietnamese.

[114] Enhancing Medical Data Analysis through AI-Enhanced Locally Linear Embedding: Applications in Medical Point Location and Imagery

Hassan Khalid, Muhammad Mahad Khaliq, Muhammad Jawad Bashir

Main category: cs.CV

TL;DR: AI-enhanced Locally Linear Embedding (LLE) improves medical billing and transcription accuracy by processing high-dimensional healthcare data more efficiently.

Details

Motivation: To leverage AI advancements in healthcare to enhance medical billing and transcription processes, reducing human error and streamlining operations for better patient care documentation and financial transactions.

Method: Integration of AI with Locally Linear Embedding (LLE) to create a tailored model for handling high-dimensional medical data, with comprehensive mathematical modeling and real-world experimentation.

Result: Significant improvement in data processing accuracy and operational efficiency in medical billing systems and transcription services.

Conclusion: AI-enhanced LLE shows strong potential for medical data analysis and provides a foundation for broader healthcare applications.

Abstract: The rapid evolution of Artificial intelligence in healthcare has opened avenues for enhancing various processes, including medical billing and transcription. This paper introduces an innovative approach by integrating AI with Locally Linear Embedding (LLE) to revolutionize the handling of high-dimensional medical data. This AI-enhanced LLE model is specifically tailored to improve the accuracy and efficiency of medical billing systems and transcription services. By automating these processes, the model aims to reduce human error and streamline operations, thereby facilitating faster and more accurate patient care documentation and financial transactions. This paper provides a comprehensive mathematical model of AI-enhanced LLE, demonstrating its application in real-world healthcare scenarios through a series of experiments. The results indicate a significant improvement in data processing accuracy and operational efficiency. This study not only underscores the potential of AI-enhanced LLE in medical data analysis but also sets a foundation for future research into broader healthcare applications.

[115] Unbiased Visual Reasoning with Controlled Visual Inputs

Zhaonan Li, Shijie Lu, Fei Wang, Jacob Dineen, Xiao Ye, Zhikun Xu, Siyi Liu, Young Min Cho, Bangzheng Li, Daniel Chang, Kenny Nguyen, Qizheng Yang, Muhao Chen, Ben Zhou

Main category: cs.CV

TL;DR: VISTA is a modular framework that separates perception from reasoning in vision-language models to reduce reliance on spurious correlations, using a frozen VLM for perception queries and a text-only LLM for reasoning, trained with reinforcement learning on minimal data.

Details

Motivation: End-to-end VLMs often exploit spurious correlations instead of causal visual evidence when answering questions, and this problem worsens with fine-tuning. There's a need for more robust visual reasoning that doesn't rely on shortcuts.

Method: VISTA decouples perception from reasoning using an information bottleneck. A frozen VLM sensor provides short, objective perception queries, while a text-only LLM reasoner decomposes questions, plans queries, and aggregates visual facts in natural language. The framework is trained with reinforcement learning (GRPO) using only 641 curated multi-step questions.

Result: VISTA significantly improves robustness to spurious correlations on SpuriVerse (+16.29% with Qwen-2.5-VL-7B and +6.77% with Llama-3.2-Vision-11B), remains competitive on MMVP and SeedBench, transfers across unseen VLM sensors, and can recognize/recover from VLM perception failures. Human analysis shows more neutral, less spurious, and more visually-grounded reasoning.

Conclusion: The modular VISTA framework effectively reduces shortcut learning in VLMs by separating perception from reasoning, enabling more robust visual question answering with minimal training data and good transferability across different vision models.

Abstract: End-to-end Vision-language Models (VLMs) often answer visual questions by exploiting spurious correlations instead of causal visual evidence, and can become more shortcut-prone when fine-tuned. We introduce VISTA (Visual-Information Separation for Text-based Analysis), a modular framework that decouples perception from reasoning via an explicit information bottleneck. A frozen VLM sensor is restricted to short, objective perception queries, while a text-only LLM reasoner decomposes each question, plans queries, and aggregates visual facts in natural language. This controlled interface defines a reward-aligned environment for training unbiased visual reasoning with reinforcement learning. Instantiated with Qwen2.5-VL and Llama3.2-Vision sensors, and trained with GRPO from only 641 curated multi-step questions, VISTA significantly improves robustness to real-world spurious correlations on SpuriVerse (+16.29% with Qwen-2.5-VL-7B and +6.77% with Llama-3.2-Vision-11B), while remaining competitive on MMVP and a balanced SeedBench subset. VISTA transfers robustly across unseen VLM sensors and is able to recognize and recover from VLM perception failures. Human analysis further shows that VISTA’s reasoning traces are more neutral, less reliant on spurious attributes, and more explicitly grounded in visual evidence than end-to-end VLM baselines.

Antara Titikhsha, Divyanshu Tak

Main category: cs.CV

TL;DR: SAMM2D: A dual-encoder framework for intracranial aneurysm detection that achieves 32% improvement over clinical baseline, with surprising finding that data augmentation degrades performance when using strong pretrained backbones.

Details

Motivation: Aneurysm detection is challenging due to subtle morphology, class imbalance, and scarce annotated data. Current approaches often rely on data augmentation, but this may not be optimal with modern pretrained models.

Method: SAMM2D dual-encoder framework using ImageNet-pretrained backbone. Comprehensive ablation study across six augmentation regimes to test the assumption that “more augmentation is always better” in low-data medical settings.

Result: Achieved AUC of 0.686 (32% improvement over baseline). Unaugmented baseline outperformed all augmented variants by 1.75-2.23 percentage points (p<0.01). With threshold calibration, reached 95% sensitivity surpassing radiologist performance. Grad-CAM shows 85% of true positives attend to relevant vascular regions.

Conclusion: ImageNet-pretrained features already capture robust invariances, making additional augmentations redundant and disruptive. Future medical imaging workflows may benefit more from strong pretraining than complex augmentation pipelines. Model translates to significant cost savings in screening applications.

Abstract: Effective aneurysm detection is essential to avert life-threatening hemorrhages, but it remains challenging due to the subtle morphology of the aneurysm, pronounced class imbalance, and the scarcity of annotated data. We introduce SAMM2D, a dual-encoder framework that achieves an AUC of 0.686 on the RSNA intracranial aneurysm dataset; an improvement of 32% over the clinical baseline. In a comprehensive ablation across six augmentation regimes, we made a striking discovery: any form of data augmentation degraded performance when coupled with a strong pretrained backbone. Our unaugmented baseline model outperformed all augmented variants by 1.75–2.23 percentage points (p < 0.01), overturning the assumption that “more augmentation is always better” in low-data medical settings. We hypothesize that ImageNet-pretrained features already capture robust invariances, rendering additional augmentations both redundant and disruptive to the learned feature manifold. By calibrating the decision threshold, SAMM2D reaches 95% sensitivity, surpassing average radiologist performance, and translates to a projected $13.9M in savings per 1,000 patients in screening applications. Grad-CAM visualizations confirm that 85% of true positives attend to relevant vascular regions (62% IoU with expert annotations), demonstrating the model’s clinically meaningful focus. Our results suggest that future medical imaging workflows could benefit more from strong pretraining than from increasingly complex augmentation pipelines.

[117] Bridging Your Imagination with Audio-Video Generation via a Unified Director

Jiaxu Zhang, Tianshu Hu, Yuan Zhang, Zenan Li, Linjie Luo, Guosheng Lin, Xin Chen

Main category: cs.CV

TL;DR: UniMAGE is a unified director model that combines script drafting and key-shot design in a single framework using Mixture-of-Transformers architecture and a novel “first interleaving, then disentangling” training paradigm.

Details

Motivation: Current AI video creation systems treat script drafting (using LLMs) and key-shot design (using image generation models) as separate tasks. The authors argue these should be unified since logical reasoning and imaginative thinking are both essential qualities of a film director, enabling non-experts to create long-context, multi-shot films.

Method: Uses Mixture-of-Transformers architecture to unify text and image generation. Introduces a “first interleaving, then disentangling” training paradigm: 1) Interleaved Concept Learning using interleaved text-image data for deeper script understanding, and 2) Disentangled Expert Learning that decouples script writing from keyframe generation for greater storytelling flexibility.

Result: UniMAGE achieves state-of-the-art performance among open-source models, generating logically coherent video scripts and visually consistent keyframe images.

Conclusion: The proposed unified director model successfully bridges user prompts with well-structured scripts, enabling non-experts to produce high-quality, long-context, multi-shot films by leveraging existing audio-video generation models.

Abstract: Existing AI-driven video creation systems typically treat script drafting and key-shot design as two disjoint tasks: the former relies on large language models, while the latter depends on image generation models. We argue that these two tasks should be unified within a single framework, as logical reasoning and imaginative thinking are both fundamental qualities of a film director. In this work, we propose UniMAGE, a unified director model that bridges user prompts with well-structured scripts, thereby empowering non-experts to produce long-context, multi-shot films by leveraging existing audio-video generation models. To achieve this, we employ the Mixture-of-Transformers architecture that unifies text and image generation. To further enhance narrative logic and keyframe consistency, we introduce a ``first interleaving, then disentangling’’ training paradigm. Specifically, we first perform Interleaved Concept Learning, which utilizes interleaved text-image data to foster the model’s deeper understanding and imaginative interpretation of scripts. We then conduct Disentangled Expert Learning, which decouples script writing from keyframe generation, enabling greater flexibility and creativity in storytelling. Extensive experiments demonstrate that UniMAGE achieves state-of-the-art performance among open-source models, generating logically coherent video scripts and visually consistent keyframe images.

[118] HookMIL: Revisiting Context Modeling in Multiple Instance Learning for Computational Pathology

Xitong Ling, Minxi Ouyang, Xiaoxiao Li, Jiawen Li, Ying Chen, Yuxuan Sun, Xinrui Chen, Tian Guan, Xiaoping Liu, Yonghong He

Main category: cs.CV

TL;DR: HookMIL: A context-aware, computationally efficient Multiple Instance Learning framework for pathology that uses learnable hook tokens for structured contextual aggregation with multimodal initialization and linear complexity attention.

Details

Motivation: Traditional MIL approaches lose crucial contextual information in whole-slide images, while transformer-based variants suffer from quadratic complexity and redundant computations. There's a need for a framework that preserves context while being computationally efficient.

Method: Proposes HookMIL with learnable hook tokens that can be initialized from: (1) key-patch visual features, (2) text embeddings from vision-language pathology models, (3) spatially grounded features from spatial transcriptomics-vision models. Uses bidirectional attention with linear complexity, Hook Diversity Loss to encourage specialization, and hook-to-hook communication mechanism to refine contextual interactions.

Result: Extensive experiments on four public pathology datasets demonstrate state-of-the-art performance with improved computational efficiency and interpretability.

Conclusion: HookMIL provides an effective solution for weakly supervised analysis of whole-slide images by balancing context preservation with computational efficiency through innovative hook token mechanisms and multimodal initialization strategies.

Abstract: Multiple Instance Learning (MIL) has enabled weakly supervised analysis of whole-slide images (WSIs) in computational pathology. However, traditional MIL approaches often lose crucial contextual information, while transformer-based variants, though more expressive, suffer from quadratic complexity and redundant computations. To address these limitations, we propose HookMIL, a context-aware and computationally efficient MIL framework that leverages compact, learnable hook tokens for structured contextual aggregation. These tokens can be initialized from (i) key-patch visual features, (ii) text embeddings from vision-language pathology models, and (iii) spatially grounded features from spatial transcriptomics-vision models. This multimodal initialization enables Hook Tokens to incorporate rich textual and spatial priors, accelerating convergence and enhancing representation quality. During training, Hook tokens interact with instances through bidirectional attention with linear complexity. To further promote specialization, we introduce a Hook Diversity Loss that encourages each token to focus on distinct histopathological patterns. Additionally, a hook-to-hook communication mechanism refines contextual interactions while minimizing redundancy. Extensive experiments on four public pathology datasets demonstrate that HookMIL achieves state-of-the-art performance, with improved computational efficiency and interpretability. Codes are available at https://github.com/lingxitong/HookMIL.

[119] RealX3D: A Physically-Degraded 3D Benchmark for Multi-view Visual Restoration and Reconstruction

Shuhong Liu, Chenyu Bao, Ziteng Cui, Yun Liu, Xuangeng Chu, Lin Gu, Marcos V. Conde, Ryo Umagami, Tomohiro Hashimoto, Zijian Hu, Tianhan Xu, Yuan Gan, Yusuke Kurose, Tatsuya Harada

Main category: cs.CV

TL;DR: RealX3D is a real-capture benchmark for evaluating multi-view visual restoration and 3D reconstruction under diverse physical degradations, featuring pixel-aligned LQ/GT views, high-resolution captures, RAW images, and dense laser scans.

Details

Motivation: Current multi-view reconstruction pipelines are fragile in real-world challenging environments with physical corruptions, but existing benchmarks lack real-capture data with diverse physical degradations and aligned ground truth.

Method: Created RealX3D benchmark with four corruption families (illumination, scattering, occlusion, blurring) captured at multiple severity levels using unified acquisition protocol. Includes high-resolution capture, RAW images, dense laser scans, world-scale meshes, and metric depth with pixel-aligned LQ/GT views.

Result: Benchmarking shows substantial degradation in reconstruction quality under physical corruptions for both optimization-based and feed-forward methods, highlighting the fragility of current multi-view pipelines in real-world challenging environments.

Conclusion: RealX3D provides a comprehensive real-capture benchmark that reveals significant weaknesses in current multi-view reconstruction methods when faced with physical degradations, emphasizing the need for more robust approaches for real-world applications.

Abstract: We introduce RealX3D, a real-capture benchmark for multi-view visual restoration and 3D reconstruction under diverse physical degradations. RealX3D groups corruptions into four families, including illumination, scattering, occlusion, and blurring, and captures each at multiple severity levels using a unified acquisition protocol that yields pixel-aligned LQ/GT views. Each scene includes high-resolution capture, RAW images, and dense laser scans, from which we derive world-scale meshes and metric depth. Benchmarking a broad range of optimization-based and feed-forward methods shows substantial degradation in reconstruction quality under physical corruptions, underscoring the fragility of current multi-view pipelines in real-world challenging environments.

[120] Tiny-YOLOSAM: Fast Hybrid Image Segmentation

Kenneth Xu, Songhan Wu

Main category: cs.CV

TL;DR: Tiny-YOLOSAM: A hybrid pipeline combining YOLO detection with TinySAM for faster full-scene segmentation, achieving 4.7x speedup while improving coverage metrics.

Details

Motivation: SAM and TinySAM are computationally expensive for latency-critical applications, and TinySAM's "segment-everything" mode remains slow despite being distilled. There's a need for more efficient full-scene segmentation approaches.

Method: 1) Replicated TinySAM baseline on COCO val2017; 2) Proposed Tiny-YOLOSAM hybrid pipeline using YOLOv12 detector to generate box prompts for salient objects; 3) Supplemented uncovered regions with sparse point prompts only where YOLO-guided masks lack coverage.

Result: On COCO val2017: Class-agnostic AR improved from 16.4% to 77.1%, mIoU from 19.2% to 67.8%. End-to-end runtime reduced from 49.20s/image to 10.39s/image (4.7x speedup) on Apple M1 Pro CPU.

Conclusion: Detector-guided prompting combined with targeted sparse sampling is an effective alternative to dense “segment-everything” prompting for practical full-scene segmentation, offering significant speed improvements while maintaining good coverage.

Abstract: The Segment Anything Model (SAM) enables promptable, high-quality segmentation but is often too computationally expensive for latency-critical settings. TinySAM is a lightweight, distilled SAM variant that preserves strong zero-shot mask quality, yet its “segment-everything” mode still requires hundreds of prompts and remains slow in practice. We first replicate TinySAM on COCO val2017 using official checkpoints, matching the reported AP within 0.03%, establishing a reliable experimental baseline. Building on this, we propose Tiny-YOLOSAM, a fast hybrid pipeline that uses a recent YOLO detector (YOLOv12) to generate box prompts for TinySAM on salient foreground objects, and supplements uncovered regions with sparse point prompts sampled only where YOLO-guided masks provide no coverage. On COCO val2017, the hybrid system substantially improves class-agnostic coverage (AR from 16.4% to 77.1%, mIoU from 19.2% to 67.8%) while reducing end-to-end runtime from 49.20s/image to 10.39s/image (4.7x) on an Apple M1 Pro CPU. These results suggest detector-guided prompting combined with targeted sparse sampling as an effective alternative to dense “segment-everything” prompting for practical full-scene segmentation.

[121] Quadrant Segmentation VLM with Few-Shot Adaptation and OCT Learning-based Explainability Methods for Diabetic Retinopathy

Shivum Telang

Main category: cs.CV

TL;DR: A multimodal explainable AI model for diabetic retinopathy diagnosis that combines fundus and OCT images with natural language lesion descriptions, mimicking ophthalmologist reasoning through few-shot learning and Grad-CAM visualizations.

Details

Motivation: Current DR diagnostic models lack interpretability for clinicians - they either require impractical manual lesion annotation or provide insufficient reasoning explanations. Existing models are one-dimensional (single modality) and don't explain classification reasoning in ways physicians can understand, limiting clinical adoption.

Method: Multimodal explainability model using Vision-Language Model (VLM) with few-shot learning. Analyzes lesion distributions within retinal quadrants from fundus images, generates paired Grad-CAM heatmaps showing individual neuron weights across both OCT and fundus images, and provides natural language lesion identification.

Result: Developed a comprehensive diagnostic tool using 3,000 fundus images and 1,000 OCT images that visually highlights regions contributing to DR severity classification through heatmaps while providing natural language explanations of individual DR lesions.

Conclusion: The proposed multimodal explainability model addresses key limitations in current DR diagnostics by providing physician-interpretable reasoning, enabling diverse applications in screening, treatment, and research while improving patient outcomes through practical clinical adoption.

Abstract: Diabetic Retinopathy (DR) is a leading cause of vision loss worldwide, requiring early detection to preserve sight. Limited access to physicians often leaves DR undiagnosed. To address this, AI models utilize lesion segmentation for interpretability; however, manually annotating lesions is impractical for clinicians. Physicians require a model that explains the reasoning for classifications rather than just highlighting lesion locations. Furthermore, current models are one-dimensional, relying on a single imaging modality for explainability and achieving limited effectiveness. In contrast, a quantitative-detection system that identifies individual DR lesions in natural language would overcome these limitations, enabling diverse applications in screening, treatment, and research settings. To address this issue, this paper presents a novel multimodal explainability model utilizing a VLM with few-shot learning, which mimics an ophthalmologist’s reasoning by analyzing lesion distributions within retinal quadrants for fundus images. The model generates paired Grad-CAM heatmaps, showcasing individual neuron weights across both OCT and fundus images, which visually highlight the regions contributing to DR severity classification. Using a dataset of 3,000 fundus images and 1,000 OCT images, this innovative methodology addresses key limitations in current DR diagnostics, offering a practical and comprehensive tool for improving patient outcomes.

[122] Error Analyses of Auto-Regressive Video Diffusion Models: A Unified Framework

Jing Wang, Fengzhuo Zhang, Xiaoli Li, Vincent Y. F. Tan, Tianyu Pang, Chao Du, Aixin Sun, Zhuoran Yang

Main category: cs.CV

TL;DR: AR-VDMs suffer from history forgetting and temporal degradation. Meta-ARVDM provides theoretical analysis showing history forgetting relates to conditional mutual information, and temporal degradation to cumulative error. New evaluation protocol introduced, with empirical correlation found between both issues.

Details

Motivation: Auto-Regressive Video Diffusion Models (AR-VDMs) generate realistic videos but suffer from history forgetting (losing track of previous content) and temporal degradation (quality deterioration over time). Existing empirical understanding lacks rigorous theoretical analysis, and standard metrics fail to capture these phenomena effectively.

Method: Introduces Meta-ARVDM, a unified analytical framework that studies both errors through the shared autoregressive structure of AR-VDMs. Theoretically characterizes history forgetting using conditional mutual information between generated output and preceding frames. Quantifies temporal degradation through cumulative sum of per-step errors. Proposes new evaluation protocol using “needle-in-a-haystack” tasks in closed-ended environments (DMLab and Minecraft).

Result: Proves that incorporating more past frames monotonically alleviates history forgetting, justifying existing empirical practices. Shows standard metrics fail to capture history forgetting effects. Enables prediction of temporal degradation for different schedulers without video rollout. Uncovers strong empirical correlation between history forgetting and temporal degradation, a previously unreported connection.

Conclusion: Meta-ARVDM provides rigorous theoretical foundation for understanding AR-VDM limitations. The framework offers analytical tools to quantify both history forgetting and temporal degradation, explains why existing metrics are insufficient, and reveals the connection between these two phenomena. The work advances theoretical understanding of autoregressive video generation models.

Abstract: Auto-Regressive Video Diffusion Models (AR-VDMs) have shown strong capabilities in generating long, photorealistic videos, but suffer from two key limitations: (i) history forgetting, where the model loses track of previously generated content, and (ii) temporal degradation, where frame quality deteriorates over time. Yet a rigorous theoretical analysis of these phenomena is lacking, and existing empirical understanding remains insufficiently grounded. In this paper, we introduce Meta-ARVDM, a unified analytical framework that studies both errors through the shared autoregressive structure of AR-VDMs. We show that history forgetting is characterized by the conditional mutual information between the generated output and preceding frames, conditioned on inputs, and prove that incorporating more past frames monotonically alleviates history forgetting, thereby theoretically justifying a common belief in existing works. Moreover, our theory reveals that standard metrics fail to capture this effect, motivating a new evaluation protocol based on a ``needle-in-a-haystack’’ task in closed-ended environments (DMLab and Minecraft). We further show that temporal degradation can be quantified by the cumulative sum of per-step errors, enabling prediction of degradation for different schedulers without video rollout. Finally, our evaluation uncovers a strong empirical correlation between history forgetting and temporal degradation, a connection not previously reported.

[123] TCFormer: A 5M-Parameter Transformer with Density-Guided Aggregation for Weakly-Supervised Crowd Counting

Qiang Guo, Rubo Zhang, Bingbing Zhang, Junjie Liu, Jianqing Liu

Main category: cs.CV

TL;DR: TCFormer is a tiny, ultra-lightweight (5M params) weakly-supervised transformer for crowd counting that achieves competitive performance using only image-level counts, making it suitable for edge devices.

Details

Motivation: Traditional crowd counting methods require labor-intensive point-level annotations and computationally heavy backbones, limiting scalability and deployment in resource-constrained environments.

Method: Uses efficient vision transformer as feature extractor, Learnable Density-Weighted Averaging module to dynamically re-weight local tokens based on predicted density, and density-level classification loss to discretize crowd density into grades for regularization.

Result: Extensive experiments on ShanghaiTech A/B, UCF-QNRF, and NWPU datasets show superior trade-off between parameter efficiency and counting accuracy, demonstrating competitive performance despite minimal parameters.

Conclusion: TCFormer provides an effective solution for edge device crowd counting by combining weak supervision, transformer architecture, and novel density-aware mechanisms to achieve high accuracy with minimal computational requirements.

Abstract: Crowd counting typically relies on labor-intensive point-level annotations and computationally intensive backbones, restricting its scalability and deployment in resource-constrained environments. To address these challenges, this paper proposes the TCFormer, a tiny, ultra-lightweight, weakly-supervised transformer-based crowd counting framework with only 5 million parameters that achieves competitive performance. Firstly, a powerful yet efficient vision transformer is adopted as the feature extractor, the global context-aware capabilities of which provides semantic meaningful crowd features with a minimal memory footprint. Secondly, to compensate for the lack of spatial supervision, we design a feature aggregation mechanism termed the Learnable Density-Weighted Averaging module. This module dynamically re-weights local tokens according to predicted density scores, enabling the network to adaptively modulate regional features based on their specific density characteristics without the need for additional annotations. Furthermore, this paper introduces a density-level classification loss, which discretizes crowd density into distinct grades, thereby regularizing the training process and enhancing the model’s classification power across varying levels of crowd density. Therefore, although TCformer is trained under a weakly-supervised paradigm utilizing only image-level global counts, the joint optimization of count and density-level losses enables the framework to achieve high estimation accuracy. Extensive experiments on four benchmarks including ShanghaiTech A/B, UCF-QNRF, and NWPU datasets demonstrate that our approach strikes a superior trade-off between parameter efficiency and counting accuracy and can be a good solution for crowd counting tasks in edge devices.

[124] ClassWise-CRF: Category-Specific Fusion for Enhanced Semantic Segmentation of Remote Sensing Imagery

Qinfeng Zhu, Yunxi Jiang, Lei Fan

Main category: cs.CV

TL;DR: ClassWise-CRF: A two-stage category-specific fusion architecture that selects expert networks and fuses their predictions using adaptive weighting based on CRF principles, improving semantic segmentation performance on remote sensing datasets.

Details

Motivation: To improve semantic segmentation performance in remote sensing images by leveraging multiple networks' strengths for different categories, addressing the limitation that individual networks may excel in specific categories but not all.

Method: Two-stage approach: 1) Greedy algorithm selects expert networks from candidate pool based on category performance; 2) Adaptive fusion using CRF-inspired framework with exponential weighting of confidence scores based on validation metrics, plus spatial consistency optimization with unary and pairwise potentials.

Result: Significant mIoU improvements: LoveDA - 1.00% (val) & 0.68% (test); Vaihingen - 0.87% (val) & 0.91% (test). Demonstrated effectiveness across 8 segmentation networks on two remote sensing datasets.

Conclusion: ClassWise-CRF effectively improves semantic segmentation performance by category-specific network fusion, demonstrating strong generality and effectiveness for remote sensing image analysis.

Abstract: We propose a result-level category-specific fusion architecture called ClassWise-CRF. This architecture employs a two-stage process: first, it selects expert networks that perform well in specific categories from a pool of candidate networks using a greedy algorithm; second, it integrates the segmentation predictions of these selected networks by adaptively weighting their contributions based on their segmentation performance in each category. Inspired by Conditional Random Field (CRF), the ClassWise-CRF architecture treats the segmentation predictions from multiple networks as confidence vector fields. It leverages segmentation metrics (such as Intersection over Union) from the validation set as priors and employs an exponential weighting strategy to fuse the category-specific confidence scores predicted by each network. This fusion method dynamically adjusts the weights of each network for different categories, achieving category-specific optimization. Building on this, the architecture further optimizes the fused results using unary and pairwise potentials in CRF to ensure spatial consistency and boundary accuracy. To validate the effectiveness of ClassWise-CRF, we conducted experiments on two remote sensing datasets, LoveDA and Vaihingen, using eight classic and advanced semantic segmentation networks. The results show that the ClassWise-CRF architecture significantly improves segmentation performance: on the LoveDA dataset, the mean Intersection over Union (mIoU) metric increased by 1.00% on the validation set and by 0.68% on the test set; on the Vaihingen dataset, the mIoU improved by 0.87% on the validation set and by 0.91% on the test set. These results fully demonstrate the effectiveness and generality of the ClassWise-CRF architecture in semantic segmentation of remote sensing images. The full code is available at https://github.com/zhuqinfeng1999/ClassWise-CRF.

[125] A CNN-Based Malaria Diagnosis from Blood Cell Images with SHAP and LIME Explainability

Md. Ismiel Hossen Abir, Awolad Hossain

Main category: cs.CV

TL;DR: Deep learning CNN achieves 96% accuracy for malaria diagnosis from blood cell images, outperforming established architectures and using Explainable AI for interpretability.

Details

Motivation: Traditional malaria diagnosis methods like microscopic blood smear analysis have low sensitivity, depend on expert judgment, and lack resources in remote settings, requiring automated, accurate alternatives.

Method: Proposes a custom Convolutional Neural Network (CNN) to classify blood cell images as parasitized or uninfected, with comparison to ResNet50, VGG16, MobileNetV2, and DenseNet121, plus Explainable AI techniques (SHAP, LIME, Saliency Maps) for interpretability.

Result: Custom CNN achieves 96% accuracy with precision and recall scores exceeding 0.95 for both classes, demonstrating superior performance for automated malaria diagnosis.

Conclusion: Deep learning can provide quick, accurate, and understandable malaria diagnosis, especially valuable in resource-limited areas where traditional methods are impractical.

Abstract: Malaria remains a prevalent health concern in regions with tropical and subtropical climates. The cause of malaria is the Plasmodium parasite, which is transmitted through the bites of infected female Anopheles mosquitoes. Traditional diagnostic methods, such as microscopic blood smear analysis, are low in sensitivity, depend on expert judgment, and require resources that may not be available in remote settings. To overcome these limitations, this study proposes a deep learning-based approach utilizing a custom Convolutional Neural Network (CNN) to automatically classify blood cell images as parasitized or uninfected. The model achieves an accuracy of 96%, with precision and recall scores exceeding 0.95 for both classes. This study also compares the custom CNN with established deep learning architectures, including ResNet50, VGG16, MobileNetV2, and DenseNet121. To enhance model interpretability, Explainable AI techniques such as SHAP, LIME, and Saliency Maps are applied. The proposed system shows how deep learning can provide quick, accurate and understandable malaria diagnosis, especially in areas with limited resources.

[126] D-FCGS: Feedforward Compression of Dynamic Gaussian Splatting for Free-Viewpoint Videos

Wenkang Zhang, Yan Zhao, Qiang Wang, Zhixin Xu, Li Song, Zhengxue Cheng

Main category: cs.CV

TL;DR: D-FCGS is a feedforward compression framework for Dynamic Gaussian Splatting that achieves over 17x compression while maintaining visual quality across viewpoints, using standardized GoF structure and dual prior-aware entropy modeling.

Details

Motivation: Existing dynamic 3D Gaussian Splatting methods couple reconstruction with optimization-dependent compression and customized motion formats, limiting generalization and standardization for efficient Free-Viewpoint Video compression.

Method: Proposes D-FCGS with: (1) standardized Group-of-Frames structure with I-P coding using sparse control points to extract motion tensors; (2) dual prior-aware entropy model combining hyperprior and spatial-temporal priors; (3) control-point-guided motion compensation and refinement network.

Result: Matches rate-distortion performance of optimization-based methods, achieves over 17 times compression compared to baseline while preserving visual quality across viewpoints, and generalizes across diverse scenes in zero-shot fashion.

Conclusion: D-FCGS advances feedforward compression of dynamic 3DGS, facilitating scalable Free-Viewpoint Video transmission and storage for immersive applications by providing standardized, generalizable compression framework.

Abstract: Free-Viewpoint Video (FVV) enables immersive 3D experiences, but efficient compression of dynamic 3D representation remains a major challenge. Existing dynamic 3D Gaussian Splatting methods couple reconstruction with optimization-dependent compression and customized motion formats, limiting generalization and standardization. To address this, we propose D-FCGS, a novel Feedforward Compression framework for Dynamic Gaussian Splatting. Key innovations include: (1) a standardized Group-of-Frames (GoF) structure with I-P coding, leveraging sparse control points to extract inter-frame motion tensors; (2) a dual prior-aware entropy model that fuses hyperprior and spatial-temporal priors for accurate rate estimation; (3) a control-point-guided motion compensation mechanism and refinement network to enhance view-consistent fidelity. Trained on Gaussian frames derived from multi-view videos, D-FCGS generalizes across diverse scenes in a zero-shot fashion. Experiments show that it matches the rate-distortion performance of optimization-based methods, achieving over 17 times compression compared to the baseline while preserving visual quality across viewpoints. This work advances feedforward compression of dynamic 3DGS, facilitating scalable FVV transmission and storage for immersive applications.

[127] Real-Time In-Cabin Driver Behavior Recognition on Low-Cost Edge Hardware

Vesal Ahsani, Babak Hossein Khalaj

Main category: cs.CV

TL;DR: A real-time driver monitoring system for low-cost edge devices that detects 17 distraction/drowsiness behaviors using a compact vision model with confounder-aware labels and temporal decision logic.

Details

Motivation: Need for in-cabin driver monitoring systems that can recognize distraction and drowsiness behaviors with low latency while meeting strict compute, power, and cost constraints for practical deployment on affordable edge hardware.

Method: Three-part pipeline: (1) compact per-frame vision model optimized for edge deployment, (2) confounder-aware label design to reduce visually similar false positives, and (3) temporal decision head that triggers alerts only when predictions are both confident and sustained over time.

Result: Achieves real-time performance: ~16 FPS on Raspberry Pi 5 with INT8 inference (per-frame latency <60 ms) and ~25 FPS on Coral Edge TPU. System covers 17 behavior classes and was validated on diverse datasets and real in-vehicle tests.

Conclusion: Demonstrates practical real-time driver monitoring on low-cost hardware, enabling reliable human-state perception as upstream input for human-centered vehicle intelligence and emerging agentic vehicle concepts.

Abstract: In-cabin Driver Monitoring Systems (DMS) must recognize distraction- and drowsiness-related behaviors with low latency under strict constraints on compute, power, and cost. We present a single-camera in-cabin driver behavior recognition system designed for deployment on two low-cost edge platforms: Raspberry Pi 5 (CPU-only) and Google Coral Edge TPU. The proposed pipeline combines (i) a compact per-frame vision model, (ii) a confounder-aware label design to reduce visually similar false positives, and (iii) a temporal decision head that triggers alerts only when predictions are both confident and sustained. The system covers 17 behavior classes, including multiple phone-use modes, eating/drinking, smoking, reaching behind, gaze/attention shifts, passenger interaction, grooming, control-panel interaction, yawning, and eyes-closed sleep. Training and evaluation use licensed datasets spanning diverse drivers, vehicles, and lighting conditions (details in Section 6), and we further validate runtime behavior in real in-vehicle tests. The optimized deployments achieve about 16 FPS on Raspberry Pi 5 with INT8 inference (per-frame latency under 60 ms) and about 25 FPS on Coral Edge TPU, enabling real-time monitoring and stable alert generation on inexpensive hardware. Finally, we discuss how reliable in-cabin human-state perception can serve as an upstream input for human-centered vehicle intelligence, including emerging agentic vehicle concepts.

[128] Signal-SGN++: Topology-Enhanced Time-Frequency Spiking Graph Network for Skeleton-Based Action Recognition

Naichuan Zheng, Xiahai Lun, Weiyi Li, Yuchen Du

Main category: cs.CV

TL;DR: Signal-SGN++ is a spiking graph neural network framework for action recognition that combines topology-aware attention with time-frequency spiking dynamics to achieve energy-efficient yet accurate skeletal motion analysis.

Details

Motivation: GCNs are effective for skeletal action recognition but energy-intensive due to dense floating-point computations. SNNs offer energy efficiency through event-driven sparse activation but struggle to capture coupled temporal-frequency and topological dependencies of human motion. There's a need to bridge this gap.

Method: Proposes Signal-SGN++ with: 1) 1D Spiking Graph Convolution (1D-SGC) and Frequency Spiking Convolution (FSC) backbone for joint spatiotemporal and spectral feature extraction; 2) Topology-Shift Self-Attention (TSSA) mechanism for adaptive attention routing across learned skeletal topologies; 3) Multi-Scale Wavelet Transform Fusion (MWTF) branch with Topology-Aware Time-Frequency Fusion (TATF) unit for multi-resolution temporal-frequency representations with structural priors.

Result: Achieves superior accuracy-efficiency trade-offs on large-scale benchmarks, outperforms existing SNN-based methods, and achieves competitive results against state-of-the-art GCNs with substantially reduced energy consumption.

Conclusion: Signal-SGN++ successfully bridges the gap between energy-efficient SNNs and topology-aware GCNs for action recognition, demonstrating that spiking neural networks can achieve competitive performance while maintaining energy efficiency through innovative topology-aware time-frequency fusion mechanisms.

Abstract: Graph Convolutional Networks (GCNs) demonstrate strong capability in modeling skeletal topology for action recognition, yet their dense floating-point computations incur high energy costs. Spiking Neural Networks (SNNs), characterized by event-driven and sparse activation, offer energy efficiency but remain limited in capturing coupled temporal-frequency and topological dependencies of human motion. To bridge this gap, this article proposes Signal-SGN++, a topology-aware spiking graph framework that integrates structural adaptivity with time-frequency spiking dynamics. The network employs a backbone composed of 1D Spiking Graph Convolution (1D-SGC) and Frequency Spiking Convolution (FSC) for joint spatiotemporal and spectral feature extraction. Within this backbone, a Topology-Shift Self-Attention (TSSA) mechanism is embedded to adaptively route attention across learned skeletal topologies, enhancing graph-level sensitivity without increasing computational complexity. Moreover, an auxiliary Multi-Scale Wavelet Transform Fusion (MWTF) branch decomposes spiking features into multi-resolution temporal-frequency representations, wherein a Topology-Aware Time-Frequency Fusion (TATF) unit incorporates structural priors to preserve topology-consistent spectral fusion. Comprehensive experiments on large-scale benchmarks validate that Signal-SGN++ achieves superior accuracy-efficiency trade-offs, outperforming existing SNN-based methods and achieving competitive results against state-of-the-art GCNs under substantially reduced energy consumption.

[129] VLM-PAR: A Vision Language Model for Pedestrian Attribute Recognition

Abdellah Zakaria Sellam, Salah Eddine Bekhouche, Fadi Dornaika, Cosimo Distante, Abdenour Hadid

Main category: cs.CV

TL;DR: VLM-PAR is a vision-language framework using frozen SigLIP 2 multilingual encoders that achieves state-of-the-art performance on pedestrian attribute recognition by addressing class imbalance and domain shifts through cross-modal refinement.

Details

Motivation: Pedestrian Attribute Recognition (PAR) faces three major challenges: severe class imbalance (some attributes are much more common than others), intricate attribute co-dependencies, and domain shifts between different datasets or environments. These issues hinder accurate attribute prediction from pedestrian imagery.

Method: VLM-PAR is a modular vision-language framework built on frozen SigLIP 2 multilingual encoders. The key innovation is aligning image and prompt embeddings through a compact cross-attention fusion mechanism that refines visual features. This approach leverages large-scale vision-language pretraining while adding targeted cross-modal refinement.

Result: The method achieves significant accuracy improvement on the highly imbalanced PA100K benchmark, setting a new state-of-the-art performance. It also delivers significant gains in mean accuracy across PETA and Market-1501 benchmarks, demonstrating strong generalization capabilities.

Conclusion: Integrating large-scale vision-language pretraining with targeted cross-modal refinement effectively overcomes imbalance and generalization challenges in PAR. The frozen encoder approach combined with compact fusion modules provides an efficient and powerful solution for attribute recognition tasks.

Abstract: Pedestrian Attribute Recognition (PAR) involves predicting fine-grained attributes such as clothing color, gender, and accessories from pedestrian imagery, yet is hindered by severe class imbalance, intricate attribute co-dependencies, and domain shifts. We introduce VLM-PAR, a modular vision-language framework built on frozen SigLIP 2 multilingual encoders. By first aligning image and prompt embeddings via refining visual features through a compact cross-attention fusion, VLM-PAR achieves significant accuracy improvement on the highly imbalanced PA100K benchmark, setting a new state-of-the-art performance, while also delivering significant gains in mean accuracy across PETA and Market-1501 benchmarks. These results underscore the efficacy of integrating large-scale vision-language pretraining with targeted cross-modal refinement to overcome imbalance and generalization challenges in PAR.

[130] On Extending Semantic Abstraction for Efficient Search of Hidden Objects

Tasha Pais, Nikhilesh Belulkar

Main category: cs.CV

TL;DR: Semantic Abstraction uses 2D VLMs’ relevancy maps as “abstract object” representations to learn 3D localization and completion for hidden objects (occluded objects), enabling efficient unstructured search using historical placement data.

Details

Motivation: To enable household robots to efficiently find lost or hidden objects that are occluded and cannot be directly identified by vision-language models, saving time and effort in search tasks.

Method: Treats 2D VLM relevancy activations as “abstract object” representations, uses this framework for 3D localization and completion of hidden objects, and employs historical placement data to make unstructured search more efficient.

Result: The model can accurately identify the complete 3D location of a hidden object on the first try significantly faster than naive random search approaches.

Conclusion: Semantic Abstraction extensions provide household robots with improved skills for finding hidden objects efficiently, leveraging abstract object representations and historical data to optimize search processes.

Abstract: Semantic Abstraction’s key observation is that 2D VLMs’ relevancy activations roughly correspond to their confidence of whether and where an object is in the scene. Thus, relevancy maps are treated as “abstract object” representations. We use this framework for learning 3D localization and completion for the exclusive domain of hidden objects, defined as objects that cannot be directly identified by a VLM because they are at least partially occluded. This process of localizing hidden objects is a form of unstructured search that can be performed more efficiently using historical data of where an object is frequently placed. Our model can accurately identify the complete 3D location of a hidden object on the first try significantly faster than a naive random search. These extensions to semantic abstraction hope to provide household robots with the skills necessary to save time and effort when looking for lost objects.

[131] VideoScaffold: Elastic-Scale Visual Hierarchies for Streaming Video Understanding in MLLMs

Naishan Zheng, Jie Huang, Qingpei Guo, Feng Zhao

Main category: cs.CV

TL;DR: VideoScaffold is a dynamic representation framework for streaming video understanding that adaptively adjusts event granularity and preserves visual semantics through elastic event segmentation and hierarchical consolidation.

Details

Motivation: Existing static video understanding methods (sparse sampling, frame compression, clustering) are optimized for offline settings and produce fragmented or over-compressed outputs when applied to continuous video streams, failing to handle temporal coherence and redundancy effectively.

Method: VideoScaffold introduces two key components: 1) Elastic-Scale Event Segmentation (EES) - performs prediction-guided segmentation to dynamically refine event boundaries, and 2) Hierarchical Event Consolidation (HEC) - progressively aggregates semantically related segments into multi-level abstractions, enabling smooth transition from fine-grained frame understanding to abstract event reasoning.

Result: Extensive experiments across both offline and streaming video understanding benchmarks demonstrate that VideoScaffold achieves state-of-the-art performance, with the framework being modular and plug-and-play for extending existing image-based MLLMs to continuous video comprehension.

Conclusion: VideoScaffold provides an effective dynamic representation framework for streaming video understanding that addresses the limitations of static approaches by adaptively adjusting event granularity while preserving visual semantics, enabling better temporal coherence and reducing redundancy in continuous video streams.

Abstract: Understanding long videos with multimodal large language models (MLLMs) remains challenging due to the heavy redundancy across frames and the need for temporally coherent representations. Existing static strategies, such as sparse sampling, frame compression, and clustering, are optimized for offline settings and often produce fragmented or over-compressed outputs when applied to continuous video streams. We present VideoScaffold, a dynamic representation framework designed for streaming video understanding. It adaptively adjusts event granularity according to video duration while preserving fine-grained visual semantics. VideoScaffold introduces two key components: Elastic-Scale Event Segmentation (EES), which performs prediction-guided segmentation to dynamically refine event boundaries, and Hierarchical Event Consolidation (HEC), which progressively aggregates semantically related segments into multi-level abstractions. Working in concert, EES and HEC enable VideoScaffold to transition smoothly from fine-grained frame understanding to abstract event reasoning as the video stream unfolds. Extensive experiments across both offline and streaming video understanding benchmarks demonstrate that VideoScaffold achieves state-of-the-art performance. The framework is modular and plug-and-play, seamlessly extending existing image-based MLLMs to continuous video comprehension. The code is available at https://github.com/zheng980629/VideoScaffold.

[132] Improved cystic hygroma detection from prenatal imaging using ultrasound-specific self-supervised representation learning

Youssef Megahed, Robin Ducharme, Inok Lee, Inbal Willner, Olivier X. Miguel, Kevin Dick, Adrian D. C. Chan, Mark Walker, Steven Hawken

Main category: cs.CV

TL;DR: Self-supervised pretraining (USF-MAE) on 370K unlabeled ultrasound images improves automated detection of cystic hygroma in first-trimester scans, outperforming supervised DenseNet-169 baseline with 96% accuracy.

Details

Motivation: Cystic hygroma is a high-risk prenatal finding with poor outcomes. Supervised deep learning is limited by small labeled datasets. Self-supervised pretraining could enable more accurate, scalable automated detection.

Method: Fine-tuned Ultrasound Self-Supervised Foundation Model with Masked Autoencoding (USF-MAE) pretrained on 370K+ unlabeled ultrasound images for binary classification of normal vs cystic hygroma cases. Compared to DenseNet-169 baseline using same dataset, preprocessing, and 4-fold cross-validation.

Result: USF-MAE outperformed DenseNet-169 on all metrics: accuracy (0.96 vs 0.93), sensitivity (0.94 vs 0.92), specificity (0.98 vs 0.94), ROC-AUC (0.98 vs 0.94). Score-CAM visualizations showed clinically relevant attention to fetal neck regions. Improvements were statistically significant (p=0.0057).

Conclusion: Ultrasound-specific self-supervised pretraining enables accurate, robust deep learning detection of cystic hygroma, overcoming limitations of small labeled datasets and supporting scalable early screening programs.

Abstract: Cystic hygroma is a high-risk prenatal ultrasound finding that portends high rates of chromosomal abnormalities, structural malformations, and adverse pregnancy outcomes. Automated detection can increase reproducibility and support scalable early screening programs, but supervised deep learning methods are limited by small labelled datasets. This study assesses whether ultrasound-specific self-supervised pretraining can facilitate accurate, robust deep learning detection of cystic hygroma in first-trimester ultrasound images. We fine-tuned the Ultrasound Self-Supervised Foundation Model with Masked Autoencoding (USF-MAE), pretrained on over 370,000 unlabelled ultrasound images, for binary classification of normal controls and cystic hygroma cases used in this study. Performance was evaluated on the same curated ultrasound dataset, preprocessing pipeline, and 4-fold cross-validation protocol as for the DenseNet-169 baseline, using accuracy, sensitivity, specificity, and the area under the receiver operating characteristic curve (ROC-AUC). Model interpretability was analyzed qualitatively using Score-CAM visualizations. USF-MAE outperformed the DenseNet-169 baseline on all evaluation metrics. The proposed model yielded a mean accuracy of 0.96, sensitivity of 0.94, specificity of 0.98, and ROC-AUC of 0.98 compared to 0.93, 0.92, 0.94, and 0.94 for the DenseNet-169 baseline, respectively. Qualitative Score-CAM visualizations of model predictions demonstrated clinical relevance by highlighting expected regions in the fetal neck for both positive and negative cases. Paired statistical analysis using a Wilcoxon signed-rank test confirmed that performance improvements achieved by USF-MAE were statistically significant (p = 0.0057).

[133] KAN-FPN-Stem:A KAN-Enhanced Feature Pyramid Stem for Boosting ViT-based Pose Estimation

HaoNan Tang

Main category: cs.CV

TL;DR: KAN-FPN-Stem improves ViT-based pose estimation by replacing standard convolution with KAN-based layer in FPN fusion, achieving +2.0 AP gain on COCO.

Details

Motivation: Current ViT front-ends for dense prediction (like ViTPose) have simplistic patchification that causes irreversible information loss and struggles with multi-scale variations. The performance bottleneck lies in feature fusion quality rather than feature refinement.

Method: Retains classic FPN “upsample-and-add” fusion but replaces the terminal 3x3 linear smoothing convolution with a KAN-based convolutional layer that adaptively learns and rectifies fusion artifacts using superior non-linear modeling capabilities.

Result: Achieves significant performance boost of up to +2.0 AP over lightweight ViTPose-S baseline on COCO dataset. Demonstrates plug-and-play effectiveness.

Conclusion: Reveals that ViT front-end bottleneck is in feature fusion quality, not feature refinement. Provides effective solution via KAN operator introduction for better multi-scale fusion.

Abstract: Vision Transformers (ViT) have demonstrated significant promise in dense prediction tasks such as pose estimation. However, their performance is frequently constrained by the overly simplistic front-end designs employed in models like ViTPose. This naive patchification mechanism struggles to effectively handle multi-scale variations and results in irreversible information loss during the initial feature extraction phase. To overcome this limitation, we introduce a novel KAN-enhanced FPN-Stem architecture. Through rigorous ablation studies, we first identified that the true bottleneck for performance improvement lies not in plug-and-play attention modules (e.g., CBAM), but in the post-fusion non-linear smoothing step within the FPN. Guided by this insight, our core innovation is to retain the classic “upsample-and-add” fusion stream of the FPN, but replace its terminal, standard linear 3x3 smoothing convolution with a powerful KAN-based convolutional layer. Leveraging its superior non-linear modeling capabilities, this KAN-based layer adaptively learns and rectifies the “artifacts” generated during the multi-scale fusion process. Extensive experiments on the COCO dataset demonstrate that our KAN-FPN-Stem achieves a significant performance boost of up to +2.0 AP over the lightweight ViTPose-S baseline. This work not only delivers a plug-and-play, high-performance module but, more importantly, reveals that: the performance bottleneck in ViT front-end often lies not in ‘feature refinement’ (Attention), but in the quality of ‘feature fusion’ (Fusion). Furthermore, it provides an effective path to address this bottleneck through the introduction of the KAN operator.

[134] Plug In, Grade Right: Psychology-Inspired AGIQA

Zhicheng Liao, Baoliang Chen, Hanwei Zhu, Lingyu Zhu, Shiqi Wang, Weisi Lin

Main category: cs.CV

TL;DR: Proposes AGQG module using Arithmetic Graded Response Model to address semantic drift in AGIQA by modeling image quality as ability to meet graded difficulty levels.

Details

Motivation: Existing AGIQA models suffer from "semantic drift" where image embeddings show inconsistent similarities to different quality grade descriptions, undermining reliability of text-image shared-space learning.

Method: Proposes Arithmetic GRM-based Quality Grading (AGQG) module with two branches: one estimates image ability, the other constructs multiple difficulty levels in arithmetic manner to ensure monotonicity and unimodal quality distribution.

Result: AGQG module shows plug-and-play advantage, consistently improves performance when integrated into various SOTA AGIQA frameworks, and generalizes effectively to both natural and screen content image quality assessment.

Conclusion: AGQG addresses semantic drift in AGIQA through psychometric-inspired graded response modeling, offering interpretable quality distributions and potential as key component in future IQA models.

Abstract: Existing AGIQA models typically estimate image quality by measuring and aggregating the similarities between image embeddings and text embeddings derived from multi-grade quality descriptions. Although effective, we observe that such similarity distributions across grades usually exhibit multimodal patterns. For instance, an image embedding may show high similarity to both “excellent” and “poor” grade descriptions while deviating from the “good” one. We refer to this phenomenon as “semantic drift”, where semantic inconsistencies between text embeddings and their intended descriptions undermine the reliability of text-image shared-space learning. To mitigate this issue, we draw inspiration from psychometrics and propose an improved Graded Response Model (GRM) for AGIQA. The GRM is a classical assessment model that categorizes a subject’s ability across grades using test items with various difficulty levels. This paradigm aligns remarkably well with human quality rating, where image quality can be interpreted as an image’s ability to meet various quality grades. Building on this philosophy, we design a two-branch quality grading module: one branch estimates image ability while the other constructs multiple difficulty levels. To ensure monotonicity in difficulty levels, we further model difficulty generation in an arithmetic manner, which inherently enforces a unimodal and interpretable quality distribution. Our Arithmetic GRM based Quality Grading (AGQG) module enjoys a plug-and-play advantage, consistently improving performance when integrated into various state-of-the-art AGIQA frameworks. Moreover, it also generalizes effectively to both natural and screen content image quality assessment, revealing its potential as a key component in future IQA models.

[135] Meta-information Guided Cross-domain Synergistic Diffusion Model for Low-dose PET Reconstruction

Mengxiao Geng, Ran Hong, Xiaoling Xu, Bingxuan Li, Qiegen Liu

Main category: cs.CV

TL;DR: MiG-DM is a meta-information guided cross-domain diffusion model that integrates clinical parameters and projection-domain physics to generate high-quality low-dose PET images, outperforming existing methods.

Details

Motivation: Low-dose PET imaging reduces radiation exposure but suffers from noise, reduced contrast, and difficulty preserving physiological details. Existing methods neglect both projection-domain physics knowledge and patient-specific meta-information needed for functional-semantic correlation mining.

Method: MiG-DM integrates cross-modal priors with: 1) meta-information encoding module that transforms clinical parameters (patient characteristics, dose info, semi-quantitative parameters) into semantic prompts for cross-modal alignment; 2) cross-domain architecture combining projection-domain and image-domain processing, with a sinogram adapter capturing global physical structures through convolution operations equivalent to global image-domain filtering.

Result: Experiments on UDPET public dataset and clinical datasets with varying dose levels show MiG-DM outperforms state-of-the-art methods in enhancing PET image quality and preserving physiological details.

Conclusion: The proposed MiG-DM successfully integrates meta-information guidance and cross-domain processing to address limitations in low-dose PET imaging, demonstrating superior performance in image quality enhancement and physiological detail preservation.

Abstract: Low-dose PET imaging is crucial for reducing patient radiation exposure but faces challenges like noise interference, reduced contrast, and difficulty in preserving physiological details. Existing methods often neglect both projection-domain physics knowledge and patient-specific meta-information, which are critical for functional-semantic correlation mining. In this study, we introduce a meta-information guided cross-domain synergistic diffusion model (MiG-DM) that integrates comprehensive cross-modal priors to generate high-quality PET images. Specifically, a meta-information encoding module transforms clinical parameters into semantic prompts by considering patient characteristics, dose-related information, and semi-quantitative parameters, enabling cross-modal alignment between textual meta-information and image reconstruction. Additionally, the cross-domain architecture combines projection-domain and image-domain processing. In the projection domain, a specialized sinogram adapter captures global physical structures through convolution operations equivalent to global image-domain filtering. Experiments on the UDPET public dataset and clinical datasets with varying dose levels demonstrate that MiG-DM outperforms state-of-the-art methods in enhancing PET image quality and preserving physiological details.

[136] Hash Grid Feature Pruning

Yangzhi Ma, Bojun Liu, Jie Li, Li Li, Dong Liu

Main category: cs.CV

TL;DR: Hash grid feature pruning method reduces storage/transmission overhead by identifying and removing invalid features in Gaussian splatting, achieving 8% bitrate reduction without performance loss.

Details

Motivation: Hash grids used in implicit neural fields for Gaussian splatting have many invalid features due to irregular 3D distribution of Gaussian splats, causing redundant storage and transmission overhead.

Method: Propose hash grid feature pruning that identifies and prunes invalid features based on input Gaussian splat coordinates, encoding only valid features to reduce storage size.

Result: Achieves average 8% bitrate reduction compared to baseline under Common Test Conditions (CTC) while maintaining model performance, improving rate-distortion performance.

Conclusion: Hash grid feature pruning effectively reduces storage and transmission overhead without compromising performance, offering improved compression efficiency for Gaussian splatting applications.

Abstract: Hash grids are widely used to learn an implicit neural field for Gaussian splatting, serving either as part of the entropy model or for inter-frame prediction. However, due to the irregular and non-uniform distribution of Gaussian splats in 3D space, numerous sparse regions exist, rendering many features in the hash grid invalid. This leads to redundant storage and transmission overhead. In this work, we propose a hash grid feature pruning method that identifies and prunes invalid features based on the coordinates of the input Gaussian splats, so that only the valid features are encoded. This approach reduces the storage size of the hash grid without compromising model performance, leading to improved rate-distortion performance. Following the Common Test Conditions (CTC) defined by the standardization committee, our method achieves an average bitrate reduction of 8% compared to the baseline approach.

[137] Multi-objective hybrid knowledge distillation for efficient deep learning in smart agriculture

Phi-Hung Hoang, Nam-Thuan Trinh, Van-Manh Tran, Thi-Thu-Hong Phan

Main category: cs.CV

TL;DR: A hybrid knowledge distillation framework creates lightweight CNN for edge devices in smart agriculture, achieving near-teacher accuracy with significantly reduced computational cost and model size.

Details

Motivation: Deploying deep learning models on resource-constrained edge devices in smart agriculture is challenging due to the trade-off between computational efficiency and recognition accuracy.

Method: Proposes a hybrid knowledge distillation framework with customized student model combining inverted residual blocks with dense connectivity, trained under ResNet18 teacher guidance using multi-objective strategy integrating hard-label supervision, feature-level distillation, response-level distillation, and self-distillation.

Result: On rice seed variety classification: distilled student achieves 98.56% accuracy (vs teacher’s 98.65%) with only 0.68 GFLOPs and ~1.07M parameters - 2.7x computational cost reduction and >10x model size reduction vs ResNet18. Outperforms DenseNet121 (6x fewer parameters) and ViT (80x fewer parameters) while maintaining comparable/superior accuracy. Consistent gains across multiple plant leaf disease datasets.

Conclusion: The framework demonstrates robustness, efficiency, and strong deployment potential for hardware-limited smart agriculture systems through effective knowledge distillation and model compression.

Abstract: Deploying deep learning models on resource-constrained edge devices remains a major challenge in smart agriculture due to the trade-off between computational efficiency and recognition accuracy. To address this challenge, this study proposes a hybrid knowledge distillation framework for developing a lightweight yet high-performance convolutional neural network. The proposed approach designs a customized student model that combines inverted residual blocks with dense connectivity and trains it under the guidance of a ResNet18 teacher network using a multi-objective strategy that integrates hard-label supervision, feature-level distillation, response-level distillation, and self-distillation. Experiments are conducted on a rice seed variety identification dataset containing nine varieties and further extended to four plant leaf disease datasets, including rice, potato, coffee, and corn, to evaluate generalization capability. On the rice seed variety classification task, the distilled student model achieves an accuracy of 98.56%, which is only 0.09% lower than the teacher model (98.65%), while requiring only 0.68 GFLOPs and approximately 1.07 million parameters. This corresponds to a reduction of about 2.7 times in computational cost and more than 10 times in model size compared with the ResNet18 teacher model. In addition, compared with representative pretrained models, the proposed student reduces the number of parameters by more than 6 times relative to DenseNet121 and by over 80 times compared with the Vision Transformer (ViT) architecture, while maintaining comparable or superior classification accuracy. Consistent performance gains across multiple plant leaf disease datasets further demonstrate the robustness, efficiency, and strong deployment potential of the proposed framework for hardware-limited smart agriculture systems.

[138] Evaluating an Adaptive Multispectral Turret System for Autonomous Tracking Across Variable Illumination Conditions

Aahan Sachdeva, Dhanvinkumar Ganeshkumar, James E. Gallagher, Tyler Treat, Edward J. Oughton

Main category: cs.CV

TL;DR: Adaptive RGB-LWIR fusion framework dynamically selects optimal detection models for different light conditions, significantly outperforming baseline YOLO models across all illumination levels.

Details

Motivation: Traditional RGB detection struggles in low-light, while thermal systems lack color/texture info. Need robust vision for emergency service robots in varying illumination conditions.

Method: Trained 33 YOLO models on 22K+ annotated images across three light levels. Fused aligned RGB and LWIR frames at 11 different ratios (100/0 to 0/100 in 10% increments). Dynamically selects optimal fusion model based on illumination.

Result: Best full-light model (80/20 RGB-LWIR): 92.8% mean confidence. Best dim-light model (90/10): 92.0%. Both significantly outperformed YOLOv5n and YOLOv11n baselines. No-light model (40/60): 71.0%, exceeding baselines though not statistically significant.

Conclusion: Adaptive RGB-LWIR fusion improves detection confidence and reliability across all illumination conditions, enhancing autonomous robotic vision performance for emergency services applications.

Abstract: Autonomous robotic platforms are playing a growing role across the emergency services sector, supporting missions such as search and rescue operations in disaster zones and reconnaissance. However, traditional red-green-blue (RGB) detection pipelines struggle in low-light environments, and thermal-based systems lack color and texture information. To overcome these limitations, we present an adaptive framework that fuses RGB and long-wave infrared (LWIR) video streams at multiple fusion ratios and dynamically selects the optimal detection model for each illumination condition. We trained 33 You Only Look Once (YOLO) models on over 22,000 annotated images spanning three light levels: no-light (<10 lux), dim-light (10-1000 lux), and full-light (>1000 lux). To integrate both modalities, fusion was performed by blending aligned RGB and LWIR frames at eleven ratios, from full RGB (100/0) to full LWIR (0/100) in 10% increments. Evaluation showed that the best full-light model (80/20 RGB-LWIR) and dim-light model (90/10 fusion) achieved 92.8% and 92.0% mean confidence; both significantly outperformed the YOLOv5 nano (YOLOv5n) and YOLOv11 nano (YOLOv11n) baselines. Under no-light conditions, the top 40/60 fusion reached 71.0%, exceeding baselines though not statistically significant. Adaptive RGB-LWIR fusion improved detection confidence and reliability across all illumination conditions, enhancing autonomous robotic vision performance.

[139] Human-Aligned Generative Perception: Bridging Psychophysics and Generative Models

Antara Titikhsha, Om Kulkarni, Dharun Muthaiah

Main category: cs.CV

TL;DR: Using lightweight discriminators as external guidance, the paper introduces geometric understanding to text-to-image diffusion models without specialized training, enabling separation of geometry and style for better semantic alignment.

Details

Motivation: Text-to-image diffusion models generate detailed textures but often fail to follow strict geometric constraints, especially when those constraints conflict with text prompt styles, revealing a semantic gap between human perception and generative models.

Method: Proposes Human Perception Embedding (HPE) teacher trained on THINGS triplet dataset to capture human sensitivity to object shape. Injects gradients from this teacher into latent diffusion process to separate geometry and style controllably. Evaluates across Stable Diffusion v1.5, SiT-XL/2 flow-matching model, and PixArt-Σ diffusion transformer.

Result: Shows flow models tend to drift back without continuous guidance. Demonstrates zero-shot transfer of complex 3D shapes (like Eames chair) onto conflicting materials (pink metal). Guided generation improves semantic alignment by about 80% compared to unguided baselines.

Conclusion: Small teacher models can reliably guide large generative systems, enabling stronger geometric control and broadening the creative range of text-to-image synthesis.

Abstract: Text-to-image diffusion models generate highly detailed textures, yet they often rely on surface appearance and fail to follow strict geometric constraints, particularly when those constraints conflict with the style implied by the text prompt. This reflects a broader semantic gap between human perception and current generative models. We investigate whether geometric understanding can be introduced without specialized training by using lightweight, off-the-shelf discriminators as external guidance signals. We propose a Human Perception Embedding (HPE) teacher trained on the THINGS triplet dataset, which captures human sensitivity to object shape. By injecting gradients from this teacher into the latent diffusion process, we show that geometry and style can be separated in a controllable manner. We evaluate this approach across three architectures: Stable Diffusion v1.5 with a U-Net backbone, the flow-matching model SiT-XL/2, and the diffusion transformer PixArt-Σ. Our experiments reveal that flow models tend to drift back toward their default trajectories without continuous guidance, and we demonstrate zero-shot transfer of complex three-dimensional shapes, such as an Eames chair, onto conflicting materials such as pink metal. This guided generation improves semantic alignment by about 80 percent compared to unguided baselines. Overall, our results show that small teacher models can reliably guide large generative systems, enabling stronger geometric control and broadening the creative range of text-to-image synthesis.

[140] GeCo: A Differentiable Geometric Consistency Metric for Video Generation

Leslie Gu, Junhwa Hur, Charles Herrmann, Fangneng Zhan, Todd Zickler, Deqing Sun, Hanspeter Pfister

Main category: cs.CV

TL;DR: GeCo is a geometry-grounded metric that detects geometric deformation and occlusion-inconsistency artifacts in static scenes using motion and depth priors, enabling benchmarking of video generation models and training-free guidance.

Details

Motivation: Video generation models often produce geometric deformation and occlusion-inconsistency artifacts that are difficult to detect and quantify systematically. There's a need for an interpretable metric to identify these artifacts and benchmark model performance.

Method: GeCo fuses residual motion and depth priors to produce dense consistency maps that reveal geometric deformation and occlusion-inconsistency artifacts. The method is geometry-grounded and generates interpretable visualizations of these artifacts.

Result: GeCo successfully detects deformation and occlusion artifacts, enabling systematic benchmarking of recent video generation models to uncover common failure modes. It also functions effectively as a training-free guidance loss to reduce deformation artifacts during video generation.

Conclusion: GeCo provides a valuable tool for analyzing and improving video generation models by offering an interpretable, geometry-based metric for artifact detection that can be used both for benchmarking and as a guidance mechanism during generation.

Abstract: We introduce GeCo, a geometry-grounded metric for jointly detecting geometric deformation and occlusion-inconsistency artifacts in static scenes. By fusing residual motion and depth priors, GeCo produces interpretable, dense consistency maps that reveal these artifacts. We use GeCo to systematically benchmark recent video generation models, uncovering common failure modes, and further employ it as a training-free guidance loss to reduce deformation artifacts during video generation.

[141] The Illusion of Clinical Reasoning: A Benchmark Reveals the Pervasive Gap in Vision-Language Models for Clinical Competency

Dingyu Wang, Zimu Yuan, Jiajun Liu, Shanggui Liu, Nan Zhou, Tianxing Xu, Di Huang, Dong Jiang

Main category: cs.CV

TL;DR: The B&J Benchmark reveals AI models excel at structured medical questions but struggle with open-ended multimodal clinical reasoning, showing VLMs have severe limitations in image interpretation and hallucination issues.

Details

Motivation: Current benchmarks fail to capture the integrated, multimodal reasoning essential for real-world patient care, necessitating a more comprehensive evaluation of foundation models' true clinical reasoning capabilities beyond narrow exam success.

Method: Developed the Bones and Joints (B&J) Benchmark with 1,245 questions from real-world patient cases in orthopedics/sports medicine, assessing 7 clinical reasoning tasks. Evaluated 11 VLMs and 6 LLMs against expert-derived ground truth.

Result: Models achieved >90% accuracy on structured multiple-choice questions but dropped to ~60% on open-ended multimodal tasks. VLMs showed severe limitations in medical image interpretation and text-driven hallucinations, ignoring contradictory visual evidence. Medical fine-tuned models showed no consistent advantage over general-purpose ones.

Conclusion: Current AI models are not clinically competent for complex multimodal reasoning. Safe deployment should be limited to supportive text-based roles. Future advancement requires fundamental breakthroughs in multimodal integration and visual understanding.

Abstract: Background: The rapid integration of foundation models into clinical practice and public health necessitates a rigorous evaluation of their true clinical reasoning capabilities beyond narrow examination success. Current benchmarks, typically based on medical licensing exams or curated vignettes, fail to capture the integrated, multimodal reasoning essential for real-world patient care. Methods: We developed the Bones and Joints (B&J) Benchmark, a comprehensive evaluation framework comprising 1,245 questions derived from real-world patient cases in orthopedics and sports medicine. This benchmark assesses models across 7 tasks that mirror the clinical reasoning pathway, including knowledge recall, text and image interpretation, diagnosis generation, treatment planning, and rationale provision. We evaluated eleven vision-language models (VLMs) and six large language models (LLMs), comparing their performance against expert-derived ground truth. Results: Our results demonstrate a pronounced performance gap between task types. While state-of-the-art models achieved high accuracy, exceeding 90%, on structured multiple-choice questions, their performance markedly declined on open-ended tasks requiring multimodal integration, with accuracy scarcely reaching 60%. VLMs demonstrated substantial limitations in interpreting medical images and frequently exhibited severe text-driven hallucinations, often ignoring contradictory visual evidence. Notably, models specifically fine-tuned for medical applications showed no consistent advantage over general-purpose counterparts. Conclusions: Current artificial intelligence models are not yet clinically competent for complex, multimodal reasoning. Their safe deployment should currently be limited to supportive, text-based roles. Future advancement in core clinical tasks awaits fundamental breakthroughs in multimodal integration and visual understanding.

[142] FETAL-GAUGE: A Benchmark for Assessing Vision-Language Models in Fetal Ultrasound

Hussain Alasmawi, Numan Saeed, Mohammad Yaqub

Main category: cs.CV

TL;DR: Fetal-Gauge is the first comprehensive VQA benchmark for evaluating Vision-Language Models on fetal ultrasound tasks, revealing current models perform poorly (55% accuracy) and highlighting the need for specialized medical AI development.

Details

Motivation: Addresses the global shortage of trained sonographers and the lack of standardized benchmarks for evaluating VLMs in fetal ultrasound imaging, which is challenging due to operator dependency and limited public datasets.

Method: Created Fetal-Gauge benchmark with over 42,000 images and 93,000 question-answer pairs covering multiple clinical tasks: anatomical plane identification, visual grounding, fetal orientation assessment, clinical view conformity, and clinical diagnosis.

Result: Systematic evaluation of state-of-the-art VLMs shows poor performance - best model achieves only 55% accuracy, far below clinical requirements, revealing critical limitations in current models for fetal ultrasound interpretation.

Conclusion: Fetal-Gauge establishes a rigorous foundation for advancing multimodal deep learning in prenatal care, highlighting the urgent need for domain-adapted architectures and specialized training approaches to address global healthcare accessibility challenges.

Abstract: The growing demand for prenatal ultrasound imaging has intensified a global shortage of trained sonographers, creating barriers to essential fetal health monitoring. Deep learning has the potential to enhance sonographers’ efficiency and support the training of new practitioners. Vision-Language Models (VLMs) are particularly promising for ultrasound interpretation, as they can jointly process images and text to perform multiple clinical tasks within a single framework. However, despite the expansion of VLMs, no standardized benchmark exists to evaluate their performance in fetal ultrasound imaging. This gap is primarily due to the modality’s challenging nature, operator dependency, and the limited public availability of datasets. To address this gap, we present Fetal-Gauge, the first and largest visual question answering benchmark specifically designed to evaluate VLMs across various fetal ultrasound tasks. Our benchmark comprises over 42,000 images and 93,000 question-answer pairs, spanning anatomical plane identification, visual grounding of anatomical structures, fetal orientation assessment, clinical view conformity, and clinical diagnosis. We systematically evaluate several state-of-the-art VLMs, including general-purpose and medical-specific models, and reveal a substantial performance gap: the best-performing model achieves only 55% accuracy, far below clinical requirements. Our analysis identifies critical limitations of current VLMs in fetal ultrasound interpretation, highlighting the urgent need for domain-adapted architectures and specialized training approaches. Fetal-Gauge establishes a rigorous foundation for advancing multimodal deep learning in prenatal care and provides a pathway toward addressing global healthcare accessibility challenges. Our benchmark will be publicly available once the paper gets accepted.

[143] A Three-Level Alignment Framework for Large-Scale 3D Retrieval and Controlled 4D Generation

Philip Xu, David Elizondo, Raouf Hamzaoui

Main category: cs.CV

TL;DR: Uni4D is a unified framework for large-scale open-vocabulary 3D retrieval and controlled 4D generation using structured three-level alignment across text, 3D models, and images.

Details

Motivation: To advance dynamic multimodal understanding by creating a unified system that can handle both 3D retrieval and 4D generation through improved cross-modal alignment.

Method: Uses structured three-level alignment across text, 3D models, and images; built on Align3D 130 dataset; employs 3D text multi-head attention and search model; includes three alignment components: precise text-to-3D retrieval, multi-view 3D-to-image alignment, and image-to-text alignment for 4D generation.

Result: Achieves high-quality 3D retrieval and controllable 4D generation, demonstrating effectiveness in dynamic multimodal understanding and practical applications.

Conclusion: Uni4D advances the field by providing a unified framework that successfully bridges 3D retrieval and 4D generation through structured multimodal alignment.

Abstract: We introduce Uni4D, a unified framework for large scale open vocabulary 3D retrieval and controlled 4D generation based on structured three level alignment across text, 3D models, and image modalities. Built upon the Align3D 130 dataset, Uni4D employs a 3D text multi head attention and search model to optimize text to 3D retrieval through improved semantic alignment. The framework further strengthens cross modal alignment through three components: precise text to 3D retrieval, multi view 3D to image alignment, and image to text alignment for generating temporally consistent 4D assets. Experimental results demonstrate that Uni4D achieves high quality 3D retrieval and controllable 4D generation, advancing dynamic multimodal understanding and practical applications.

[144] Learning Dynamic Scene Reconstruction with Sinusoidal Geometric Priors

Tian Guo, Hui Yuan, Philip Xu, David Elizondo

Main category: cs.CV

TL;DR: SirenPose is a novel loss function combining sinusoidal representation networks with geometric priors to improve 3D scene reconstruction accuracy, especially for fast-moving and multi-target scenes.

Details

Motivation: Existing approaches struggle with motion modeling accuracy and spatiotemporal consistency in dynamic scenes with fast motion and multiple targets. There's a need for better methods that can maintain coherence across both spatial and temporal dimensions.

Method: SirenPose combines periodic activation properties of sinusoidal representation networks with geometric priors from keypoint structures. It introduces physics-inspired constraint mechanisms to enforce coherent keypoint predictions across spatial and temporal dimensions. The training dataset was expanded to 600,000 annotated instances.

Result: Models trained with SirenPose achieve significant improvements in spatiotemporal consistency metrics compared to prior methods. The approach shows superior performance in handling rapid motion and complex scene changes.

Conclusion: SirenPose effectively addresses limitations in dynamic 3D scene reconstruction by combining sinusoidal representations with geometric priors, resulting in improved accuracy and consistency for challenging motion scenarios.

Abstract: We propose SirenPose, a novel loss function that combines the periodic activation properties of sinusoidal representation networks with geometric priors derived from keypoint structures to improve the accuracy of dynamic 3D scene reconstruction. Existing approaches often struggle to maintain motion modeling accuracy and spatiotemporal consistency in fast moving and multi target scenes. By introducing physics inspired constraint mechanisms, SirenPose enforces coherent keypoint predictions across both spatial and temporal dimensions. We further expand the training dataset to 600,000 annotated instances to support robust learning. Experimental results demonstrate that models trained with SirenPose achieve significant improvements in spatiotemporal consistency metrics compared to prior methods, showing superior performance in handling rapid motion and complex scene changes.

[145] Attack-Aware Deepfake Detection under Counter-Forensic Manipulations

Noor Fatima, Hasan Faraz Khan, Muzammil Behzad

Main category: cs.CV

TL;DR: A robust deepfake detector with calibrated probabilities and tamper heatmaps using red-team training and test-time defense in a two-stream architecture.

Details

Motivation: Need for deepfake detectors that are robust to real-world attacks, provide well-calibrated probabilities for reliable decision-making, and offer transparent evidence through tamper localization under realistic deployment conditions.

Method: Two-stream architecture: one stream encodes semantic content via pretrained backbone, other extracts forensic residuals, fused via lightweight residual adapter. Uses red-team training with worst-of-K counter-forensics attacks and test-time defense with randomized jitters. Weakly supervised heatmaps guided by face-box masks.

Result: Near-perfect ranking across attacks, low calibration error, minimal abstention risk, controlled degradation under regrain attacks, and effective tamper localization on standard deepfake datasets and surveillance-style benchmarks.

Conclusion: The method establishes a modular, data-efficient, practically deployable baseline for attack-aware detection with calibrated probabilities and actionable heatmaps, demonstrating robustness under realistic conditions.

Abstract: This work presents an attack-aware deepfake and image-forensics detector designed for robustness, well-calibrated probabilities, and transparent evidence under realistic deployment conditions. The method combines red-team training with randomized test-time defense in a two-stream architecture, where one stream encodes semantic content using a pretrained backbone and the other extracts forensic residuals, fused via a lightweight residual adapter for classification, while a shallow Feature Pyramid Network style head produces tamper heatmaps under weak supervision. Red-team training applies worst-of-K counter-forensics per batch, including JPEG realign and recompress, resampling warps, denoise-to-regrain operations, seam smoothing, small color and gamma shifts, and social-app transcodes, while test-time defense injects low-cost jitters such as resize and crop phase changes, mild gamma variation, and JPEG phase shifts with aggregated predictions. Heatmaps are guided to concentrate within face regions using face-box masks without strict pixel-level annotations. Evaluation on existing benchmarks, including standard deepfake datasets and a surveillance-style split with low light and heavy compression, reports clean and attacked performance, AUC, worst-case accuracy, reliability, abstention quality, and weak-localization scores. Results demonstrate near-perfect ranking across attacks, low calibration error, minimal abstention risk, and controlled degradation under regrain, establishing a modular, data-efficient, and practically deployable baseline for attack-aware detection with calibrated probabilities and actionable heatmaps.

[146] PortionNet: Distilling 3D Geometric Knowledge for Food Nutrition Estimation

Darrin Bright, Rakshith Raj, Kanchan Keisham

Main category: cs.CV

TL;DR: PortionNet: Cross-modal knowledge distillation framework that learns 3D geometry from point clouds during training but only needs RGB images at inference for accurate food nutrition estimation.

Details

Motivation: Accurate food nutrition estimation from single images is challenging due to loss of 3D information. Depth-based methods require specialized hardware (depth sensors) that are unavailable on most smartphones, limiting practical deployment.

Method: Proposes PortionNet, a cross-modal knowledge distillation framework with dual-mode training strategy. Uses a lightweight adapter network to mimic point cloud representations, enabling pseudo-3D reasoning without hardware requirements. Learns geometric features from point clouds during training but only requires RGB images at inference.

Result: Achieves state-of-the-art performance on MetaFood3D dataset, outperforming all previous methods in both volume and energy estimation. Cross-dataset evaluation on SimpleFood45 demonstrates strong generalization in energy estimation.

Conclusion: PortionNet enables accurate 3D-aware food nutrition estimation from single RGB images without requiring depth sensors, making it practical for deployment on standard smartphones while maintaining high accuracy.

Abstract: Accurate food nutrition estimation from single images is challenging due to the loss of 3D information. While depth-based methods provide reliable geometry, they remain inaccessible on most smartphones because of depth-sensor requirements. To overcome this challenge, we propose PortionNet, a novel cross-modal knowledge distillation framework that learns geometric features from point clouds during training while requiring only RGB images at inference. Our approach employs a dual-mode training strategy where a lightweight adapter network mimics point cloud representations, enabling pseudo-3D reasoning without any specialized hardware requirements. PortionNet achieves state-of-the-art performance on MetaFood3D, outperforming all previous methods in both volume and energy estimation. Cross-dataset evaluation on SimpleFood45 further demonstrates strong generalization in energy estimation.

Alireza Moayedikia, Sattar Dorafshan

Main category: cs.CV

TL;DR: Multi-modal attention network fuses radar temporal patterns with thermal spatial signatures for bridge deck delamination detection, outperforming single-modal and concatenation fusion approaches with uncertainty quantification for safety-critical decisions.

Details

Motivation: Visual inspection of deteriorating civil infrastructure has limitations. Single-modal approaches (Ground Penetrating Radar and Infrared Thermography) face complementary constraints: radar struggles with moisture and shallow defects, while thermography has weather dependency and limited depth. Need automated multi-modal techniques for better subsurface defect detection.

Method: Multi-modal attention network with temporal attention for radar processing, spatial attention for thermal features, and cross-modal fusion with learnable embeddings to discover complementary defect patterns. Incorporates uncertainty quantification through Monte Carlo dropout and learned variance estimation, decomposing uncertainty into epistemic and aleatoric components.

Result: On five bridge datasets with balanced to moderately imbalanced data, approach substantially outperforms baselines in accuracy and AUC. Cross-modal attention provides critical gains beyond within-modality attention. Multi-head mechanisms achieve improved calibration. Uncertainty quantification reduces calibration error and enables selective prediction by rejecting uncertain cases. However, under extreme class imbalance, attention mechanisms show vulnerability to majority class collapse.

Conclusion: Attention-based architecture performs well across typical scenarios, while extreme imbalance requires specialized techniques. System maintains deployment efficiency for real-time inspection with characterized capabilities and limitations. Provides actionable guidance for multi-modal bridge deck inspection.

Abstract: Deteriorating civil infrastructure requires automated inspection techniques overcoming limitations of visual assessment. While Ground Penetrating Radar and Infrared Thermography enable subsurface defect detection, single modal approaches face complementary constraints radar struggles with moisture and shallow defects, while thermography exhibits weather dependency and limited depth. This paper presents a multi modal attention network fusing radar temporal patterns with thermal spatial signatures for bridge deck delamination detection. Our architecture introduces temporal attention for radar processing, spatial attention for thermal features, and cross modal fusion with learnable embeddings discovering complementary defect patterns invisible to individual sensors. We incorporate uncertainty quantification through Monte Carlo dropout and learned variance estimation, decomposing uncertainty into epistemic and aleatoric components for safety critical decisions. Experiments on five bridge datasets reveal that on balanced to moderately imbalanced data, our approach substantially outperforms baselines in accuracy and AUC representing meaningful improvements over single modal and concatenation based fusion. Ablation studies demonstrate cross modal attention provides critical gains beyond within modality attention, while multi head mechanisms achieve improved calibration. Uncertainty quantification reduces calibration error, enabling selective prediction by rejecting uncertain cases. However, under extreme class imbalance, attention mechanisms show vulnerability to majority class collapse. These findings provide actionable guidance: attention based architecture performs well across typical scenarios, while extreme imbalance requires specialized techniques. Our system maintains deployment efficiency, enabling real time inspection with characterized capabilities and limitations.

[148] MoFu: Scale-Aware Modulation and Fourier Fusion for Multi-Subject Video Generation

Run Ling, Ke Cao, Jian Lu, Ao Ma, Haowei Liu, Runze He, Changwei Wang, Rongtao Xu, Yihua Shao, Zhanjie Zhang, Peng Wu, Guibing Guo, Wei Feng, Zheng Zhang, Jingjing Lv, Junjie Shen, Ching Law, Xingwei Wang

Main category: cs.CV

TL;DR: MoFu is a unified framework for multi-subject video generation that addresses scale inconsistency and permutation sensitivity through Scale-Aware Modulation and Fourier Fusion, achieving superior subject fidelity and visual quality.

Details

Motivation: Current multi-subject video generation methods suffer from two key challenges: scale inconsistency (unnatural subject size variations) and permutation sensitivity (subject distortion based on reference input order), which degrade the quality and naturalness of generated videos.

Method: MoFu introduces: 1) Scale-Aware Modulation (SMO) - an LLM-guided module that extracts implicit scale cues from prompts to modulate features for consistent subject sizes; 2) Fourier Fusion - processes reference features via Fast Fourier Transform to create unified representations insensitive to input order; 3) Scale-Permutation Stability Loss - jointly encourages scale-consistent and permutation-invariant generation.

Result: Extensive experiments show MoFu significantly outperforms existing methods in preserving natural scale, subject fidelity, and overall visual quality. The authors also establish a dedicated benchmark with controlled variations in subject scale and reference permutation for evaluation.

Conclusion: MoFu successfully addresses both scale inconsistency and permutation sensitivity in multi-subject video generation through its unified framework, achieving more natural and consistent video synthesis from textual prompts and multiple reference images.

Abstract: Multi-subject video generation aims to synthesize videos from textual prompts and multiple reference images, ensuring that each subject preserves natural scale and visual fidelity. However, current methods face two challenges: scale inconsistency, where variations in subject size lead to unnatural generation, and permutation sensitivity, where the order of reference inputs causes subject distortion. In this paper, we propose MoFu, a unified framework that tackles both challenges. For scale inconsistency, we introduce Scale-Aware Modulation (SMO), an LLM-guided module that extracts implicit scale cues from the prompt and modulates features to ensure consistent subject sizes. To address permutation sensitivity, we present a simple yet effective Fourier Fusion strategy that processes the frequency information of reference features via the Fast Fourier Transform to produce a unified representation. Besides, we design a Scale-Permutation Stability Loss to jointly encourage scale-consistent and permutation-invariant generation. To further evaluate these challenges, we establish a dedicated benchmark with controlled variations in subject scale and reference permutation. Extensive experiments demonstrate that MoFu significantly outperforms existing methods in preserving natural scale, subject fidelity, and overall visual quality.

[149] VideoZoomer: Reinforcement-Learned Temporal Focusing for Long Video Reasoning

Yang Ding, Yizhen Zhang, Xin Lai, Ruihang Chu, Yujiu Yang

Main category: cs.CV

TL;DR: VideoZoomer: An agentic framework enabling MLLMs to dynamically control visual focus during reasoning for long video understanding, using temporal zooming to gather fine-grained evidence progressively.

Details

Motivation: Current MLLMs struggle with long video understanding due to limited context windows, relying on uniform frame sampling or static pre-selection that may miss critical evidence and cannot correct initial selection errors during reasoning.

Method: Proposes VideoZoomer framework where MLLMs start with coarse low-frame-rate overview, then invoke temporal zoom tool to obtain high-frame-rate clips at autonomously chosen moments, gathering evidence in multi-turn interactive manner. Uses two-stage training: supervised fine-tuning on distilled exemplar and reflection trajectories, followed by reinforcement learning.

Result: The 7B model demonstrates diverse and complex reasoning patterns, achieving strong performance across long video understanding benchmarks, surpassing open-source models and rivaling proprietary systems while maintaining superior efficiency under reduced frame budgets.

Conclusion: VideoZoomer enables dynamic visual focus control in MLLMs for long video understanding, overcoming limitations of static approaches through agentic temporal zooming, resulting in state-of-the-art performance with efficient resource usage.

Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable progress in vision-language tasks yet remain limited in long video understanding due to the limited context window. Consequently, prevailing approaches tend to rely on uniform frame sampling or static pre-selection, which might overlook critical evidence and unable to correct its initial selection error during its reasoning process. To overcome these limitations, we propose VideoZoomer, a novel agentic framework that enables MLLMs to dynamically control their visual focus during reasoning. Starting from a coarse low-frame-rate overview, VideoZoomer invokes a temporal zoom tool to obtain high-frame-rate clips at autonomously chosen moments, thereby progressively gathering fine-grained evidence in a multi-turn interactive manner. Accordingly, we adopt a two-stage training strategy: a cold-start supervised fine-tuning phase on a curated dataset of distilled exemplar and reflection trajectories, followed by reinforcement learning to further refine the agentic policy. Extensive experiments demonstrate that our 7B model delivers diverse and complex reasoning patterns, yielding strong performance across a broad set of long video understanding and reasoning benchmarks. These emergent capabilities allow it to consistently surpass existing open-source models and even rival proprietary systems on challenging tasks, while achieving superior efficiency under reduced frame budgets.

[150] SpotEdit: Selective Region Editing in Diffusion Transformers

Zhibin Qin, Zhenxiong Tan, Zeqing Wang, Songhua Liu, Xinchao Wang

Main category: cs.CV

TL;DR: SpotEdit is a training-free diffusion editing framework that selectively updates only modified image regions instead of uniformly processing all tokens, reducing redundant computation while preserving unchanged areas.

Details

Motivation: Current diffusion transformer models uniformly process all image tokens at every timestep during editing, causing redundant computation for unchanged regions and potentially degrading unmodified areas. Most edits only involve small region modifications, making full regeneration unnecessary.

Method: SpotEdit has two components: SpotSelector identifies stable/unmodified regions via perceptual similarity and skips their computation by reusing conditional image features; SpotFusion adaptively blends these features with edited tokens through a dynamic fusion mechanism to preserve contextual coherence.

Result: By selectively updating only modified regions, SpotEdit reduces unnecessary computation while maintaining high fidelity in unmodified areas, achieving efficient and precise image editing.

Conclusion: SpotEdit demonstrates that full regeneration of all image regions during editing is unnecessary, and selective region updating can achieve both computational efficiency and editing quality preservation.

Abstract: Diffusion Transformer models have significantly advanced image editing by encoding conditional images and integrating them into transformer layers. However, most edits involve modifying only small regions, while current methods uniformly process and denoise all tokens at every timestep, causing redundant computation and potentially degrading unchanged areas. This raises a fundamental question: Is it truly necessary to regenerate every region during editing? To address this, we propose SpotEdit, a training-free diffusion editing framework that selectively updates only the modified regions. SpotEdit comprises two key components: SpotSelector identifies stable regions via perceptual similarity and skips their computation by reusing conditional image features; SpotFusion adaptively blends these features with edited tokens through a dynamic fusion mechanism, preserving contextual coherence and editing quality. By reducing unnecessary computation and maintaining high fidelity in unmodified areas, SpotEdit achieves efficient and precise image editing.

[151] DeMoGen: Towards Decompositional Human Motion Generation with Energy-Based Diffusion Models

Jianrong Zhang, Hehe Fan, Yi Yang

Main category: cs.CV

TL;DR: DeMoGen is a compositional training paradigm using energy-based diffusion models to decompose holistic human motions into reusable semantic primitives without ground-truth concept motions, enabling motion disentanglement and flexible recombination.

Details

Motivation: Existing motion modeling approaches focus on forward generation (text-to-motion) or composition from known concepts, but lack the inverse capability to decompose complex motions into meaningful sub-components for understanding and reusability.

Method: Energy-based diffusion model that captures composed distribution of multiple motion concepts; three training variants: DeMoGen-Exp (explicit decomposed text prompts), DeMoGen-OSS (orthogonal self-supervised decomposition), and DeMoGen-SC (semantic consistency between original and decomposed embeddings).

Result: Successfully disentangles reusable motion primitives from complex sequences, enables flexible recombination to generate diverse novel motions beyond training distribution, and provides a constructed text-decomposed dataset for compositional training.

Conclusion: DeMoGen provides an effective framework for decompositional learning of human motion, enabling both analysis (decomposition) and synthesis (recombination) of compositional motion concepts without requiring ground-truth concept motions.

Abstract: Human motions are compositional: complex behaviors can be described as combinations of simpler primitives. However, existing approaches primarily focus on forward modeling, e.g., learning holistic mappings from text to motion or composing a complex motion from a set of motion concepts. In this paper, we consider the inverse perspective: decomposing a holistic motion into semantically meaningful sub-components. We propose DeMoGen, a compositional training paradigm for decompositional learning that employs an energy-based diffusion model. This energy formulation directly captures the composed distribution of multiple motion concepts, enabling the model to discover them without relying on ground-truth motions for individual concepts. Within this paradigm, we introduce three training variants to encourage a decompositional understanding of motion: 1. DeMoGen-Exp explicitly trains on decomposed text prompts; 2. DeMoGen-OSS performs orthogonal self-supervised decomposition; 3. DeMoGen-SC enforces semantic consistency between original and decomposed text embeddings. These variants enable our approach to disentangle reusable motion primitives from complex motion sequences. We also demonstrate that the decomposed motion concepts can be flexibly recombined to generate diverse and novel motions, generalizing beyond the training distribution. Additionally, we construct a text-decomposed dataset to support compositional training, serving as an extended resource to facilitate text-to-motion generation and motion composition.

[152] The Multi-View Paradigm Shift in MRI Radiomics: Predicting MGMT Methylation in Glioblastoma

Mariya Miteva, Maria Nisheva-Pavlova

Main category: cs.CV

TL;DR: A multi-view VAE framework integrates radiomic features from T1Gd and FLAIR MRI to predict MGMT promoter methylation in glioblastoma, addressing limitations of conventional unimodal approaches.

Details

Motivation: Non-invasive inference of molecular tumor characteristics like MGMT promoter methylation is crucial in glioblastoma for prognosis and treatment. Conventional radiomics methods suffer from high feature redundancy and incomplete modeling of modality-specific information.

Method: Multi-view latent representation learning using variational autoencoders (VAE) with independent probabilistic encoders for each modality (T1Gd and FLAIR MRI). Fusion occurs in compact latent space to preserve modality-specific structure while enabling multimodal integration.

Result: The framework generates latent embeddings that are used for MGMT promoter methylation classification, though specific performance metrics are not provided in the abstract.

Conclusion: The proposed multi-view VAE approach effectively integrates complementary radiomic features from different MRI modalities for improved molecular characterization of glioblastoma tumors.

Abstract: Non-invasive inference of molecular tumor characteristics from medical imaging is a central goal of radiogenomics, particularly in glioblastoma (GBM), where O6-methylguanine-DNA methyltransferase (MGMT) promoter methylation carries important prognostic and therapeutic significance. Although radiomics-based machine learning methods have shown promise for this task, conventional unimodal and early-fusion approaches are often limited by high feature redundancy and an incomplete modeling of modality-specific information. In this work, we introduce a multi-view latent representation learning framework based on variational autoencoders (VAE) to integrate complementary radiomic features derived from post-contrast T1-weighted (T1Gd) and Fluid-Attenuated Inversion Recovery (FLAIR) magnetic resonance imaging (MRI). By encoding each modality through an independent probabilistic encoder and performing fusion in a compact latent space, the proposed approach preserves modality-specific structure while enabling effective multimodal integration. The resulting latent embeddings are subsequently used for MGMT promoter methylation classification.

[153] Feature Learning with Multi-Stage Vision Transformers on Inter-Modality HER2 Status Scoring and Tumor Classification on Whole Slides

Olaide N. Oyelade, Oliver Hoxey, Yulia Humrye

Main category: cs.CV

TL;DR: A vision transformer-based pipeline for HER2 scoring that jointly analyzes H&E and IHC whole slide images with pixel-level localization of HER2 status (0, 1+, 2+, 3+).

Details

Motivation: Current deep learning methods for HER2 scoring lack pixel-level localization capabilities and struggle with joint analysis of H&E and IHC stained images, which is crucial for accurate HER2 protein expression assessment in cancer treatment planning.

Method: End-to-end vision transformer pipeline with patch-wise H&E processing for tumor localization, novel mapping function to correlate IHC regions with H&E malignant areas, and clinically-inspired 4-way HER2 scoring mechanism (0, 1+, 2+, 3+) with pixel-level annotation.

Result: Achieved 0.94 classification accuracy and 0.933 specificity for HER2 status prediction in 4-way scoring, with good tumor localization accuracy and performance comparable to human pathologists on WSI patches.

Conclusion: The proposed ViT-based pipeline successfully enables joint evaluation of H&E and IHC images for accurate HER2 scoring with pixel-level localization, demonstrating clinical applicability for automated HER2 assessment in cancer pathology.

Abstract: The popular use of histopathology images, such as hematoxylin and eosin (H&E), has proven to be useful in detecting tumors. However, moving such cancer cases forward for treatment requires accurate on the amount of the human epidermal growth factor receptor 2 (HER2) protein expression. Predicting both the lower and higher levels of HER2 can be challenging. Moreover, jointly analyzing H&E and immunohistochemistry (IHC) stained images for HER2 scoring is difficult. Although several deep learning methods have been investigated to address the challenge of HER2 scoring, they suffer from providing a pixel-level localization of HER2 status. In this study, we propose a single end-to-end pipeline using a system of vision transformers with HER2 status scoring on whole slide images of WSIs. The method includes patch-wise processing of H&E WSIs for tumor localization. A novel mapping function is proposed to correspondingly identify correlated IHC WSIs regions with malignant regions on H&E. A clinically inspired HER2 scoring mechanism is embedded in the pipeline and allows for automatic pixel-level annotation of 4-way HER2 scoring (0, 1+, 2+, and 3+). Also, the proposed method accurately returns HER2-negative and HER2-positive. Privately curated datasets were collaboratively extracted from 13 different cases of WSIs of H&E and IHC. A thorough experiment is conducted on the proposed method. Results obtained showed a good classification accuracy during tumor localization. Also, a classification accuracy of 0.94 and a specificity of 0.933 were returned for the prediction of HER2 status, scoring in the 4-way methods. The applicability of the proposed pipeline was investigated using WSIs patches as comparable to human pathologists. Findings from the study showed the usability of jointly evaluated H&E and IHC images on end-to-end ViTs-based models for HER2 scoring

[154] Human-like visual computing advances explainability and few-shot learning in deep neural networks for complex physiological data

Alaa Alahmadi, Mohamed Hasan

Main category: cs.CV

TL;DR: Human-inspired pseudo-coloring improves ECG interpretation in deep neural networks, enabling few-shot learning and better explainability for detecting drug-induced long QT syndrome.

Details

Motivation: Current deep learning models for ECG analysis require large datasets and lack interpretability, limiting clinical reliability. There's a need for data-efficient, explainable models that align with human reasoning for medical applications.

Method: Used perception-informed pseudo-coloring technique to encode clinically relevant temporal features (like QT-interval duration) into structured color representations. Applied prototypical networks and ResNet-18 architecture for one-shot and few-shot learning on ECG images from single cardiac cycles and full 10-second rhythms.

Result: Models learned discriminative and interpretable features from as few as 1-5 training examples. Pseudo-coloring guided attention toward clinically meaningful ECG features while suppressing irrelevant components. Aggregating multiple cardiac cycles further improved performance, mimicking human perceptual averaging.

Conclusion: Human-like perceptual encoding can bridge data efficiency, explainability, and causal reasoning in medical machine intelligence, particularly for challenging cases like drug-induced long QT syndrome with scarce positive cases.

Abstract: Machine vision models, particularly deep neural networks, are increasingly applied to physiological signal interpretation, including electrocardiography (ECG), yet they typically require large training datasets and offer limited insight into the causal features underlying their predictions. This lack of data efficiency and interpretability constrains their clinical reliability and alignment with human reasoning. Here, we show that a perception-informed pseudo-colouring technique, previously demonstrated to enhance human ECG interpretation, can improve both explainability and few-shot learning in deep neural networks analysing complex physiological data. We focus on acquired, drug-induced long QT syndrome (LQTS) as a challenging case study characterised by heterogeneous signal morphology, variable heart rate, and scarce positive cases associated with life-threatening arrhythmias such as torsades de pointes. This setting provides a stringent test of model generalisation under extreme data scarcity. By encoding clinically salient temporal features, such as QT-interval duration, into structured colour representations, models learn discriminative and interpretable features from as few as one or five training examples. Using prototypical networks and a ResNet-18 architecture, we evaluate one-shot and few-shot learning on ECG images derived from single cardiac cycles and full 10-second rhythms. Explainability analyses show that pseudo-colouring guides attention toward clinically meaningful ECG features while suppressing irrelevant signal components. Aggregating multiple cardiac cycles further improves performance, mirroring human perceptual averaging across heartbeats. Together, these findings demonstrate that human-like perceptual encoding can bridge data efficiency, explainability, and causal reasoning in medical machine intelligence.

[155] VULCAN: Tool-Augmented Multi Agents for Iterative 3D Object Arrangement

Zhengfei Kuang, Rui Lin, Long Zhao, Gordon Wetzstein, Saining Xie, Sanghyun Woo

Main category: cs.CV

TL;DR: MLLMs extended for 3D scene manipulation via MCP-based API, visual tools for spatial understanding, and multi-agent framework for robust execution.

Details

Motivation: Despite MLLMs' success in 2D vision-language tasks, their application to complex 3D scene manipulation remains underexplored. The paper aims to bridge this gap by addressing three key challenges: weak visual grounding, limited 3D scene understanding, and error-prone iterative updates in 3D object arrangement tasks.

Method: 1) Introduces MCP-based API to shift from brittle raw code manipulation to robust function-level updates. 2) Augments MLLMs with specialized visual tools for scene analysis, spatial information gathering, and action validation. 3) Proposes collaborative multi-agent framework with designated roles for planning, execution, and verification to handle multi-step instructions and recover from errors.

Result: Demonstrates effectiveness on 25 complex object arrangement tasks, significantly outperforming existing baselines.

Conclusion: The proposed approach successfully bridges the gap between MLLMs and 3D scene manipulation by addressing visual grounding, spatial understanding, and error management challenges through API design, visual tools, and multi-agent collaboration.

Abstract: Despite the remarkable progress of Multimodal Large Language Models (MLLMs) in 2D vision-language tasks, their application to complex 3D scene manipulation remains underexplored. In this paper, we bridge this critical gap by tackling three key challenges in 3D object arrangement task using MLLMs. First, to address the weak visual grounding of MLLMs, which struggle to link programmatic edits with precise 3D outcomes, we introduce an MCP-based API. This shifts the interaction from brittle raw code manipulation to more robust, function-level updates. Second, we augment the MLLM’s 3D scene understanding with a suite of specialized visual tools to analyze scene state, gather spatial information, and validate action outcomes. This perceptual feedback loop is critical for closing the gap between language-based updates and precise 3D-aware manipulation. Third, to manage the iterative, error-prone updates, we propose a collaborative multi-agent framework with designated roles for planning, execution, and verification. This decomposition allows the system to robustly handle multi-step instructions and recover from intermediate errors. We demonstrate the effectiveness of our approach on a diverse set of 25 complex object arrangement tasks, where it significantly outperforms existing baselines. Website: vulcan-3d.github.io

[156] Self-Evaluation Unlocks Any-Step Text-to-Image Generation

Xin Yu, Xiaojuan Qi, Zhengqi Li, Kai Zhang, Richard Zhang, Zhe Lin, Eli Shechtman, Tianyu Wang, Yotam Nitzan

Main category: cs.CV

TL;DR: Self-E is a novel text-to-image model that learns from scratch with self-evaluation, enabling any-step inference without needing a pretrained teacher or many inference steps.

Details

Motivation: To bridge the gap between traditional diffusion/flow models (which need many inference steps) and distillation approaches (which require pretrained teachers), creating a unified model that works well at any step count from scratch.

Method: Combines Flow Matching training with a novel self-evaluation mechanism where the model evaluates its own generated samples using current score estimates, acting as a dynamic self-teacher that provides both local supervision and global matching.

Result: Self-E excels at few-step generation while being competitive with state-of-the-art Flow Matching models at 50 steps, with performance improving monotonically as inference steps increase, enabling both ultra-fast few-step and high-quality long-trajectory sampling in one model.

Conclusion: Self-E is the first from-scratch, any-step text-to-image model, offering a unified framework for efficient and scalable generation that doesn’t rely on pretrained teachers or many inference steps.

Abstract: We introduce the Self-Evaluating Model (Self-E), a novel, from-scratch training approach for text-to-image generation that supports any-step inference. Self-E learns from data similarly to a Flow Matching model, while simultaneously employing a novel self-evaluation mechanism: it evaluates its own generated samples using its current score estimates, effectively serving as a dynamic self-teacher. Unlike traditional diffusion or flow models, it does not rely solely on local supervision, which typically necessitates many inference steps. Unlike distillation-based approaches, it does not require a pretrained teacher. This combination of instantaneous local learning and self-driven global matching bridges the gap between the two paradigms, enabling the training of a high-quality text-to-image model from scratch that excels even at very low step counts. Extensive experiments on large-scale text-to-image benchmarks show that Self-E not only excels in few-step generation, but is also competitive with state-of-the-art Flow Matching models at 50 steps. We further find that its performance improves monotonically as inference steps increase, enabling both ultra-fast few-step generation and high-quality long-trajectory sampling within a single unified model. To our knowledge, Self-E is the first from-scratch, any-step text-to-image model, offering a unified framework for efficient and scalable generation.

[157] iOSPointMapper: RealTime Pedestrian and Accessibility Mapping with Mobile AI

Himanshu Naidu, Yuxiang Zhang, Sachin Mehta, Anat Caspi

Main category: cs.CV

TL;DR: iOSPointMapper is a mobile app that uses iPhones/iPads to map sidewalks in real-time with privacy protection, combining semantic segmentation, LiDAR depth, and GPS/IMU data to detect pedestrian infrastructure features.

Details

Motivation: Current sidewalk data collection methods are costly, fragmented, and difficult to scale, creating barriers to building accessible and inclusive pedestrian infrastructure. There's a need for scalable, real-time mapping solutions.

Method: The system uses on-device semantic segmentation, LiDAR-based depth estimation, and fused GPS/IMU data to detect sidewalk features (traffic signs, lights, poles). It includes a user-guided annotation interface for validation, anonymizes data, and transmits to the Transportation Data Exchange Initiative (TDEI).

Result: Detailed evaluations show strong performance in feature detection and spatial mapping, demonstrating the application’s potential for enhanced pedestrian mapping and closing critical data gaps.

Conclusion: iOSPointMapper offers a scalable, user-centered, privacy-conscious approach to sidewalk mapping that can integrate with broader transportation datasets, addressing critical data gaps in pedestrian infrastructure.

Abstract: Accurate, up-to-date sidewalk data is essential for building accessible and inclusive pedestrian infrastructure, yet current approaches to data collection are often costly, fragmented, and difficult to scale. We introduce iOSPointMapper, a mobile application that enables real-time, privacy-conscious sidewalk mapping on the ground, using recent-generation iPhones and iPads. The system leverages on-device semantic segmentation, LiDAR-based depth estimation, and fused GPS/IMU data to detect and localize sidewalk-relevant features such as traffic signs, traffic lights and poles. To ensure transparency and improve data quality, iOSPointMapper incorporates a user-guided annotation interface for validating system outputs before submission. Collected data is anonymized and transmitted to the Transportation Data Exchange Initiative (TDEI), where it integrates seamlessly with broader multimodal transportation datasets. Detailed evaluations of the system’s feature detection and spatial mapping performance reveal the application’s potential for enhanced pedestrian mapping. Together, these capabilities offer a scalable and user-centered approach to closing critical data gaps in pedestrian

[158] DeFloMat: Detection with Flow Matching for Stable and Efficient Generative Object Localization

Hansang Lee, Chaelin Lee, Nieun Seo, Joon Seok Lim, Helen Hong

Main category: cs.CV

TL;DR: DeFloMat is a fast generative object detection framework using Conditional Flow Matching that achieves state-of-the-art accuracy in just 3 inference steps, solving the latency bottleneck of diffusion-based detectors for clinical applications.

Details

Motivation: Diffusion-based detectors like DiffusionDet have high accuracy but require many sampling steps (T ≫ 60), making them impractical for time-sensitive clinical applications such as Crohn's Disease detection in Magnetic Resonance Enterography (MRE). There's a need to resolve the trade-off between generative accuracy and clinical efficiency.

Method: DeFloMat replaces the slow stochastic denoising process of diffusion models with a highly direct, deterministic flow field derived from Conditional Optimal Transport theory, specifically approximating Rectified Flow. This enables fast inference via a simple Ordinary Differential Equation (ODE) solver.

Result: DeFloMat achieves state-of-the-art accuracy (43.32% AP₁₀:₅₀) in only 3 inference steps, representing a 1.4× performance improvement over DiffusionDet’s maximum converged performance (31.03% AP₁₀:₅₀ at 4 steps). It also shows superior localization characteristics with better Recall and stability in the few-step regime.

Conclusion: DeFloMat resolves the trade-off between generative accuracy and clinical efficiency, setting a new standard for stable and rapid object localization, making it practical for time-sensitive clinical applications like Crohn’s Disease detection in MRE.

Abstract: We propose DeFloMat (Detection with Flow Matching), a novel generative object detection framework that addresses the critical latency bottleneck of diffusion-based detectors, such as DiffusionDet, by integrating Conditional Flow Matching (CFM). Diffusion models achieve high accuracy by formulating detection as a multi-step stochastic denoising process, but their reliance on numerous sampling steps ($T \gg 60$) makes them impractical for time-sensitive clinical applications like Crohn’s Disease detection in Magnetic Resonance Enterography (MRE). DeFloMat replaces this slow stochastic path with a highly direct, deterministic flow field derived from Conditional Optimal Transport (OT) theory, specifically approximating the Rectified Flow. This shift enables fast inference via a simple Ordinary Differential Equation (ODE) solver. We demonstrate the superiority of DeFloMat on a challenging MRE clinical dataset. Crucially, DeFloMat achieves state-of-the-art accuracy ($43.32% \text{ } AP_{10:50}$) in only $3$ inference steps, which represents a $1.4\times$ performance improvement over DiffusionDet’s maximum converged performance ($31.03% \text{ } AP_{10:50}$ at $4$ steps). Furthermore, our deterministic flow significantly enhances localization characteristics, yielding superior Recall and stability in the few-step regime. DeFloMat resolves the trade-off between generative accuracy and clinical efficiency, setting a new standard for stable and rapid object localization.

[159] Bright 4B: Scaling Hyperspherical Learning for Segmentation in 3D Brightfield Microscopy

Amil Khan, Matheus Palhares Viana, Suraj Mishra, B. S. Manjunath

Main category: cs.CV

TL;DR: Bright-4B is a 4B parameter foundation model that segments subcellular structures directly from 3D brightfield microscopy volumes without fluorescence or heavy post-processing.

Details

Motivation: Label-free 3D brightfield microscopy is fast and noninvasive but lacks robust volumetric segmentation capabilities without fluorescence or extensive post-processing.

Method: Combines Native Sparse Attention (local/coarse/global context), depth-width residual HyperConnections, soft Mixture-of-Experts, and anisotropic patch embedding for geometry-faithful 3D tokenization.

Result: Produces morphology-accurate segmentations of nuclei, mitochondria, and other organelles from brightfield alone, outperforming CNN and Transformer baselines across multiple confocal datasets.

Conclusion: Bright-4B enables large-scale, label-free 3D cell mapping with fine structural detail preservation, with code and models released for community use.

Abstract: Label-free 3D brightfield microscopy offers a fast and noninvasive way to visualize cellular morphology, yet robust volumetric segmentation still typically depends on fluorescence or heavy post-processing. We address this gap by introducing Bright-4B, a 4 billion parameter foundation model that learns on the unit hypersphere to segment subcellular structures directly from 3D brightfield volumes. Bright-4B combines a hardware-aligned Native Sparse Attention mechanism (capturing local, coarse, and selected global context), depth-width residual HyperConnections that stabilize representation flow, and a soft Mixture-of-Experts for adaptive capacity. A plug-and-play anisotropic patch embed further respects confocal point-spread and axial thinning, enabling geometry-faithful 3D tokenization. The resulting model produces morphology-accurate segmentations of nuclei, mitochondria, and other organelles from brightfield stacks alone–without fluorescence, auxiliary channels, or handcrafted post-processing. Across multiple confocal datasets, Bright-4B preserves fine structural detail across depth and cell types, outperforming contemporary CNN and Transformer baselines. All code, pretrained weights, and models for downstream finetuning will be released to advance large-scale, label-free 3D cell mapping.

[160] FluenceFormer: Transformer-Driven Multi-Beam Fluence Map Regression for Radiotherapy Planning

Ujunwa Mgboh, Rafi Ibn Sultan, Joshua Kim, Kundan Thind, Dongxiao Zhu

Main category: cs.CV

TL;DR: FluenceFormer is a transformer-based framework for radiotherapy fluence map prediction that uses a two-stage design with physics-informed loss to address long-range dependency issues in prior convolutional methods.

Details

Motivation: Fluence map prediction is an ill-posed inverse problem in radiotherapy planning. Existing convolutional methods struggle with long-range dependencies, leading to structurally inconsistent or physically unrealizable plans.

Method: A backbone-agnostic transformer framework with two-stage design: Stage 1 predicts global dose prior from anatomical inputs; Stage 2 conditions this prior on explicit beam geometry to regress physically calibrated fluence maps. Uses Fluence-Aware Regression (FAR) loss integrating voxel-level fidelity, gradient smoothness, structural consistency, and beam-wise energy conservation.

Result: FluenceFormer with Swin UNETR achieves strongest performance among evaluated models (Swin UNETR, UNETR, nnFormer, MedFormer), reducing Energy Error to 4.5% and yielding statistically significant gains in structural fidelity (p < 0.05) over existing CNN and single-stage methods.

Conclusion: FluenceFormer provides an effective transformer-based solution for fluence map prediction that addresses long-range dependency issues through its two-stage design and physics-informed loss function, demonstrating improved performance over convolutional approaches.

Abstract: Fluence map prediction is central to automated radiotherapy planning but remains an ill-posed inverse problem due to the complex relationship between volumetric anatomy and beam-intensity modulation. Convolutional methods in prior work often struggle to capture long-range dependencies, which can lead to structurally inconsistent or physically unrealizable plans. We introduce \textbf{FluenceFormer}, a backbone-agnostic transformer framework for direct, geometry-aware fluence regression. The model uses a unified two-stage design: Stage~~1 predicts a global dose prior from anatomical inputs, and Stage~~2 conditions this prior on explicit beam geometry to regress physically calibrated fluence maps. Central to the approach is the \textbf{Fluence-Aware Regression (FAR)} loss, a physics-informed objective that integrates voxel-level fidelity, gradient smoothness, structural consistency, and beam-wise energy conservation. We evaluate the generality of the framework across multiple transformer backbones, including Swin UNETR, UNETR, nnFormer, and MedFormer, using a prostate IMRT dataset. FluenceFormer with Swin UNETR achieves the strongest performance among the evaluated models and improves over existing benchmark CNN and single-stage methods, reducing Energy Error to $\mathbf{4.5%}$ and yielding statistically significant gains in structural fidelity ($p < 0.05$).

[161] EmoCtrl: Controllable Emotional Image Content Generation

Jingyuan Yang, Weibin Luo, Hui Huang

Main category: cs.CV

TL;DR: EmoCtrl enables controllable emotional image generation that maintains content fidelity while expressing target emotions, bridging the gap between content-consistent and emotion-aware models.

Details

Motivation: Current text-to-image models lack emotional awareness, while emotion-driven models sacrifice content accuracy. There's a need for models that can generate images faithful to content descriptions while expressing specific emotions.

Method: Proposes EmoCtrl with textual and visual emotion enhancement modules, using a dataset annotated with content, emotion, and affective prompts. The system learns emotion tokens that bridge abstract emotions to visual cues.

Result: EmoCtrl outperforms existing methods in content faithfulness and emotional expression, with user studies confirming strong human preference alignment. The learned emotion tokens show complementary effects and good generalization to creative applications.

Conclusion: EmoCtrl successfully addresses the C-EICG problem by achieving both content fidelity and emotional expressiveness, with learned emotion tokens demonstrating robustness and adaptability across various applications.

Abstract: An image conveys meaning through both its visual content and emotional tone, jointly shaping human perception. We introduce Controllable Emotional Image Content Generation (C-EICG), which aims to generate images that remain faithful to a given content description while expressing a target emotion. Existing text-to-image models ensure content consistency but lack emotional awareness, whereas emotion-driven models generate affective results at the cost of content distortion. To address this gap, we propose EmoCtrl, supported by a dataset annotated with content, emotion, and affective prompts, bridging abstract emotions to visual cues. EmoCtrl incorporates textual and visual emotion enhancement modules that enrich affective expression via descriptive semantics and perceptual cues. The learned emotion tokens exhibit complementary effects, as demonstrated through ablations and visualizations. Quantatitive and qualatitive experiments demonstrate that EmoCtrl achieves faithful content and expressive emotion control, outperforming existing methods across multiple aspects. User studies confirm EmoCtrl’s strong alignment with human preference. Moreover, EmoCtrl generalizes well to creative applications, further demonstrating the robustness and adaptability of the learned emotion tokens.

[162] SuperiorGAT: Graph Attention Networks for Sparse LiDAR Point Cloud Reconstruction in Autonomous Systems

Khalfalla Awedat, Mohamed Abidalrekab, Gurcan Comert, Mustafa Ayad

Main category: cs.CV

TL;DR: SuperiorGAT is a graph attention framework that reconstructs missing elevation information in sparse LiDAR point clouds using beam-aware graphs and gated residual fusion, achieving better reconstruction than existing methods without increasing network depth.

Details

Motivation: LiDAR perception in autonomous systems faces limitations due to fixed vertical beam resolution and beam dropout from environmental occlusions, leading to sparse and incomplete point clouds that hinder accurate perception.

Method: Models LiDAR scans as beam-aware graphs and uses graph attention networks with gated residual fusion and feed-forward refinement to reconstruct missing elevation information. Simulates structured beam dropout by removing every fourth vertical scanning beam for evaluation.

Result: Extensive experiments on KITTI environments show SuperiorGAT achieves lower reconstruction error and improved geometric consistency compared to PointNet-based models and deeper GAT baselines. X-Z projections confirm preservation of structural integrity with minimal vertical distortion.

Conclusion: Architectural refinement through SuperiorGAT offers a computationally efficient method for improving LiDAR resolution without requiring additional sensor hardware, addressing beam dropout and sparse point cloud issues in autonomous perception systems.

Abstract: LiDAR-based perception in autonomous systems is constrained by fixed vertical beam resolution and further compromised by beam dropout resulting from environmental occlusions. This paper introduces SuperiorGAT, a graph attention-based framework designed to reconstruct missing elevation information in sparse LiDAR point clouds. By modeling LiDAR scans as beam-aware graphs and incorporating gated residual fusion with feed-forward refinement, SuperiorGAT enables accurate reconstruction without increasing network depth. To evaluate performance, structured beam dropout is simulated by removing every fourth vertical scanning beam. Extensive experiments across diverse KITTI environments, including Person, Road, Campus, and City sequences, demonstrate that SuperiorGAT consistently achieves lower reconstruction error and improved geometric consistency compared to PointNet-based models and deeper GAT baselines. Qualitative X-Z projections further confirm the model’s ability to preserve structural integrity with minimal vertical distortion. These results suggest that architectural refinement offers a computationally efficient method for improving LiDAR resolution without requiring additional sensor hardware.

[163] LECalib: Line-Based Event Camera Calibration

Zibin Liu, Banglei Guana, Yang Shanga, Zhenbao Yu, Yifei Bian, Qifeng Yu

Main category: cs.CV

TL;DR: A line-based event camera calibration method that uses geometric lines from man-made objects instead of traditional calibration patterns, enabling faster calibration without manual setup.

Details

Motivation: Current event camera calibration methods are time-consuming and require manually placed calibration objects, which cannot meet the needs of rapidly changing scenarios. There's a need for more efficient calibration approaches that work in dynamic environments.

Method: Proposes a line-based calibration framework that detects lines directly from event streams using geometric lines from common man-made objects (doors, windows, boxes). Uses an event-line calibration model to generate initial camera parameter guesses, suitable for both planar and non-planar lines, followed by non-linear optimization refinement.

Result: The method demonstrates feasibility and accuracy in both simulation and real-world experiments, validated on monocular and stereo event cameras. The approach works effectively without traditional calibration patterns.

Conclusion: The proposed line-based calibration method provides an efficient alternative to traditional event camera calibration, enabling faster calibration in dynamic environments without requiring manual calibration objects.

Abstract: Camera calibration is an essential prerequisite for event-based vision applications. Current event camera calibration methods typically involve using flashing patterns, reconstructing intensity images, and utilizing the features extracted from events. Existing methods are generally time-consuming and require manually placed calibration objects, which cannot meet the needs of rapidly changing scenarios. In this paper, we propose a line-based event camera calibration framework exploiting the geometric lines of commonly-encountered objects in man-made environments, e.g., doors, windows, boxes, etc. Different from previous methods, our method detects lines directly from event streams and leverages an event-line calibration model to generate the initial guess of camera parameters, which is suitable for both planar and non-planar lines. Then, a non-linear optimization is adopted to refine camera parameters. Both simulation and real-world experiments have demonstrated the feasibility and accuracy of our method, with validation performed on monocular and stereo event cameras. The source code is released at https://github.com/Zibin6/line_based_event_camera_calib.

[164] Towards Robust Optical-SAR Object Detection under Missing Modalities: A Dynamic Quality-Aware Fusion Framework

Zhicheng Zhao, Yuancheng Xu, Andong Lu, Chenglong Li, Jin Tang

Main category: cs.CV

TL;DR: QDFNet is a quality-aware dynamic fusion network for robust optical-SAR object detection that handles missing or degraded modalities through learnable reference tokens and orthogonal constraint fusion.

Details

Motivation: Optical-SAR fusion for object detection faces practical limitations due to imaging differences, temporal asynchrony, and registration issues, leading to misaligned or missing modality data. Existing methods lack robustness to random missing modalities and effective mechanisms for consistent performance improvement in fusion-based detection.

Method: Proposes QDFNet with two key modules: 1) Dynamic Modality Quality Assessment (DMQA) using learnable reference tokens to iteratively assess feature reliability and identify degraded regions, and 2) Orthogonal Constraint Normalization Fusion (OCNF) that preserves modality independence while dynamically adjusting fusion weights based on reliability scores to suppress unreliable feature propagation.

Result: Extensive experiments on SpaceNet6-OTD and OGSOD-2.0 datasets demonstrate QDFNet’s superiority over state-of-the-art methods, particularly under partial modality corruption or missing data scenarios.

Conclusion: QDFNet effectively addresses the challenges of optical-SAR fusion by dynamically assessing feature quality and adaptively fusing modalities, providing robust object detection performance even with missing or degraded data.

Abstract: Optical and Synthetic Aperture Radar (SAR) fusion-based object detection has attracted significant research interest in remote sensing, as these modalities provide complementary information for all-weather monitoring. However, practical deployment is severely limited by inherent challenges. Due to distinct imaging mechanisms, temporal asynchrony, and registration difficulties, obtaining well-aligned optical-SAR image pairs remains extremely difficult, frequently resulting in missing or degraded modality data. Although recent approaches have attempted to address this issue, they still suffer from limited robustness to random missing modalities and lack effective mechanisms to ensure consistent performance improvement in fusion-based detection. To address these limitations, we propose a novel Quality-Aware Dynamic Fusion Network (QDFNet) for robust optical-SAR object detection. Our proposed method leverages learnable reference tokens to dynamically assess feature reliability and guide adaptive fusion in the presence of missing modalities. In particular, we design a Dynamic Modality Quality Assessment (DMQA) module that employs learnable reference tokens to iteratively refine feature reliability assessment, enabling precise identification of degraded regions and providing quality guidance for subsequent fusion. Moreover, we develop an Orthogonal Constraint Normalization Fusion (OCNF) module that employs orthogonal constraints to preserve modality independence while dynamically adjusting fusion weights based on reliability scores, effectively suppressing unreliable feature propagation. Extensive experiments on the SpaceNet6-OTD and OGSOD-2.0 datasets demonstrate the superiority and effectiveness of QDFNet compared to state-of-the-art methods, particularly under partial modality corruption or missing data scenarios.

[165] SonoVision: A Computer Vision Approach for Helping Visually Challenged Individuals Locate Objects with the Help of Sound Cues

Md Abu Obaida Zishan, Annajiat Alim Rasel

Main category: cs.CV

TL;DR: SonoVision is a smartphone app that helps visually impaired people locate everyday objects using directional sound cues through headphones.

Details

Motivation: Visually impaired individuals face significant challenges in locating objects, which hinders their independence and can lead to dangerous situations. There's a need to help them become more self-sufficient and reduce reliance on others.

Method: Built with Flutter, the app uses the Efficientdet-D2 model for object detection. It provides directional audio cues through headphones: sinusoidal sounds in left/right ear for objects on respective sides, and simultaneous sounds in both ears for objects directly in front.

Result: Developed a functional smartphone application that works completely offline, providing a safe and user-friendly way for visually impaired individuals to locate objects using audio cues.

Conclusion: SonoVision can significantly assist visually impaired people by helping them locate everyday objects independently through intuitive sound cues, potentially increasing their safety and self-sufficiency.

Abstract: Locating objects for the visually impaired is a significant challenge and is something no one can get used to over time. However, this hinders their independence and could push them towards risky and dangerous scenarios. Hence, in the spirit of making the visually challenged more self-sufficient, we present SonoVision, a smart-phone application that helps them find everyday objects using sound cues through earphones/headphones. This simply means, if an object is on the right or left side of a user, the app makes a sinusoidal sound in a user’s respective ear through ear/headphones. However, to indicate objects located directly in front, both the left and right earphones are rung simultaneously. These sound cues could easily help a visually impaired individual locate objects with the help of their smartphones and reduce the reliance on people in their surroundings, consequently making them more independent. This application is made with the flutter development platform and uses the Efficientdet-D2 model for object detection in the backend. We believe the app will significantly assist the visually impaired in a safe and user-friendly manner with its capacity to work completely offline. Our application can be accessed here https://github.com/MohammedZ666/SonoVision.git.

[166] SAM 3D for 3D Object Reconstruction from Remote Sensing Images

Junsheng Yao, Lichao Mou, Qingyu Li

Main category: cs.CV

TL;DR: SAM 3D foundation model outperforms TRELLIS for monocular building reconstruction from remote sensing imagery, producing better roof geometry and boundaries, with potential for urban scene modeling.

Details

Motivation: Existing methods for monocular 3D building reconstruction require task-specific architectures and intensive supervision, limiting scalability for urban modeling. There's a need for general-purpose foundation models that can handle this task effectively.

Method: Systematic evaluation of SAM 3D (general-purpose image-to-3D foundation model) against TRELLIS on NYC Urban Dataset samples using FID and CLIP-based Maximum Mean Discrepancy metrics. Extended SAM 3D to urban scene reconstruction via segment-reconstruct-compose pipeline.

Result: SAM 3D produces more coherent roof geometry and sharper boundaries compared to TRELLIS. The model shows potential for urban scene modeling through the segment-reconstruct-compose approach.

Conclusion: SAM 3D demonstrates superior performance for monocular building reconstruction, providing practical guidance for deploying foundation models in urban 3D reconstruction. Future work should integrate scene-level structural priors to enhance capabilities.

Abstract: Monocular 3D building reconstruction from remote sensing imagery is essential for scalable urban modeling, yet existing methods often require task-specific architectures and intensive supervision. This paper presents the first systematic evaluation of SAM 3D, a general-purpose image-to-3D foundation model, for monocular remote sensing building reconstruction. We benchmark SAM 3D against TRELLIS on samples from the NYC Urban Dataset, employing Frechet Inception Distance (FID) and CLIP-based Maximum Mean Discrepancy (CMMD) as evaluation metrics. Experimental results demonstrate that SAM 3D produces more coherent roof geometry and sharper boundaries compared to TRELLIS. We further extend SAM 3D to urban scene reconstruction through a segment-reconstruct-compose pipeline, demonstrating its potential for urban scene modeling. We also analyze practical limitations and discuss future research directions. These findings provide practical guidance for deploying foundation models in urban 3D reconstruction and motivate future integration of scene-level structural priors.

[167] Comparing Object Detection Models for Electrical Substation Component Mapping

Haley Mody, Namish Bansal, Dennies Kiprono Bor, Edward J. Oughton

Main category: cs.CV

TL;DR: This paper compares three computer vision models (YOLOv8, YOLOv11, RF-DETR) for autonomous detection and mapping of electrical substation components to improve grid vulnerability assessment.

Details

Motivation: Electrical substations are critical infrastructure vulnerable to various hazards, and manual mapping is time-consuming and labor-intensive. Autonomous computer vision solutions are needed for efficient vulnerability assessment of grid components.

Method: Train and compare three object detection models (YOLOv8, YOLOv11, RF-DETR) on a manually labeled dataset of US substation images, evaluating detection accuracy, precision, and efficiency.

Result: The models are evaluated for their performance in detecting substation components, with analysis of each model’s strengths and limitations. The best-performing model is identified for reliable large-scale substation component mapping.

Conclusion: Computer vision models can effectively automate substation component mapping, providing a scalable solution for grid vulnerability assessment. The research demonstrates a practical use case for machine learning in critical infrastructure monitoring.

Abstract: Electrical substations are a significant component of an electrical grid. Indeed, the assets at these substations (e.g., transformers) are prone to disruption from many hazards, including hurricanes, flooding, earthquakes, and geomagnetically induced currents (GICs). As electrical grids are considered critical national infrastructure, any failure can have significant economic and public safety implications. To help prevent and mitigate these failures, it is thus essential that we identify key substation components to quantify vulnerability. Unfortunately, traditional manual mapping of substation infrastructure is time-consuming and labor-intensive. Therefore, an autonomous solution utilizing computer vision models is preferable, as it allows for greater convenience and efficiency. In this research paper, we train and compare the outputs of 3 models (YOLOv8, YOLOv11, RF-DETR) on a manually labeled dataset of US substation images. Each model is evaluated for detection accuracy, precision, and efficiency. We present the key strengths and limitations of each model, identifying which provides reliable and large-scale substation component mapping. Additionally, we utilize these models to effectively map the various substation components in the United States, showcasing a use case for machine learning in substation mapping.

[168] Pose-Guided Residual Refinement for Interpretable Text-to-Motion Generation and Editing

Sukhyun Jeong, Yong-Hoon Choi

Main category: cs.CV

TL;DR: PGR²M introduces a hybrid motion representation combining interpretable pose codes with residual codes to improve text-based 3D motion generation and editing by capturing both global structure and fine-grained temporal details.

Details

Motivation: Existing pose-code-based frameworks like CoMo struggle to capture subtle temporal dynamics and high-frequency details due to their frame-wise representation, which degrades reconstruction fidelity and local controllability in text-based motion generation and editing.

Method: Proposes PGR²M, a hybrid representation that augments interpretable pose codes with residual codes learned via residual vector quantization (RVQ). Uses a pose-guided RVQ tokenizer to decompose motion into pose latents (coarse global structure) and residual latents (fine-grained temporal variations). Includes residual dropout to prevent over-reliance on residuals. Employs two Transformers: a base Transformer for pose code prediction from text, and a refine Transformer for residual code prediction conditioned on text, pose codes, and quantization stage.

Result: Experiments on HumanML3D and KIT-ML show improved Fréchet inception distance and reconstruction metrics for both generation and editing compared to CoMo and recent diffusion- and tokenization-based baselines. User studies confirm intuitive, structure-preserving motion edits.

Conclusion: PGR²M successfully addresses limitations of pose-code-based frameworks by introducing a hybrid representation that preserves semantic alignment and editability while capturing fine-grained temporal details, enabling better text-based motion generation and editing.

Abstract: Text-based 3D motion generation aims to automatically synthesize diverse motions from natural-language descriptions to extend user creativity, whereas motion editing modifies an existing motion sequence in response to text while preserving its overall structure. Pose-code-based frameworks such as CoMo map quantifiable pose attributes into discrete pose codes that support interpretable motion control, but their frame-wise representation struggles to capture subtle temporal dynamics and high-frequency details, often degrading reconstruction fidelity and local controllability. To address this limitation, we introduce pose-guided residual refinement for motion (PGR$^2$M), a hybrid representation that augments interpretable pose codes with residual codes learned via residual vector quantization (RVQ). A pose-guided RVQ tokenizer decomposes motion into pose latents that encode coarse global structure and residual latents that model fine-grained temporal variations. Residual dropout further discourages over-reliance on residuals, preserving the semantic alignment and editability of the pose codes. On top of this tokenizer, a base Transformer autoregressively predicts pose codes from text, and a refine Transformer predicts residual codes conditioned on text, pose codes, and quantization stage. Experiments on HumanML3D and KIT-ML show that PGR$^2$M improves Fréchet inception distance and reconstruction metrics for both generation and editing compared with CoMo and recent diffusion- and tokenization-based baselines, while user studies confirm that it enables intuitive, structure-preserving motion edits.

[169] Event-based high temporal resolution measurement of shock wave motion field

Taihang Lei, Banglei Guan, Minzu Liang, Pengju Sun, Jing Tao, Yang Shang, Qifeng Yu

Main category: cs.CV

TL;DR: A novel framework using multiple event cameras achieves high-precision measurement of shock wave motion parameters with high spatiotemporal resolution, enabling multi-angle measurement, motion field reconstruction, and explosive equivalence inversion.

Details

Motivation: Accurate measurement of shock wave motion parameters with high spatiotemporal resolution is essential for power field testing and damage assessment, but current methods face challenges from fast, uneven propagation and unstable testing conditions.

Method: A framework using multiple event cameras establishes a polar coordinate system to encode events revealing propagation patterns, extracts shock wave front events via iterative slope analysis, and derives geometric models for 3D reconstruction based on event-based optical imaging.

Result: The method achieves high-precision measurement with maximum error of 5.20% and minimum error of 0.06% compared to pressure sensors and empirical formulas, enabling multi-angle shock wave measurement, motion field reconstruction, and explosive equivalence inversion.

Conclusion: The proposed event camera-based framework represents significant progress in shock wave measurement, achieving both high spatial and temporal resolution for accurate motion field analysis in challenging conditions.

Abstract: Accurate measurement of shock wave motion parameters with high spatiotemporal resolution is essential for applications such as power field testing and damage assessment. However, significant challenges are posed by the fast, uneven propagation of shock waves and unstable testing conditions. To address these challenges, a novel framework is proposed that utilizes multiple event cameras to estimate the asymmetry of shock waves, leveraging its high-speed and high-dynamic range capabilities. Initially, a polar coordinate system is established, which encodes events to reveal shock wave propagation patterns, with adaptive region-of-interest (ROI) extraction through event offset calculations. Subsequently, shock wave front events are extracted using iterative slope analysis, exploiting the continuity of velocity changes. Finally, the geometric model of events and shock wave motion parameters is derived according to event-based optical imaging model, along with the 3D reconstruction model. Through the above process, multi-angle shock wave measurement, motion field reconstruction, and explosive equivalence inversion are achieved. The results of the speed measurement are compared with those of the pressure sensors and the empirical formula, revealing a maximum error of 5.20% and a minimum error of 0.06%. The experimental results demonstrate that our method achieves high-precision measurement of the shock wave motion field with both high spatial and temporal resolution, representing significant progress.

[170] Scalpel-SAM: A Semi-Supervised Paradigm for Adapting SAM to Infrared Small Object Detection

Zihan Liu, Xiangning Ren, Dezhang Kong, Yipeng Zhang, Meng Han

Main category: cs.CV

TL;DR: A semi-supervised paradigm for infrared small object detection using hierarchical MoE adapters to distill SAM into Scalpel-SAM, enabling efficient downstream models with minimal annotations.

Details

Motivation: Infrared small object detection suffers from high annotation costs, and existing methods like SAM face domain gaps, inability to encode physical priors, and architectural complexity.

Method: Two-stage paradigm: 1) Prior-Guided Knowledge Distillation using hierarchical MoE adapter with 10% supervised data to distill SAM into Scalpel-SAM expert teacher; 2) Deployment-Oriented Knowledge Transfer using Scalpel-SAM to generate pseudo labels for training lightweight downstream models.

Result: With minimal annotations, downstream models achieve performance comparable to or surpassing their fully supervised counterparts.

Conclusion: First semi-supervised paradigm that systematically addresses data scarcity in IR-SOT using SAM as teacher model, enabling efficient deployment with minimal annotation requirements.

Abstract: Infrared small object detection urgently requires semi-supervised paradigms due to the high cost of annotation. However, existing methods like SAM face significant challenges of domain gaps, inability of encoding physical priors, and inherent architectural complexity. To address this, we designed a Hierarchical MoE Adapter consisting of four white-box neural operators. Building upon this core component, we propose a two-stage paradigm for knowledge distillation and transfer: (1) Prior-Guided Knowledge Distillation, where we use our MoE adapter and 10% of available fully supervised data to distill SAM into an expert teacher (Scalpel-SAM); and (2) Deployment-Oriented Knowledge Transfer, where we use Scalpel-SAM to generate pseudo labels for training lightweight and efficient downstream models. Experiments demonstrate that with minimal annotations, our paradigm enables downstream models to achieve performance comparable to, or even surpassing, their fully supervised counterparts. To our knowledge, this is the first semi-supervised paradigm that systematically addresses the data scarcity issue in IR-SOT using SAM as the teacher model.

[171] Tracking by Predicting 3-D Gaussians Over Time

Tanish Baranwal, Himanshu Gaurav Singh, Jathushan Rajasegaran, Jitendra Malik

Main category: cs.CV

TL;DR: Video-GMAE uses Gaussian splats to represent videos as dynamic 3D scenes, enabling self-supervised learning with emergent tracking capabilities that achieve state-of-the-art performance.

Details

Motivation: The paper aims to develop a self-supervised video representation learning approach that leverages the natural 3D structure underlying 2D videos, using Gaussian representations to capture dynamic scene properties.

Method: Proposes Video Gaussian Masked Autoencoders (Video-GMAE) that encode video sequences into sets of Gaussian splats moving over time, enforcing the inductive bias that 2D videos are projections of dynamic 3D scenes.

Result: The model shows emergent tracking behavior from pretraining alone, achieving zero-shot tracking comparable to SOTA. With finetuning, it achieves 34.6% improvement on Kinetics and 13.1% on Kubric datasets, surpassing existing self-supervised video approaches.

Conclusion: Representing videos as dynamic Gaussian sets provides an effective self-supervised learning framework that naturally captures 3D scene dynamics and enables strong video understanding and tracking capabilities.

Abstract: We propose Video Gaussian Masked Autoencoders (Video-GMAE), a self-supervised approach for representation learning that encodes a sequence of images into a set of Gaussian splats moving over time. Representing a video as a set of Gaussians enforces a reasonable inductive bias: that 2-D videos are often consistent projections of a dynamic 3-D scene. We find that tracking emerges when pretraining a network with this architecture. Mapping the trajectory of the learnt Gaussians onto the image plane gives zero-shot tracking performance comparable to state-of-the-art. With small-scale finetuning, our models achieve 34.6% improvement on Kinetics, and 13.1% on Kubric datasets, surpassing existing self-supervised video approaches. The project page and code are publicly available at https://videogmae.org/ and https://github.com/tekotan/video-gmae.

[172] SCAFusion: A Multimodal 3D Detection Framework for Small Object Detection in Lunar Surface Exploration

Xin Chen, Kang Luo, Yangyi Xiao, Hesheng Wang

Main category: cs.CV

TL;DR: SCAFusion is a multimodal 3D object detection model for lunar robotics that improves small, irregular object detection through cognitive adapters, contrastive alignment, and section-aware coordinate attention, achieving significant performance gains over baselines.

Details

Motivation: Existing multimodal 3D perception methods for autonomous driving perform poorly in lunar environments due to poor feature alignment, limited multimodal synergy, and weak small object detection capabilities, which is critical for detecting meteor fragments and rocks during lunar surface exploration.

Method: Built on BEVFusion framework, SCAFusion integrates: 1) Cognitive Adapter for efficient camera backbone tuning, 2) Contrastive Alignment Module for camera-LiDAR feature consistency, 3) Camera Auxiliary Training Branch for visual representation strengthening, and 4) Section-aware Coordinate Attention mechanism for small, irregular target detection.

Result: Achieves 69.7% mAP and 72.1% NDS on nuScenes validation set (5.0% and 2.7% improvement over baseline). In simulated lunar environments on Isaac Sim, achieves 90.93% mAP (11.5% improvement), with notable gains in detecting small meteor-like obstacles.

Conclusion: SCAFusion effectively addresses the limitations of existing multimodal 3D perception methods for lunar environments, providing reliable detection of small, irregular objects with minimal parameter/computation overhead, making it suitable for lunar robotic missions.

Abstract: Reliable and precise detection of small and irregular objects, such as meteor fragments and rocks, is critical for autonomous navigation and operation in lunar surface exploration. Existing multimodal 3D perception methods designed for terrestrial autonomous driving often underperform in off world environments due to poor feature alignment, limited multimodal synergy, and weak small object detection. This paper presents SCAFusion, a multimodal 3D object detection model tailored for lunar robotic missions. Built upon the BEVFusion framework, SCAFusion integrates a Cognitive Adapter for efficient camera backbone tuning, a Contrastive Alignment Module to enhance camera LiDAR feature consistency, a Camera Auxiliary Training Branch to strengthen visual representation, and most importantly, a Section aware Coordinate Attention mechanism explicitly designed to boost the detection performance of small, irregular targets. With negligible increase in parameters and computation, our model achieves 69.7% mAP and 72.1% NDS on the nuScenes validation set, improving the baseline by 5.0% and 2.7%, respectively. In simulated lunar environments built on Isaac Sim, SCAFusion achieves 90.93% mAP, outperforming the baseline by 11.5%, with notable gains in detecting small meteor like obstacles.

[173] DreamOmni3: Scribble-based Editing and Generation

Bin Xia, Bohao Peng, Jiyang Liu, Sitong Wu, Jingyao Li, Junjia Huang, Xu Zhao, Yitong Wang, Ruihang Chu, Bei Yu, Jiaya Jia

Main category: cs.CV

TL;DR: DreamOmni3 introduces scribble-based editing and generation tasks using multimodal inputs (text, images, sketches) with a novel joint input scheme that avoids binary masks, achieving state-of-the-art performance.

Details

Motivation: Existing unified generation/editing models rely heavily on text prompts, which often fail to capture precise edit locations and fine-grained visual details that users intend. There's a need for more flexible creation tools that combine textual, image, and freehand sketch inputs.

Method: Proposes two main components: 1) Data synthesis pipeline for scribble-based editing (4 tasks) and generation (3 tasks) using extracted editable regions overlaid with hand-drawn elements, and 2) A framework using joint input scheme that feeds both original and scribbled source images with different colors and shared encodings for precise localization.

Result: DreamOmni3 achieves outstanding performance on comprehensive benchmarks for scribble-based editing and generation tasks, demonstrating superior capability in handling complex edits involving multiple scribbles, images, and instructions.

Conclusion: The proposed approach enables more flexible and precise multimodal creation by overcoming limitations of text-only prompts and binary masks, establishing new benchmarks for scribble-based editing/generation research with publicly released models and code.

Abstract: Recently unified generation and editing models have achieved remarkable success with their impressive performance. These models rely mainly on text prompts for instruction-based editing and generation, but language often fails to capture users intended edit locations and fine-grained visual details. To this end, we propose two tasks: scribble-based editing and generation, that enables more flexible creation on graphical user interface (GUI) combining user textual, images, and freehand sketches. We introduce DreamOmni3, tackling two challenges: data creation and framework design. Our data synthesis pipeline includes two parts: scribble-based editing and generation. For scribble-based editing, we define four tasks: scribble and instruction-based editing, scribble and multimodal instruction-based editing, image fusion, and doodle editing. Based on DreamOmni2 dataset, we extract editable regions and overlay hand-drawn boxes, circles, doodles or cropped image to construct training data. For scribble-based generation, we define three tasks: scribble and instruction-based generation, scribble and multimodal instruction-based generation, and doodle generation, following similar data creation pipelines. For the framework, instead of using binary masks, which struggle with complex edits involving multiple scribbles, images, and instructions, we propose a joint input scheme that feeds both the original and scribbled source images into the model, using different colors to distinguish regions and simplify processing. By applying the same index and position encodings to both images, the model can precisely localize scribbled regions while maintaining accurate editing. Finally, we establish comprehensive benchmarks for these tasks to promote further research. Experimental results demonstrate that DreamOmni3 achieves outstanding performance, and models and code will be publicly released.

[174] CoAgent: Collaborative Planning and Consistency Agent for Coherent Video Generation

Qinglin Zeng, Kaitong Cai, Ruiqi Chen, Qinhan Lv, Keze Wang

Main category: cs.CV

TL;DR: CoAgent: A collaborative closed-loop framework for coherent long-form video generation that addresses identity drift and scene inconsistency through plan-synthesize-verify pipeline with entity memory and consistency verification.

Details

Motivation: Existing text-to-video models treat shots independently, causing identity drift, scene inconsistency, and unstable temporal structure in open-domain video generation. There's a need for better narrative coherence and visual consistency.

Method: Plan-synthesize-verify pipeline: Storyboard Planner decomposes prompts into shot-level plans; Global Context Manager maintains entity memory; Synthesis Module generates shots with Visual Consistency Controller; Verifier Agent detects inconsistencies; pacing-aware editor refines temporal rhythm.

Result: Extensive experiments show CoAgent significantly improves coherence, visual consistency, and narrative quality in long-form video generation compared to existing approaches.

Conclusion: CoAgent effectively addresses coherence challenges in open-domain video generation through collaborative closed-loop framework with explicit planning, entity memory, and verification mechanisms.

Abstract: Maintaining narrative coherence and visual consistency remains a central challenge in open-domain video generation. Existing text-to-video models often treat each shot independently, resulting in identity drift, scene inconsistency, and unstable temporal structure. We propose CoAgent, a collaborative and closed-loop framework for coherent video generation that formulates the process as a plan-synthesize-verify pipeline. Given a user prompt, style reference, and pacing constraints, a Storyboard Planner decomposes the input into structured shot-level plans with explicit entities, spatial relations, and temporal cues. A Global Context Manager maintains entity-level memory to preserve appearance and identity consistency across shots. Each shot is then generated by a Synthesis Module under the guidance of a Visual Consistency Controller, while a Verifier Agent evaluates intermediate results using vision-language reasoning and triggers selective regeneration when inconsistencies are detected. Finally, a pacing-aware editor refines temporal rhythm and transitions to match the desired narrative flow. Extensive experiments demonstrate that CoAgent significantly improves coherence, visual consistency, and narrative quality in long-form video generation.

[175] Self-Rewarded Multimodal Coherent Reasoning Across Diverse Visual Domains

Jesen Zhang, Ningyuan Liu, Kaitong Cai, Sidi Liu, Jing Yang, Ziliang Chen, Xiaofei Sun, Keze Wang

Main category: cs.CV

TL;DR: SR-MCR is a lightweight, label-free framework that improves multimodal LLM reasoning reliability by using self-referential cues from model outputs to align intermediate reasoning steps, achieving state-of-the-art performance on visual benchmarks.

Details

Motivation: Multimodal LLMs often produce fluent but unreliable reasoning with weak step-to-step coherence and insufficient visual grounding, because existing alignment approaches only supervise final answers while ignoring intermediate reasoning reliability.

Method: SR-MCR uses five self-referential cues (semantic alignment, lexical fidelity, non-redundancy, visual grounding, step consistency) to create a normalized reliability-weighted reward for process-level guidance. It employs a critic-free GRPO objective with confidence-aware cooling mechanism to stabilize training and suppress trivial/overconfident generations.

Result: Built on Qwen2.5-VL, SR-MCR improves both answer accuracy and reasoning coherence across visual benchmarks. SR-MCR-7B achieves state-of-the-art performance among open-source models of comparable size with 81.4% average accuracy. Ablation studies confirm contributions of each reward term and cooling module.

Conclusion: SR-MCR provides an effective, lightweight approach to align multimodal reasoning by exploiting intrinsic process signals, addressing the reliability gap in intermediate reasoning while maintaining strong final answer performance.

Abstract: Multimodal LLMs often produce fluent yet unreliable reasoning, exhibiting weak step-to-step coherence and insufficient visual grounding, largely because existing alignment approaches supervise only the final answer while ignoring the reliability of the intermediate reasoning process. We introduce SR-MCR, a lightweight and label-free framework that aligns reasoning by exploiting intrinsic process signals derived directly from model outputs. Five self-referential cues – semantic alignment, lexical fidelity, non-redundancy, visual grounding, and step consistency – are integrated into a normalized, reliability-weighted reward that provides fine-grained process-level guidance. A critic-free GRPO objective, enhanced with a confidence-aware cooling mechanism, further stabilizes training and suppresses trivial or overly confident generations. Built on Qwen2.5-VL, SR-MCR improves both answer accuracy and reasoning coherence across a broad set of visual benchmarks; among open-source models of comparable size, SR-MCR-7B achieves state-of-the-art performance with an average accuracy of 81.4%. Ablation studies confirm the independent contributions of each reward term and the cooling module.

[176] ReFRM3D: A Radiomics-enhanced Fused Residual Multiparametric 3D Network with Multi-Scale Feature Fusion for Glioma Characterization

Md. Abdur Rahman, Mohaimenul Azam Khan Raiaan, Arefin Ittesafun Abian, Yan Zhang, Mirjam Jonkman, Sami Azam

Main category: cs.CV

TL;DR: Proposed ReFRM3D network with radiomics-enhanced features for brain tumor segmentation and classification, achieving state-of-the-art performance on BraTS datasets.

Details

Motivation: Address challenges in glioma diagnosis including high variability in imaging data, inadequate computational optimization, and inefficient segmentation/classification of gliomas.

Method: Developed radiomics-enhanced fused residual multiparametric 3D network (ReFRM3D) based on 3D U-Net with multi-scale feature fusion, hybrid upsampling, and extended residual skip mechanism, plus multi-feature tumor marker classifier using radiomic features.

Result: Achieved high Dice Similarity Coefficients: BraTS2019 - 94.04% WT, 92.68% ET, 93.64% TC; BraTS2020 - 94.09% WT, 92.91% ET, 93.84% TC; BraTS2021 - 93.70% WT, 90.36% ET, 92.13% TC.

Conclusion: The proposed ReFRM3D network with radiomics enhancement significantly improves glioma segmentation and classification performance, addressing key challenges in brain tumor characterization.

Abstract: Gliomas are among the most aggressive cancers, characterized by high mortality rates and complex diagnostic processes. Existing studies on glioma diagnosis and classification often describe issues such as high variability in imaging data, inadequate optimization of computational resources, and inefficient segmentation and classification of gliomas. To address these challenges, we propose novel techniques utilizing multi-parametric MRI data to enhance tumor segmentation and classification efficiency. Our work introduces the first-ever radiomics-enhanced fused residual multiparametric 3D network (ReFRM3D) for brain tumor characterization, which is based on a 3D U-Net architecture and features multi-scale feature fusion, hybrid upsampling, and an extended residual skip mechanism. Additionally, we propose a multi-feature tumor marker-based classifier that leverages radiomic features extracted from the segmented regions. Experimental results demonstrate significant improvements in segmentation performance across the BraTS2019, BraTS2020, and BraTS2021 datasets, achieving high Dice Similarity Coefficients (DSC) of 94.04%, 92.68%, and 93.64% for whole tumor (WT), enhancing tumor (ET), and tumor core (TC) respectively in BraTS2019; 94.09%, 92.91%, and 93.84% in BraTS2020; and 93.70%, 90.36%, and 92.13% in BraTS2021.

[177] KV-Tracker: Real-Time Pose Tracking with Transformers

Marwan Taher, Ignacio Alzugaray, Kirill Mazur, Xin Kong, Andrew J. Davison

Main category: cs.CV

TL;DR: KV-Tracker enables real-time 6-DoF pose tracking and online reconstruction from monocular RGB videos by caching key-value pairs from multi-view geometry networks, achieving 15× speedup and ~27 FPS.

Details

Motivation: Multi-view 3D geometry networks provide strong priors but are too slow for real-time applications. There's a need to adapt these powerful models for online use in pose tracking and reconstruction without sacrificing accuracy.

Method: The method selects and manages keyframes for scene mapping using π³ with bidirectional attention. It caches the global self-attention block’s key-value (KV) pairs and uses them as the sole scene representation for online tracking. This caching strategy is model-agnostic and doesn’t require retraining.

Result: Achieves up to 15× speedup during inference while avoiding drift or catastrophic forgetting. Maintains high frame-rates up to ~27 FPS. Demonstrates strong performance on TUM RGB-D, 7-Scenes, Arctic and OnePose datasets for both scene-level and object tracking/reconstruction without depth or object priors.

Conclusion: KV-Tracker successfully adapts powerful but slow multi-view geometry networks for real-time applications through KV caching, enabling efficient online pose tracking and reconstruction from monocular RGB videos without compromising accuracy.

Abstract: Multi-view 3D geometry networks offer a powerful prior but are prohibitively slow for real-time applications. We propose a novel way to adapt them for online use, enabling real-time 6-DoF pose tracking and online reconstruction of objects and scenes from monocular RGB videos. Our method rapidly selects and manages a set of images as keyframes to map a scene or object via $π^3$ with full bidirectional attention. We then cache the global self-attention block’s key-value (KV) pairs and use them as the sole scene representation for online tracking. This allows for up to $15\times$ speedup during inference without the fear of drift or catastrophic forgetting. Our caching strategy is model-agnostic and can be applied to other off-the-shelf multi-view networks without retraining. We demonstrate KV-Tracker on both scene-level tracking and the more challenging task of on-the-fly object tracking and reconstruction without depth measurements or object priors. Experiments on the TUM RGB-D, 7-Scenes, Arctic and OnePose datasets show the strong performance of our system while maintaining high frame-rates up to ${\sim}27$ FPS.

[178] PTalker: Personalized Speech-Driven 3D Talking Head Animation via Style Disentanglement and Modality Alignment

Bin Wang, Yang Xu, Huan Zhao, Hao Zhang, Zixing Zhang

Main category: cs.CV

TL;DR: PTalker is a framework for personalized 3D talking head generation that preserves individual speaking styles through style-content disentanglement and improves lip-sync accuracy via three-level audio-mesh alignment.

Details

Motivation: Existing speech-driven 3D talking head methods focus on lip synchronization but overlook individual speaking style nuances, limiting personalization and realism in generated animations.

Method: Style-content disentanglement from audio and facial motion sequences using disentanglement constraints, plus three-level alignment: spatial (Graph Attention Networks for mesh structure), temporal (cross-attention for synchronization), and feature (bidirectional contrastive losses and KL divergence).

Result: PTalker outperforms state-of-the-art methods, generating realistic, stylized 3D talking heads that accurately match identity-specific speaking styles, as demonstrated through extensive experiments on public datasets.

Conclusion: The framework successfully addresses the limitation of existing methods by preserving speaking styles while maintaining high lip-synchronization accuracy, enabling more personalized and realistic 3D talking head animation.

Abstract: Speech-driven 3D talking head generation aims to produce lifelike facial animations precisely synchronized with speech. While considerable progress has been made in achieving high lip-synchronization accuracy, existing methods largely overlook the intricate nuances of individual speaking styles, which limits personalization and realism. In this work, we present a novel framework for personalized 3D talking head animation, namely “PTalker”. This framework preserves speaking style through style disentanglement from audio and facial motion sequences and enhances lip-synchronization accuracy through a three-level alignment mechanism between audio and mesh modalities. Specifically, to effectively disentangle style and content, we design disentanglement constraints that encode driven audio and motion sequences into distinct style and content spaces to enhance speaking style representation. To improve lip-synchronization accuracy, we adopt a modality alignment mechanism incorporating three aspects: spatial alignment using Graph Attention Networks to capture vertex connectivity in the 3D mesh structure, temporal alignment using cross-attention to capture and synchronize temporal dependencies, and feature alignment by top-k bidirectional contrastive losses and KL divergence constraints to ensure consistency between speech and mesh modalities. Extensive qualitative and quantitative experiments on public datasets demonstrate that PTalker effectively generates realistic, stylized 3D talking heads that accurately match identity-specific speaking styles, outperforming state-of-the-art methods. The source code and supplementary videos are available at: PTalker.

[179] Enhancing Noise Resilience in Face Clustering via Sparse Differential Transformer

Dafeng Zhang, Yongqi Song, Shizhuo Liu

Main category: cs.CV

TL;DR: Proposes a Sparse Differential Transformer (SDT) with prediction-driven Top-K Jaccard similarity for more accurate face clustering by enhancing neighbor purity and reducing noise in similarity measurements.

Details

Motivation: Existing face clustering methods using Jaccard similarity coefficients introduce too many irrelevant nodes, resulting in limited discriminative power and suboptimal clustering performance. Current approaches also struggle with accurately predicting the optimal number of neighbors (Top-K) and vanilla Transformers introduce noise by overemphasizing irrelevant feature relationships.

Method: 1) Prediction-driven Top-K Jaccard similarity coefficient to enhance neighbor purity; 2) Transformer-based prediction model to examine relationships between central node and neighbors near Top-K; 3) Sparse Differential Transformer (SDT) instead of vanilla Transformer to eliminate noise and enhance anti-noise capabilities.

Result: Extensive experiments on multiple datasets including MS-Celeb-1M demonstrate state-of-the-art (SOTA) performance, outperforming existing methods and providing a more robust solution for face clustering.

Conclusion: The proposed SDT with prediction-driven Top-K Jaccard similarity effectively addresses limitations of existing face clustering methods by improving neighbor purity, enhancing similarity measurement reliability, and reducing noise, resulting in superior clustering performance.

Abstract: The method used to measure relationships between face embeddings plays a crucial role in determining the performance of face clustering. Existing methods employ the Jaccard similarity coefficient instead of the cosine distance to enhance the measurement accuracy. However, these methods introduce too many irrelevant nodes, producing Jaccard coefficients with limited discriminative power and adversely affecting clustering performance. To address this issue, we propose a prediction-driven Top-K Jaccard similarity coefficient that enhances the purity of neighboring nodes, thereby improving the reliability of similarity measurements. Nevertheless, accurately predicting the optimal number of neighbors (Top-K) remains challenging, leading to suboptimal clustering results. To overcome this limitation, we develop a Transformer-based prediction model that examines the relationships between the central node and its neighboring nodes near the Top-K to further enhance the reliability of similarity estimation. However, vanilla Transformer, when applied to predict relationships between nodes, often introduces noise due to their overemphasis on irrelevant feature relationships. To address these challenges, we propose a Sparse Differential Transformer (SDT), instead of the vanilla Transformer, to eliminate noise and enhance the model’s anti-noise capabilities. Extensive experiments on multiple datasets, such as MS-Celeb-1M, demonstrate that our approach achieves state-of-the-art (SOTA) performance, outperforming existing methods and providing a more robust solution for face clustering.

[180] Dream-VL & Dream-VLA: Open Vision-Language and Vision-Language-Action Models with Diffusion Language Model Backbone

Jiacheng Ye, Shansan Gong, Jiahui Gao, Junming Fan, Shuang Wu, Wei Bi, Haoli Bai, Lifeng Shang, Lingpeng Kong

Main category: cs.CV

TL;DR: Dream-VL is a diffusion-based vision-language model that outperforms autoregressive VLMs on visual planning tasks, and Dream-VLA extends this to vision-language-action tasks with superior performance on robotic benchmarks.

Details

Motivation: Autoregressive VLMs have limitations in complex visual planning and robotic control due to sequential generation. The authors investigate diffusion-based LLMs as a foundation for VLMs to overcome these limitations.

Method: Develop Dream-VL as an open diffusion-based VLM, then extend it to Dream-VLA through continuous pre-training on open robotic datasets. Leverages the bidirectional nature of diffusion models for action chunking and parallel generation.

Result: Dream-VL achieves SOTA among dVLMs and comparable performance to top AR-based VLMs. Dream-VLA achieves 97.2% on LIBERO, 71.4% on SimplerEnv-Bridge, and 60.5% on SimplerEnv-Fractal, surpassing leading models like π₀ and GR00T-N1.

Conclusion: Diffusion-based VLMs/VLAs offer superior potential for visual planning and robotic control tasks compared to autoregressive models, with faster convergence and better performance on complex benchmarks.

Abstract: While autoregressive Large Vision-Language Models (VLMs) have achieved remarkable success, their sequential generation often limits their efficacy in complex visual planning and dynamic robotic control. In this work, we investigate the potential of constructing Vision-Language Models upon diffusion-based large language models (dLLMs) to overcome these limitations. We introduce Dream-VL, an open diffusion-based VLM (dVLM) that achieves state-of-the-art performance among previous dVLMs. Dream-VL is comparable to top-tier AR-based VLMs trained on open data on various benchmarks but exhibits superior potential when applied to visual planning tasks. Building upon Dream-VL, we introduce Dream-VLA, a dLLM-based Vision-Language-Action model (dVLA) developed through continuous pre-training on open robotic datasets. We demonstrate that the natively bidirectional nature of this diffusion backbone serves as a superior foundation for VLA tasks, inherently suited for action chunking and parallel generation, leading to significantly faster convergence in downstream fine-tuning. Dream-VLA achieves top-tier performance of 97.2% average success rate on LIBERO, 71.4% overall average on SimplerEnv-Bridge, and 60.5% overall average on SimplerEnv-Fractal, surpassing leading models such as $π_0$ and GR00T-N1. We also validate that dVLMs surpass AR baselines on downstream tasks across different training objectives. We release both Dream-VL and Dream-VLA to facilitate further research in the community.

[181] Rethinking Memory Design in SAM-Based Visual Object Tracking

Mohamad Alansari, Muzammal Naseer, Hasan Al Marzouqi, Naoufel Werghi, Sajid Javed

Main category: cs.CV

TL;DR: Systematic memory study for SAM-based visual object tracking, proposing unified hybrid memory framework that improves robustness across SAM2 and SAM3 backbones.

Details

Motivation: Existing SAM-based trackers address memory limitations in method-specific ways without understanding broader design principles, and it's unclear how memory mechanisms transfer to next-generation models like SAM3.

Method: Analyzed representative SAM2-based trackers, reimplemented memory mechanisms in SAM3 framework, conducted large-scale evaluations across 10 benchmarks, and proposed unified hybrid memory framework with short-term appearance memory and long-term distractor-resolving memory.

Result: The proposed framework consistently improves robustness under long-term occlusion, complex motion, and distractor-heavy scenarios on both SAM2 and SAM3 backbones.

Conclusion: Memory design principles are crucial for SAM-based tracking, and the proposed hybrid memory framework provides a modular, principled approach that works across different foundation model generations.

Abstract: \noindent Memory has become the central mechanism enabling robust visual object tracking in modern segmentation-based frameworks. Recent methods built upon Segment Anything Model 2 (SAM2) have demonstrated strong performance by refining how past observations are stored and reused. However, existing approaches address memory limitations in a method-specific manner, leaving the broader design principles of memory in SAM-based tracking poorly understood. Moreover, it remains unclear how these memory mechanisms transfer to stronger, next-generation foundation models such as Segment Anything Model 3 (SAM3). In this work, we present a systematic memory-centric study of SAM-based visual object tracking. We first analyze representative SAM2-based trackers and show that most methods primarily differ in how short-term memory frames are selected, while sharing a common object-centric representation. Building on this insight, we faithfully reimplement these memory mechanisms within the SAM3 framework and conduct large-scale evaluations across ten diverse benchmarks, enabling a controlled analysis of memory design independent of backbone strength. Guided by our empirical findings, we propose a unified hybrid memory framework that explicitly decomposes memory into short-term appearance memory and long-term distractor-resolving memory. This decomposition enables the integration of existing memory policies in a modular and principled manner. Extensive experiments demonstrate that the proposed framework consistently improves robustness under long-term occlusion, complex motion, and distractor-heavy scenarios on both SAM2 and SAM3 backbones. Code is available at: https://github.com/HamadYA/SAM3_Tracking_Zoo. \textbf{This is a preprint. Some results are being finalized and may be updated in a future revision.}

[182] Envision: Embodied Visual Planning via Goal-Imagery Video Diffusion

Yuming Gu, Yizhi Wang, Yining Hong, Yipeng Gao, Hao Jiang, Angtian Wang, Bo Liu, Nathaniel S. Dennler, Zhengfei Kuang, Hao Li, Gordon Wetzstein, Chongyang Ma

Main category: cs.CV

TL;DR: Envision is a diffusion-based visual planning framework that generates goal-aligned video trajectories for embodied agents by first synthesizing goal images and then interpolating between start and goal states.

Details

Motivation: Existing video diffusion models for embodied visual planning are forward predictive without explicit goal modeling, leading to spatial drift and goal misalignment. There's a need for goal-constrained trajectory generation that ensures physical plausibility and goal consistency.

Method: Two-stage approach: 1) Goal Imagery Model identifies task-relevant regions, performs region-aware cross attention, and synthesizes coherent goal images; 2) Env-Goal Video Model (based on FL2V - first-and-last-frame-conditioned video diffusion) interpolates between initial observation and goal image to produce smooth, physically plausible video trajectories.

Result: Superior goal alignment, spatial consistency, and object preservation compared to baselines on object manipulation and image editing benchmarks. Generated visual plans can directly support downstream robotic planning and control.

Conclusion: Envision effectively addresses goal misalignment in visual planning by explicitly constraining generation with goal images, producing reliable trajectories for embodied agents through a two-stage diffusion framework.

Abstract: Embodied visual planning aims to enable manipulation tasks by imagining how a scene evolves toward a desired goal and using the imagined trajectories to guide actions. Video diffusion models, through their image-to-video generation capability, provide a promising foundation for such visual imagination. However, existing approaches are largely forward predictive, generating trajectories conditioned on the initial observation without explicit goal modeling, thus often leading to spatial drift and goal misalignment. To address these challenges, we propose Envision, a diffusion-based framework that performs visual planning for embodied agents. By explicitly constraining the generation with a goal image, our method enforces physical plausibility and goal consistency throughout the generated trajectory. Specifically, Envision operates in two stages. First, a Goal Imagery Model identifies task-relevant regions, performs region-aware cross attention between the scene and the instruction, and synthesizes a coherent goal image that captures the desired outcome. Then, an Env-Goal Video Model, built upon a first-and-last-frame-conditioned video diffusion model (FL2V), interpolates between the initial observation and the goal image, producing smooth and physically plausible video trajectories that connect the start and goal states. Experiments on object manipulation and image editing benchmarks demonstrate that Envision achieves superior goal alignment, spatial consistency, and object preservation compared to baselines. The resulting visual plans can directly support downstream robotic planning and control, providing reliable guidance for embodied agents.

[183] FinPercep-RM: A Fine-grained Reward Model and Co-evolutionary Curriculum for RL-based Real-world Super-Resolution

Yidi Liu, Zihao Fan, Jie Huang, Jie Xiao, Dong Li, Wenlong Zhang, Lei Bai, Xueyang Fu, Zheng-Jun Zha

Main category: cs.CV

TL;DR: Proposes FinPercep-RM, a fine-grained perceptual reward model for RLHF in image super-resolution, with co-evolutionary curriculum learning to address training instability and reward hacking.

Details

Motivation: Traditional IQA models output single global scores insensitive to local distortions, allowing ISR models to produce undesirable artifacts that get spurious high scores (reward hacking), misaligning optimization with perceptual quality.

Method: 1) FinPercep-RM: Encoder-Decoder architecture providing both global quality score and Perceptual Degradation Map for spatial localization of defects. 2) FGR-30k dataset with diverse distortions for training. 3) Co-evolutionary Curriculum Learning (CCL): synchronized curricula where reward model complexity increases while ISR model starts with simple global reward, gradually transitioning to complex outputs.

Result: Experiments validate effectiveness across ISR models in both global quality and local realism on RLHF methods, addressing reward hacking and training instability.

Conclusion: The proposed FinPercep-RM with CCL mechanism enables stable RLHF training for ISR while suppressing reward hacking, improving both global quality and local realism through fine-grained perceptual assessment.

Abstract: Reinforcement Learning with Human Feedback (RLHF) has proven effective in image generation field guided by reward models to align human preferences. Motivated by this, adapting RLHF for Image Super-Resolution (ISR) tasks has shown promise in optimizing perceptual quality with Image Quality Assessment (IQA) model as reward models. However, the traditional IQA model usually output a single global score, which are exceptionally insensitive to local and fine-grained distortions. This insensitivity allows ISR models to produce perceptually undesirable artifacts that yield spurious high scores, misaligning optimization objectives with perceptual quality and results in reward hacking. To address this, we propose a Fine-grained Perceptual Reward Model (FinPercep-RM) based on an Encoder-Decoder architecture. While providing a global quality score, it also generates a Perceptual Degradation Map that spatially localizes and quantifies local defects. We specifically introduce the FGR-30k dataset to train this model, consisting of diverse and subtle distortions from real-world super-resolution models. Despite the success of the FinPercep-RM model, its complexity introduces significant challenges in generator policy learning, leading to training instability. To address this, we propose a Co-evolutionary Curriculum Learning (CCL) mechanism, where both the reward model and the ISR model undergo synchronized curricula. The reward model progressively increases in complexity, while the ISR model starts with a simpler global reward for rapid convergence, gradually transitioning to the more complex model outputs. This easy-to-hard strategy enables stable training while suppressing reward hacking. Experiments validates the effectiveness of our method across ISR models in both global quality and local realism on RLHF methods.

[184] Visual Autoregressive Modelling for Monocular Depth Estimation

Amir El-Ghoussani, André Kaup, Nassir Navab, Gustavo Carneiro, Vasileios Belagiannis

Main category: cs.CV

TL;DR: Monocular depth estimation using visual autoregressive (VAR) priors as an alternative to diffusion models, achieving competitive results with minimal fine-tuning data.

Details

Motivation: To provide an alternative to diffusion-based approaches for monocular depth estimation by leveraging autoregressive priors, which offer advantages in data scalability and adaptability to 3D vision tasks.

Method: Adapts a large-scale text-to-image VAR model, introduces scale-wise conditional upsampling with classifier-free guidance, performs inference in ten fixed autoregressive stages, and requires only 74K synthetic samples for fine-tuning.

Result: Achieves state-of-the-art performance in indoor benchmarks under constrained training conditions, and strong performance on outdoor datasets, establishing autoregressive priors as complementary geometry-aware generative models.

Conclusion: Autoregressive priors represent a viable alternative to diffusion models for depth estimation, offering benefits in data efficiency and adaptability to 3D vision applications.

Abstract: We propose a monocular depth estimation method based on visual autoregressive (VAR) priors, offering an alternative to diffusion-based approaches. Our method adapts a large-scale text-to-image VAR model and introduces a scale-wise conditional upsampling mechanism with classifier-free guidance. Our approach performs inference in ten fixed autoregressive stages, requiring only 74K synthetic samples for fine-tuning, and achieves competitive results. We report state-of-the-art performance in indoor benchmarks under constrained training conditions, and strong performance when applied to outdoor datasets. This work establishes autoregressive priors as a complementary family of geometry-aware generative models for depth estimation, highlighting advantages in data scalability, and adaptability to 3D vision tasks. Code available at “https://github.com/AmirMaEl/VAR-Depth".

[185] Investigating Deep Learning Models for Ejection Fraction Estimation from Echocardiography Videos

Shravan Saranyan, Pramit Saha

Main category: cs.CV

TL;DR: Deep learning models for automated LVEF estimation from echocardiography videos achieve RMSE of 6.79% with 3D Inception architectures performing best, though overfitting remains a challenge.

Details

Motivation: Manual assessment of LVEF from echocardiograms is time-consuming and has high inter-observer variability, creating a need for automated deep learning solutions that can match expert performance.

Method: Systematic evaluation of multiple deep learning architectures (3D Inception, two-stream, CNN-RNN models) on EchoNet-Dynamic dataset (10,030 videos), with architectural modifications and fusion strategies to optimize LVEF prediction accuracy.

Result: Modified 3D Inception architectures achieved best performance with RMSE of 6.79%. Smaller/simpler models showed better generalization due to overfitting tendencies. Performance was highly sensitive to hyperparameters like kernel sizes and normalization strategies.

Conclusion: Deep learning can effectively automate LVEF estimation from echocardiography, with architectural insights applicable to broader video analysis tasks, though careful design is needed to address overfitting and hyperparameter sensitivity.

Abstract: Left ventricular ejection fraction (LVEF) is a key indicator of cardiac function and plays a central role in the diagnosis and management of cardiovascular disease. Echocardiography, as a readily accessible and non-invasive imaging modality, is widely used in clinical practice to estimate LVEF. However, manual assessment of cardiac function from echocardiograms is time-consuming and subject to considerable inter-observer variability. Deep learning approaches offer a promising alternative, with the potential to achieve performance comparable to that of experienced human experts. In this study, we investigate the effectiveness of several deep learning architectures for LVEF estimation from echocardiography videos, including 3D Inception, two-stream, and CNN-RNN models. We systematically evaluate architectural modifications and fusion strategies to identify configurations that maximize prediction accuracy. Models were trained and evaluated on the EchoNet-Dynamic dataset, comprising 10,030 echocardiogram videos. Our results demonstrate that modified 3D Inception architectures achieve the best overall performance, with a root mean squared error (RMSE) of 6.79%. Across architectures, we observe a tendency toward overfitting, with smaller and simpler models generally exhibiting improved generalization. Model performance was also found to be highly sensitive to hyperparameter choices, particularly convolutional kernel sizes and normalization strategies. While this study focuses on echocardiography-based LVEF estimation, the insights gained regarding architectural design and training strategies may be applicable to a broader range of medical and non-medical video analysis tasks.

[186] Unleashing Foundation Vision Models: Adaptive Transfer for Diverse Data-Limited Scientific Domains

Qiankun Li, Feng He, Huabao Chen, Xin Ning, Kun Wang, Zengfu Wang

Main category: cs.CV

TL;DR: CLAdapter is a novel Cluster Attention Adapter that adapts rich pre-trained vision model representations to data-limited downstream scientific tasks through attention mechanisms and cluster centers.

Details

Motivation: While large-scale datasets have enabled powerful pre-trained vision models, many specialized scientific domains face data scarcity challenges that hinder effective transfer learning from these foundation models to downstream tasks.

Method: CLAdapter introduces attention mechanisms and cluster centers to personalize feature enhancement through distribution correlation and transformation matrices, enabling models to learn distinct representations for different feature sets. It features a unified interface compatible with both CNNs and Transformers in 2D and 3D contexts.

Result: Extensive experiments on 10 datasets across diverse domains (generic, biological, medical, industrial, agricultural, environmental, geographical, materials science, OOD, and 3D) show CLAdapter achieves state-of-the-art performance in data-limited scientific domains.

Conclusion: CLAdapter effectively unleashes the potential of foundation vision models through adaptive transfer, demonstrating superior performance across diverse data-limited scientific applications with a unified architecture.

Abstract: In the big data era, the computer vision field benefits from large-scale datasets such as LAION-2B, LAION-400M, and ImageNet-21K, Kinetics, on which popular models like the ViT and ConvNeXt series have been pre-trained, acquiring substantial knowledge. However, numerous downstream tasks in specialized and data-limited scientific domains continue to pose significant challenges. In this paper, we propose a novel Cluster Attention Adapter (CLAdapter), which refines and adapts the rich representations learned from large-scale data to various data-limited downstream tasks. Specifically, CLAdapter introduces attention mechanisms and cluster centers to personalize the enhancement of transformed features through distribution correlation and transformation matrices. This enables models fine-tuned with CLAdapter to learn distinct representations tailored to different feature sets, facilitating the models’ adaptation from rich pre-trained features to various downstream scenarios effectively. In addition, CLAdapter’s unified interface design allows for seamless integration with multiple model architectures, including CNNs and Transformers, in both 2D and 3D contexts. Through extensive experiments on 10 datasets spanning domains such as generic, multimedia, biological, medical, industrial, agricultural, environmental, geographical, materials science, out-of-distribution (OOD), and 3D analysis, CLAdapter achieves state-of-the-art performance across diverse data-limited scientific domains, demonstrating its effectiveness in unleashing the potential of foundation vision models via adaptive transfer. Code is available at https://github.com/qklee-lz/CLAdapter.

[187] INTERACT-CMIL: Multi-Task Shared Learning and Inter-Task Consistency for Conjunctival Melanocytic Intraepithelial Lesion Grading

Mert Ikinci, Luna Toma, Karin U. Loeffler, Leticia Ussem, Daniela Süsskind, Julia M. Weller, Yousef Yeganeh, Martina C. Herwig-Carl, Shadi Albarqouni

Main category: cs.CV

TL;DR: INTERACT-CMIL is a multi-head deep learning framework that jointly predicts five histopathological axes for Conjunctival Melanocytic Intraepithelial Lesions, achieving significant improvements over baseline models through shared feature learning and cross-task consistency enforcement.

Details

Motivation: Accurate grading of CMIL is crucial for treatment and melanoma prediction but remains challenging due to subtle morphological cues and interrelated diagnostic criteria, highlighting the need for standardized computational approaches.

Method: A multi-head deep learning framework with Shared Feature Learning, Combinatorial Partial Supervision, and Inter-Dependence Loss to enforce cross-task consistency when predicting five histopathological axes: WHO4, WHO5, horizontal spread, vertical spread, and cytologic atypia.

Result: Trained on 486 expert-annotated conjunctival biopsy patches from three university hospitals, INTERACT-CMIL achieves relative macro F1 gains up to 55.1% (WHO4) and 25.0% (vertical spread) over CNN and foundation-model baselines.

Conclusion: The framework provides coherent, interpretable multi-criteria predictions aligned with expert grading, offering a reproducible computational benchmark for CMIL diagnosis and advancing toward standardized digital ocular pathology.

Abstract: Accurate grading of Conjunctival Melanocytic Intraepithelial Lesions (CMIL) is essential for treatment and melanoma prediction but remains difficult due to subtle morphological cues and interrelated diagnostic criteria. We introduce INTERACT-CMIL, a multi-head deep learning framework that jointly predicts five histopathological axes; WHO4, WHO5, horizontal spread, vertical spread, and cytologic atypia, through Shared Feature Learning with Combinatorial Partial Supervision and an Inter-Dependence Loss enforcing cross-task consistency. Trained and evaluated on a newly curated, multi-center dataset of 486 expert-annotated conjunctival biopsy patches from three university hospitals, INTERACT-CMIL achieves consistent improvements over CNN and foundation-model (FM) baselines, with relative macro F1 gains up to 55.1% (WHO4) and 25.0% (vertical spread). The framework provides coherent, interpretable multi-criteria predictions aligned with expert grading, offering a reproducible computational benchmark for CMIL diagnosis and a step toward standardized digital ocular pathology.

[188] CritiFusion: Semantic Critique and Spectral Alignment for Faithful Text-to-Image Generation

ZhenQi Chen, TsaiChing Ni, YuanFu Yang

Main category: cs.CV

TL;DR: CritiFusion is an inference-time framework that improves text-to-image diffusion models by combining multimodal semantic critique with frequency-domain refinement, enhancing prompt alignment and visual quality without additional training.

Details

Motivation: Current text-to-image diffusion models achieve high visual fidelity but often fail to maintain semantic alignment with complex prompts, leading to generated images that don't accurately reflect the intended content described in the text.

Method: CritiFusion introduces two key components: 1) CritiCore module that uses vision-language and multiple large language models to provide semantic feedback and enrich prompt context, guiding the diffusion process for better alignment; 2) SpecFusion that merges intermediate generation states in the spectral domain to inject coarse structural information while preserving high-frequency details. The framework operates at inference time without requiring additional model training.

Result: Experiments show CritiFusion significantly improves human-aligned metrics for text-to-image correspondence and visual quality. It consistently boosts human preference scores and aesthetic evaluations, achieving performance comparable to state-of-the-art reward optimization approaches. Qualitative results demonstrate superior detail, realism, and prompt fidelity.

Conclusion: CritiFusion effectively addresses semantic alignment issues in text-to-image generation through its novel combination of semantic critique and spectral alignment strategies, serving as a plug-in refinement stage compatible with existing diffusion backbones.

Abstract: Recent text-to-image diffusion models have achieved remarkable visual fidelity but often struggle with semantic alignment to complex prompts. We introduce CritiFusion, a novel inference-time framework that integrates a multimodal semantic critique mechanism with frequency-domain refinement to improve text-to-image consistency and detail. The proposed CritiCore module leverages a vision-language model and multiple large language models to enrich the prompt context and produce high-level semantic feedback, guiding the diffusion process to better align generated content with the prompt’s intent. Additionally, SpecFusion merges intermediate generation states in the spectral domain, injecting coarse structural information while preserving high-frequency details. No additional model training is required. CritiFusion serves as a plug-in refinement stage compatible with existing diffusion backbones. Experiments on standard benchmarks show that our method notably improves human-aligned metrics of text-to-image correspondence and visual quality. CritiFusion consistently boosts performance on human preference scores and aesthetic evaluations, achieving results on par with state-of-the-art reward optimization approaches. Qualitative results further demonstrate superior detail, realism, and prompt fidelity, indicating the effectiveness of our semantic critique and spectral alignment strategy.

[189] Autoregressive Flow Matching for Motion Prediction

Johnathan Xie, Stefan Stojanov, Cristobal Eyzaguirre, Daniel L. K. Yamins, Jiajun Wu

Main category: cs.CV

TL;DR: ARFM (autoregressive flow matching) is a new probabilistic method for sequential continuous data that predicts future point track locations from diverse video datasets, improving downstream human and robot motion prediction tasks.

Details

Motivation: Existing motion prediction models are trained on narrow distributions, while scaled video prediction models struggle with complex motions despite visual realism. There's a need for models that can accurately predict complex motions across diverse contexts.

Method: Developed autoregressive flow matching (ARFM), a probabilistic modeling approach for sequential continuous data, trained on diverse video datasets to generate future point track locations over long horizons.

Result: ARFM can predict complex motions effectively. Conditioning robot action prediction and human motion prediction on predicted future tracks significantly improves downstream task performance.

Conclusion: ARFM provides an effective approach for motion prediction that bridges the gap between narrow-distribution models and scaled video generation, with demonstrated improvements in both human and robot motion prediction tasks.

Abstract: Motion prediction has been studied in different contexts with models trained on narrow distributions and applied to downstream tasks in human motion prediction and robotics. Simultaneously, recent efforts in scaling video prediction have demonstrated impressive visual realism, yet they struggle to accurately model complex motions despite massive scale. Inspired by the scaling of video generation, we develop autoregressive flow matching (ARFM), a new method for probabilistic modeling of sequential continuous data and train it on diverse video datasets to generate future point track locations over long horizons. To evaluate our model, we develop benchmarks for evaluating the ability of motion prediction models to predict human and robot motion. Our model is able to predict complex motions, and we demonstrate that conditioning robot action prediction and human motion prediction on predicted future tracks can significantly improve downstream task performance. Code and models publicly available at: https://github.com/Johnathan-Xie/arfm-motion-prediction.

[190] Multimodal Diffeomorphic Registration with Neural ODEs and Structural Descriptors

Salvador Rodriguez-Sanz, Monica Hernandez

Main category: cs.CV

TL;DR: A multimodal diffeomorphic registration method using Neural ODEs that integrates structural descriptors and local mutual information for accurate, efficient registration across different imaging modalities without requiring extensive training data.

Details

Motivation: Existing nonrigid registration methods face tradeoffs between accuracy, computational complexity, and regularization. They also rely on intensity correlation, limiting them to monomodal settings. Learning-based models require extensive training data and suffer performance degradation on unseen modalities.

Method: Proposes an instance-specific framework using Neural ODEs with structural descriptors (modality-agnostic metric models exploiting self-similarities on parameterized neighborhood geometries). Three variants integrate image-based or feature-based structural descriptors with nonstructural image similarities computed by local mutual information.

Result: Surpassing qualitative and quantitative results compared to state-of-the-art baselines for both large and small deformations in multimodal registration. Demonstrates robustness to varying regularization levels, suitability for registration at varying scales, and efficiency compared to other large-deformation registration methods.

Conclusion: The proposed Neural ODE-based multimodal diffeomorphic registration framework effectively addresses limitations of existing methods by combining structural descriptors with local mutual information, achieving accurate registration across modalities without extensive training requirements.

Abstract: This work proposes a multimodal diffeomorphic registration method using Neural Ordinary Differential Equations (Neural ODEs). Nonrigid registration algorithms exhibit tradeoffs between their accuracy, the computational complexity of their deformation model, and its proper regularization. In addition, they also assume intensity correlation in anatomically homologous regions of interest among image pairs, limiting their applicability to the monomodal setting. Unlike learning-based models, we propose an instance-specific framework that is not subject to high scan requirements for training and does not suffer performance degradation at inference time on modalities unseen during training. Our method exploits the potential of continuous-depth networks in the Neural ODE paradigm with structural descriptors, widely adopted as modality-agnostic metric models which exploit self-similarities on parameterized neighborhood geometries. We propose three different variants that integrate image-based or feature-based structural descriptors and nonstructural image similarities computed by local mutual information. We conduct extensive evaluations on different experiments formed by scan dataset combinations and show surpassing qualitative and quantitative results compared to state-of-the-art baselines adequate for large or small deformations, and specific of multimodal registration. Lastly, we also demonstrate the underlying robustness of the proposed framework to varying levels of explicit regularization while maintaining low error, its suitability for registration at varying scales, and its efficiency with respect to other methods targeted to large-deformation registration.

[191] SCPainter: A Unified Framework for Realistic 3D Asset Insertion and Novel View Synthesis

Paul Dobre, Jackson Cooper, Xin Wang, Hongzhou Yang

Main category: cs.CV

TL;DR: SCPainter is a unified framework that combines 3D Gaussian Splat car assets with diffusion-based generation to enable realistic 3D asset insertion and novel view synthesis for autonomous driving simulation.

Details

Motivation: Autonomous driving needs diverse training data covering long-tailed scenarios. Current methods treat 3D asset insertion and novel view synthesis separately, lacking realistic integration and interaction capabilities needed for comprehensive simulation.

Method: Unified framework integrating 3D Gaussian Splat car asset representations with 3D scene point clouds and diffusion-based generation. Projects 3D GS assets and point clouds into novel views, then uses these projections to condition a diffusion model for high-quality image generation.

Result: Evaluation on Waymo Open Dataset demonstrates the framework’s capability to enable realistic 3D asset insertion and novel view synthesis, facilitating creation of diverse and realistic driving data.

Conclusion: SCPainter successfully unifies 3D asset insertion and novel view synthesis, enabling more realistic and diverse driving scenario generation for autonomous vehicle training.

Abstract: 3D Asset insertion and novel view synthesis (NVS) are key components for autonomous driving simulation, enhancing the diversity of training data. With better training data that is diverse and covers a wide range of situations, including long-tailed driving scenarios, autonomous driving models can become more robust and safer. This motivates a unified simulation framework that can jointly handle realistic integration of inserted 3D assets and NVS. Recent 3D asset reconstruction methods enable reconstruction of dynamic actors from video, supporting their re-insertion into simulated driving scenes. While the overall structure and appearance can be accurate, it still struggles to capture the realism of 3D assets through lighting or shadows, particularly when inserted into scenes. In parallel, recent advances in NVS methods have demonstrated promising results in synthesizing viewpoints beyond the originally recorded trajectories. However, existing approaches largely treat asset insertion and NVS capabilities in isolation. To allow for interaction with the rest of the scene and to enable more diverse creation of new scenarios for training, realistic 3D asset insertion should be combined with NVS. To address this, we present SCPainter (Street Car Painter), a unified framework which integrates 3D Gaussian Splat (GS) car asset representations and 3D scene point clouds with diffusion-based generation to jointly enable realistic 3D asset insertion and NVS. The 3D GS assets and 3D scene point clouds are projected together into novel views, and these projections are used to condition a diffusion model to generate high quality images. Evaluation on the Waymo Open Dataset demonstrate the capability of our framework to enable 3D asset insertion and NVS, facilitating the creation of diverse and realistic driving data.

[192] Split4D: Decomposed 4D Scene Reconstruction Without Video Segmentation

Yongzhen Hu, Yihui Yang, Haotong Lin, Yifan Wang, Junting Dong, Yifu Deng, Xinyu Zhu, Fan Jia, Hujun Bao, Xiaowei Zhou, Sida Peng

Main category: cs.CV

TL;DR: Freetime FeatureGS enables decomposed 4D scene reconstruction from multi-view videos using Gaussian primitives with learnable features and temporal motion, eliminating reliance on unstable video segmentation by learning from per-image segmentation maps through contrastive loss and streaming feature propagation.

Details

Motivation: Current methods for decomposed 4D scene reconstruction rely heavily on video segmentation results, which are often unstable and lead to unreliable reconstruction. The paper aims to overcome this limitation by developing a method that doesn't require video segmentation.

Method: Proposes Freetime FeatureGS representing dynamic scenes as Gaussian primitives with learnable features and linear motion ability. Uses contrastive loss to align primitive features with 2D segmentation maps, and implements streaming feature learning with temporally ordered sampling for temporal propagation.

Result: Experimental results on several datasets show that the reconstruction quality significantly outperforms recent methods by a large margin.

Conclusion: The method successfully achieves decomposed 4D scene reconstruction without relying on video segmentation, using Gaussian primitives with temporal motion and streaming feature learning to overcome limitations of previous approaches.

Abstract: This paper addresses the problem of decomposed 4D scene reconstruction from multi-view videos. Recent methods achieve this by lifting video segmentation results to a 4D representation through differentiable rendering techniques. Therefore, they heavily rely on the quality of video segmentation maps, which are often unstable, leading to unreliable reconstruction results. To overcome this challenge, our key idea is to represent the decomposed 4D scene with the Freetime FeatureGS and design a streaming feature learning strategy to accurately recover it from per-image segmentation maps, eliminating the need for video segmentation. Freetime FeatureGS models the dynamic scene as a set of Gaussian primitives with learnable features and linear motion ability, allowing them to move to neighboring regions over time. We apply a contrastive loss to Freetime FeatureGS, forcing primitive features to be close or far apart based on whether their projections belong to the same instance in the 2D segmentation map. As our Gaussian primitives can move across time, it naturally extends the feature learning to the temporal dimension, achieving 4D segmentation. Furthermore, we sample observations for training in a temporally ordered manner, enabling the streaming propagation of features over time and effectively avoiding local minima during the optimization process. Experimental results on several datasets show that the reconstruction quality of our method outperforms recent methods by a large margin.

[193] TrimTokenator-LC: Towards Adaptive Visual Token Pruning for Large Multimodal Models with Long Contexts

Hao Zhang, Mengsi Lyu, Bo Huang, Yulong Ao, Yonghua Lin

Main category: cs.CV

TL;DR: The paper introduces an adaptive visual token pruning method for Large Multimodal Models to reduce inference costs in long context, multi-image scenarios by dynamically allocating token budgets based on intra-image diversity and inter-image variation.

Details

Motivation: Large Multimodal Models (LMMs) face high inference costs due to growing numbers of visual tokens, especially in long context settings with multiple images. Existing visual token pruning methods often overlook these multi-image scenarios.

Method: Two-stage adaptive pruning: 1) Intra-image stage allocates content-aware token budgets per image and greedily selects representative tokens; 2) Inter-image stage performs global diversity filtering and Pareto selection balancing diversity with text alignment.

Result: Extensive experiments show the approach maintains strong performance in long context settings while significantly reducing the number of visual tokens, effectively cutting down inference costs.

Conclusion: The proposed adaptive visual token pruning method successfully addresses the challenges of multi-image, long context scenarios in LMMs by decomposing redundancy into intra-image and inter-image components and implementing dynamic budget allocation.

Abstract: Large Multimodal Models (LMMs) have proven effective on various tasks. They typically encode visual inputs into Original Model sequences of tokens, which are then concatenated with textual tokens and jointly processed by the language model. However, the growing number of visual tokens greatly increases inference cost. Visual token pruning has emerged as a promising solution. However, existing methods often overlook scenarios involving long context inputs with multiple images. In this paper, we analyze the challenges of visual token pruning in long context, multi-image settings and introduce an adaptive pruning method tailored for such scenarios. We decompose redundancy into intra-image and inter-image components and quantify them through intra-image diversity and inter-image variation, which jointly guide dynamic budget allocation. Our approach consists of two stages. The intra-image stage allocates each image a content-aware token budget and greedily selects its most representative tokens. The inter-image stage performs global diversity filtering to form a candidate pool and then applies a Pareto selection procedure that balances diversity with text alignment. Extensive experiments show that our approach maintains strong performance in long context settings while significantly cutting down the number of visual tokens.

[194] Neighbor-Aware Token Reduction via Hilbert Curve for Vision Transformers

Yunge Li, Lanyu Xu

Main category: cs.CV

TL;DR: ViT token reduction using Hilbert curve reordering to preserve spatial neighbor relationships via NAP pruning and MAT merging for better accuracy-efficiency trade-off.

Details

Motivation: Vision Transformers have computational inefficiency due to redundant token representations, and existing token reduction methods fail to preserve spatial continuity and neighbor relationships, losing important local context.

Method: Proposes neighbor-aware token reduction using Hilbert curve reordering to preserve 2D spatial neighbor structure in 1D sequential representations. Two key strategies: Neighbor-Aware Pruning (NAP) for selective token retention and Merging by Adjacent Token similarity (MAT) for local token aggregation.

Result: Achieves state-of-the-art accuracy-efficiency trade-offs compared to existing token reduction methods, demonstrating superior performance.

Conclusion: Highlights the importance of spatial continuity and neighbor structure preservation in token reduction, offering new architectural optimization insights for Vision Transformers.

Abstract: Vision Transformers (ViTs) have achieved remarkable success in visual recognition tasks, but redundant token representations limit their computational efficiency. Existing token merging and pruning strategies often overlook spatial continuity and neighbor relationships, resulting in the loss of local context. This paper proposes novel neighbor-aware token reduction methods based on Hilbert curve reordering, which explicitly preserves the neighbor structure in a 2D space using 1D sequential representations. Our method introduces two key strategies: Neighbor-Aware Pruning (NAP) for selective token retention and Merging by Adjacent Token similarity (MAT) for local token aggregation. Experiments demonstrate that our approach achieves state-of-the-art accuracy-efficiency trade-offs compared to existing methods. This work highlights the importance of spatial continuity and neighbor structure, offering new insights for the architectural optimization of ViTs.

[195] Next Best View Selections for Semantic and Dynamic 3D Gaussian Splatting

Yiqian Li, Wen Jiang, Kostas Daniilidis

Main category: cs.CV

TL;DR: Active learning approach for view selection in embodied agents using Fisher Information to prioritize informative frames for both semantic reasoning and dynamic scene modeling.

Details

Motivation: Semantics and dynamics understanding is crucial for embodied agents, but these tasks have significant data redundancy compared to static scene understanding. Current methods lack principled approaches for selecting informative views from multi-camera setups.

Method: Formulates view selection as an active learning problem, using Fisher Information to quantify informativeness of candidate views with respect to semantic Gaussian parameters and deformation networks. This jointly handles semantic reasoning and dynamic scene modeling.

Result: Method consistently improves rendering quality and semantic segmentation performance on large-scale static images and dynamic video datasets, outperforming baseline methods based on random selection and uncertainty-based heuristics.

Conclusion: The proposed active learning approach with Fisher Information provides a principled alternative to heuristic strategies for view selection, effectively addressing both semantic and dynamic scene understanding needs for embodied agents.

Abstract: Understanding semantics and dynamics has been crucial for embodied agents in various tasks. Both tasks have much more data redundancy than the static scene understanding task. We formulate the view selection problem as an active learning problem, where the goal is to prioritize frames that provide the greatest information gain for model training. To this end, we propose an active learning algorithm with Fisher Information that quantifies the informativeness of candidate views with respect to both semantic Gaussian parameters and deformation networks. This formulation allows our method to jointly handle semantic reasoning and dynamic scene modeling, providing a principled alternative to heuristic or random strategies. We evaluate our method on large-scale static images and dynamic video datasets by selecting informative frames from multi-camera setups. Experimental results demonstrate that our approach consistently improves rendering quality and semantic segmentation performance, outperforming baseline methods based on random selection and uncertainty-based heuristics.

[196] Parallel Diffusion Solver via Residual Dirichlet Policy Optimization

Ruoyu Wang, Ziyu Li, Beier Zhu, Liangyu Yuan, Hanwang Zhang, Xun Yang, Xiaojun Chang, Chi Zhang

Main category: cs.CV

TL;DR: EPD-Solver: A novel ODE solver for diffusion models that uses parallel gradient evaluations to reduce truncation errors while maintaining low latency through full parallelization.

Details

Motivation: Diffusion models have high sampling latency due to sequential denoising, and existing acceleration methods suffer from image quality degradation under low-latency budgets due to accumulated truncation errors from high-curvature trajectory segments.

Method: EPD-Solver incorporates multiple parallel gradient evaluations per step using Mean Value Theorem for vector-valued functions, with a two-stage optimization: 1) distillation-based parameter learning, and 2) parameter-efficient RL fine-tuning that treats the solver as a stochastic Dirichlet policy operating only in low-dimensional solver space.

Result: The method reduces truncation errors while preserving low-latency sampling through full parallelization of gradient computations, and can serve as a plugin (EPD-Plugin) to improve existing ODE samplers.

Conclusion: EPD-Solver effectively mitigates quality degradation in accelerated diffusion model sampling by leveraging geometric insights about sampling trajectories and maintaining computational efficiency through parallelization.

Abstract: Diffusion models (DMs) have achieved state-of-the-art generative performance but suffer from high sampling latency due to their sequential denoising nature. Existing solver-based acceleration methods often face significant image quality degradation under a low-latency budget, primarily due to accumulated truncation errors arising from the inability to capture high-curvature trajectory segments. In this paper, we propose the Ensemble Parallel Direction solver (dubbed as EPD-Solver), a novel ODE solver that mitigates these errors by incorporating multiple parallel gradient evaluations in each step. Motivated by the geometric insight that sampling trajectories are largely confined to a low-dimensional manifold, EPD-Solver leverages the Mean Value Theorem for vector-valued functions to approximate the integral solution more accurately. Importantly, since the additional gradient computations are independent, they can be fully parallelized, preserving low-latency sampling nature. We introduce a two-stage optimization framework. Initially, EPD-Solver optimizes a small set of learnable parameters via a distillation-based approach. We further propose a parameter-efficient Reinforcement Learning (RL) fine-tuning scheme that reformulates the solver as a stochastic Dirichlet policy. Unlike traditional methods that fine-tune the massive backbone, our RL approach operates strictly within the low-dimensional solver space, effectively mitigating reward hacking while enhancing performance in complex text-to-image (T2I) generation tasks. In addition, our method is flexible and can serve as a plugin (EPD-Plugin) to improve existing ODE samplers.

[197] VPTracker: Global Vision-Language Tracking via Visual Prompt and MLLM

Jingchao Wang, Kaiwen Zhou, Zhijian Wu, Kunhua Ji, Dingjiang Huang, Yefeng Zheng

Main category: cs.CV

TL;DR: VPTracker: First global tracking framework using Multimodal LLMs for vision-language tracking, addressing local search limitations with location-aware visual prompting.

Details

Motivation: Existing vision-language tracking methods are limited to local search, making them prone to failures under viewpoint changes, occlusions, and rapid target movements. There's a need for more robust tracking that can locate targets across the entire image space.

Method: Proposes VPTracker, a global tracking framework using Multimodal Large Language Models with location-aware visual prompting. Uses region-level prompts based on target’s previous location to prioritize region-level recognition and resort to global inference only when necessary.

Result: Extensive experiments show significant enhancement in tracking stability and target disambiguation under challenging scenarios. The approach effectively suppresses interference from distracting visual content while retaining global tracking advantages.

Conclusion: VPTracker opens a new avenue for integrating MLLMs into visual tracking by combining global search robustness with location-aware prompting to handle distractions, demonstrating improved performance in challenging tracking scenarios.

Abstract: Vision-Language Tracking aims to continuously localize objects described by a visual template and a language description. Existing methods, however, are typically limited to local search, making them prone to failures under viewpoint changes, occlusions, and rapid target movements. In this work, we introduce the first global tracking framework based on Multimodal Large Language Models (VPTracker), exploiting their powerful semantic reasoning to locate targets across the entire image space. While global search improves robustness and reduces drift, it also introduces distractions from visually or semantically similar objects. To address this, we propose a location-aware visual prompting mechanism that incorporates spatial priors into the MLLM. Specifically, we construct a region-level prompt based on the target’s previous location, enabling the model to prioritize region-level recognition and resort to global inference only when necessary. This design retains the advantages of global tracking while effectively suppressing interference from distracting visual content. Extensive experiments show that our approach significantly enhances tracking stability and target disambiguation under challenging scenarios, opening a new avenue for integrating MLLMs into visual tracking. Code is available at https://github.com/jcwang0602/VPTracker.

[198] Medical Scene Reconstruction and Segmentation based on 3D Gaussian Representation

Bin Liu, Wenyan Tian, Huangxin Fu, Zizheng Li, Zhifen He, Bo Li

Main category: cs.CV

TL;DR: Proposed efficient 3D reconstruction method using 3D Gaussian and tri-plane representations for medical images, improving structural continuity and semantic consistency in sparse slice conditions while enhancing reconstruction efficiency.

Details

Motivation: Traditional 3D reconstruction methods for medical images are computationally expensive and suffer from structural discontinuities and detail loss in sparse slices, failing to meet clinical accuracy requirements.

Method: Combines 3D Gaussian representation with tri-plane representations to maintain efficient rendering and geometric representation advantages while enhancing structural continuity and semantic consistency under sparse slicing conditions.

Result: Experimental results on multimodal medical datasets (US and MRI) show the method generates high-quality, anatomically coherent, and semantically stable medical images under sparse data conditions while significantly improving reconstruction efficiency.

Conclusion: Provides an efficient and reliable new approach for 3D visualization and clinical analysis of medical images, addressing limitations of traditional methods in sparse slice reconstruction.

Abstract: 3D reconstruction of medical images is a key technology in medical image analysis and clinical diagnosis, providing structural visualization support for disease assessment and surgical planning. Traditional methods are computationally expensive and prone to structural discontinuities and loss of detail in sparse slices, making it difficult to meet clinical accuracy requirements.To address these challenges, we propose an efficient 3D reconstruction method based on 3D Gaussian and tri-plane representations. This method not only maintains the advantages of Gaussian representation in efficient rendering and geometric representation but also significantly enhances structural continuity and semantic consistency under sparse slicing conditions. Experimental results on multimodal medical datasets such as US and MRI show that our proposed method can generate high-quality, anatomically coherent, and semantically stable medical images under sparse data conditions, while significantly improving reconstruction efficiency. This provides an efficient and reliable new approach for 3D visualization and clinical analysis of medical images.

[199] Evaluating the Performance of Open-Vocabulary Object Detection in Low-quality Image

Po-Chih Wu

Main category: cs.CV

TL;DR: Researchers evaluate open-vocabulary object detection models on low-quality images using a new dataset, finding OWLv2 performs best while other models decline significantly under high degradation.

Details

Motivation: To assess how open-vocabulary object detection models perform under real-world low-quality image conditions, which is important for practical applications where image quality varies.

Method: Created a new dataset simulating real-world low-quality images and evaluated multiple open-vocabulary detection models (OWLv2, OWL-ViT, GroundingDINO, Detic) under different degradation levels.

Result: Models showed no significant mAP decrease under low-level degradation, but all models dropped sharply under high-level degradation. OWLv2 performed consistently better across degradation types, while OWL-ViT, GroundingDINO, and Detic showed significant performance declines.

Conclusion: Open-vocabulary detection models are vulnerable to high-level image degradation, with OWLv2 being most robust. The new dataset will be released to support future research on model robustness to image quality variations.

Abstract: Open-vocabulary object detection enables models to localize and recognize objects beyond a predefined set of categories and is expected to achieve recognition capabilities comparable to human performance. In this study, we aim to evaluate the performance of existing models on open-vocabulary object detection tasks under low-quality image conditions. For this purpose, we introduce a new dataset that simulates low-quality images in the real world. In our evaluation experiment, we find that although open-vocabulary object detection models exhibited no significant decrease in mAP scores under low-level image degradation, the performance of all models dropped sharply under high-level image degradation. OWLv2 models consistently performed better across different types of degradation, while OWL-ViT, GroundingDINO, and Detic showed significant performance declines. We will release our dataset and codes to facilitate future studies.

[200] EgoReAct: Egocentric Video-Driven 3D Human Reaction Generation

Libo Zhang, Zekun Li, Tianyu Li, Zeyu Cao, Rui Xu, Xiaoxiao Long, Wenjia Wang, Jingbo Wang, Yuan Liu, Wenping Wang, Daquan Zhou, Taku Komura, Zhiyang Dou

Main category: cs.CV

TL;DR: EgoReAct: First autoregressive framework for generating 3D-aligned human reaction motions from egocentric video in real-time, addressing spatial misalignment and causality challenges.

Details

Motivation: Existing datasets (like ViMo) suffer from spatial inconsistency between egocentric video and reaction motion (e.g., dynamic motions paired with fixed-camera videos), making it challenging to model adaptive, context-sensitive human responses from egocentric video while maintaining strict causality and precise 3D spatial alignment.

Method: 1) Construct Human Reaction Dataset (HRD) to address data scarcity and misalignment with spatially aligned egocentric video-reaction pairs. 2) Develop EgoReAct framework: compress reaction motion into compact latent space via Vector Quantised-VAE, then train Generative Pre-trained Transformer for autoregressive reaction generation from visual input. 3) Incorporate 3D dynamic features (metric depth, head dynamics) to enhance spatial grounding.

Result: EgoReAct achieves remarkably higher realism, spatial consistency, and generation efficiency compared to prior methods while maintaining strict causality during generation. The framework operates in real-time from egocentric video streams.

Conclusion: EgoReAct successfully addresses the dual challenges of causal generation and 3D spatial alignment for human reaction modeling from egocentric video, demonstrating superior performance through the combination of a novel dataset (HRD) and an autoregressive generation framework with 3D dynamic features.

Abstract: Humans exhibit adaptive, context-sensitive responses to egocentric visual input. However, faithfully modeling such reactions from egocentric video remains challenging due to the dual requirements of strictly causal generation and precise 3D spatial alignment. To tackle this problem, we first construct the Human Reaction Dataset (HRD) to address data scarcity and misalignment by building a spatially aligned egocentric video-reaction dataset, as existing datasets (e.g., ViMo) suffer from significant spatial inconsistency between the egocentric video and reaction motion, e.g., dynamically moving motions are always paired with fixed-camera videos. Leveraging HRD, we present EgoReAct, the first autoregressive framework that generates 3D-aligned human reaction motions from egocentric video streams in real-time. We first compress the reaction motion into a compact yet expressive latent space via a Vector Quantised-Variational AutoEncoder and then train a Generative Pre-trained Transformer for reaction generation from the visual input. EgoReAct incorporates 3D dynamic features, i.e., metric depth, and head dynamics during the generation, which effectively enhance spatial grounding. Extensive experiments demonstrate that EgoReAct achieves remarkably higher realism, spatial consistency, and generation efficiency compared with prior methods, while maintaining strict causality during generation. We will release code, models, and data upon acceptance.

[201] Depth Anything in $360^\circ$: Towards Scale Invariance in the Wild

Hualie Jiang, Ziyang Song, Zhiqiang Lou, Rui Xu, Minglang Tan

Main category: cs.CV

TL;DR: DA360 adapts Depth Anything V2 for panoramic depth estimation by learning a shift parameter to transform scale- and shift-invariant outputs into scale-invariant estimates, plus circular padding for seamless 360° depth maps, achieving state-of-the-art zero-shot performance.

Details

Motivation: Panoramic depth estimation has limited zero-shot generalization to open-world domains compared to perspective images, which benefit from abundant training data. There's a need to bridge this gap by transferring capabilities from the perspective domain to panoramic settings.

Method: DA360 adapts Depth Anything V2 for panoramic use by: 1) Learning a shift parameter from the ViT backbone to transform scale- and shift-invariant outputs into scale-invariant estimates that directly yield well-formed 3D point clouds, and 2) Integrating circular padding into the DPT decoder to eliminate seam artifacts and ensure spatially coherent depth maps that respect spherical continuity.

Result: DA360 shows substantial gains over its base model: over 50% relative depth error reduction on indoor benchmarks and over 10% on outdoor benchmarks. It significantly outperforms existing panoramic depth estimation methods, achieving about 30% relative error improvement compared to PanDA across all three test datasets, establishing new state-of-the-art for zero-shot panoramic depth estimation.

Conclusion: DA360 successfully bridges the gap between perspective and panoramic depth estimation by adapting a perspective-based model with innovative techniques for shift parameter learning and circular padding, achieving superior zero-shot generalization performance in both indoor and outdoor panoramic environments.

Abstract: Panoramic depth estimation provides a comprehensive solution for capturing complete $360^\circ$ environmental structural information, offering significant benefits for robotics and AR/VR applications. However, while extensively studied in indoor settings, its zero-shot generalization to open-world domains lags far behind perspective images, which benefit from abundant training data. This disparity makes transferring capabilities from the perspective domain an attractive solution. To bridge this gap, we present Depth Anything in $360^\circ$ (DA360), a panoramic-adapted version of Depth Anything V2. Our key innovation involves learning a shift parameter from the ViT backbone, transforming the model’s scale- and shift-invariant output into a scale-invariant estimate that directly yields well-formed 3D point clouds. This is complemented by integrating circular padding into the DPT decoder to eliminate seam artifacts, ensuring spatially coherent depth maps that respect spherical continuity. Evaluated on standard indoor benchmarks and our newly curated outdoor dataset, Metropolis, DA360 shows substantial gains over its base model, achieving over 50% and 10% relative depth error reduction on indoor and outdoor benchmarks, respectively. Furthermore, DA360 significantly outperforms robust panoramic depth estimation methods, achieving about 30% relative error improvement compared to PanDA across all three test datasets and establishing new state-of-the-art performance for zero-shot panoramic depth estimation.

[202] KANO: Kolmogorov-Arnold Neural Operator for Image Super-Resolution

Chenyu Li, Danfeng Hong, Bing Zhang, Zhaojie Pan, Jocelyn Chanussot

Main category: cs.CV

TL;DR: Proposes KANO, a novel interpretable operator for single-image super-resolution based on Kolmogorov-Arnold theorem, using B-spline functions to transparently model degradation processes.

Details

Motivation: Existing SR methods use black-box networks that make degradation processes unknown and uncontrollable. There's a need for interpretable approaches that provide transparent modeling of complex degradation mechanisms.

Method: Kolmogorov-Arnold Neural Operator (KANO) uses additive structure of finite B-spline functions to approximate spectral curves piecewise. Learns shape parameters of spline functions to capture local linear trends and peak-valley structures at nonlinear inflection points.

Result: KANO accurately captures key spectral characteristics and provides physical interpretability. Comparative study shows advantages of KANs over MLPs in characterizing complex degradation mechanisms for interpretable SR.

Conclusion: KANO offers a transparent, structured representation of latent degradation fitting process, advancing interpretable SR techniques with valuable insights for handling complex sequence fitting tasks.

Abstract: The highly nonlinear degradation process, complex physical interactions, and various sources of uncertainty render single-image Super-resolution (SR) a particularly challenging task. Existing interpretable SR approaches, whether based on prior learning or deep unfolding optimization frameworks, typically rely on black-box deep networks to model latent variables, which leaves the degradation process largely unknown and uncontrollable. Inspired by the Kolmogorov-Arnold theorem (KAT), we for the first time propose a novel interpretable operator, termed Kolmogorov-Arnold Neural Operator (KANO), with the application to image SR. KANO provides a transparent and structured representation of the latent degradation fitting process. Specifically, we employ an additive structure composed of a finite number of B-spline functions to approximate continuous spectral curves in a piecewise fashion. By learning and optimizing the shape parameters of these spline functions within defined intervals, our KANO accurately captures key spectral characteristics, such as local linear trends and the peak-valley structures at nonlinear inflection points, thereby endowing SR results with physical interpretability. Furthermore, through theoretical modeling and experimental evaluations across natural images, aerial photographs, and satellite remote sensing data, we systematically compare multilayer perceptrons (MLPs) and Kolmogorov-Arnold networks (KANs) in handling complex sequence fitting tasks. This comparative study elucidates the respective advantages and limitations of these models in characterizing intricate degradation mechanisms, offering valuable insights for the development of interpretable SR techniques.

[203] 3D Scene Change Modeling With Consistent Multi-View Aggregation

Zirui Zhou, Junfeng Ni, Shujie Zhang, Yixin Chen, Siyuan Huang

Main category: cs.CV

TL;DR: SCaR-3D is a novel 3D scene change detection framework that identifies object-level changes from dense-view pre-change and sparse-view post-change images, using signed-distance-based 2D differencing and 3DGS-based multi-view aggregation to robustly separate pre- and post-change states.

Details

Motivation: Existing 3D change detection methods suffer from spatial inconsistency in detected changes and fail to explicitly separate pre- and post-change states, limiting their effectiveness in scene monitoring, exploration, and continual reconstruction applications.

Method: The approach consists of: 1) signed-distance-based 2D differencing module, 2) multi-view aggregation with voting and pruning leveraging 3DGS consistency, 3) continual scene reconstruction strategy that selectively updates dynamic regions while preserving unchanged areas, and 4) CCS3D synthetic dataset for controlled evaluations.

Result: Extensive experiments show the method achieves both high accuracy and efficiency, outperforming existing methods in 3D change detection.

Conclusion: SCaR-3D effectively addresses limitations of existing 3D change detection methods by providing spatially consistent change detection with explicit separation of pre- and post-change states, enabling robust continual scene reconstruction.

Abstract: Change detection plays a vital role in scene monitoring, exploration, and continual reconstruction. Existing 3D change detection methods often exhibit spatial inconsistency in the detected changes and fail to explicitly separate pre- and post-change states. To address these limitations, we propose SCaR-3D, a novel 3D scene change detection framework that identifies object-level changes from a dense-view pre-change image sequence and sparse-view post-change images. Our approach consists of a signed-distance-based 2D differencing module followed by multi-view aggregation with voting and pruning, leveraging the consistent nature of 3DGS to robustly separate pre- and post-change states. We further develop a continual scene reconstruction strategy that selectively updates dynamic regions while preserving the unchanged areas. We also contribute CCS3D, a challenging synthetic dataset that allows flexible combinations of 3D change types to support controlled evaluations. Extensive experiments demonstrate that our method achieves both high accuracy and efficiency, outperforming existing methods.

[204] A Minimal Solver for Relative Pose Estimation with Unknown Focal Length from Two Affine Correspondences

Zhenbao Yu, Shirong Ye, Ronghe Jin, Shunkun Liang, Zibin Liu, Huiyun Zhang, Banglei Guan

Main category: cs.CV

TL;DR: A new solver estimates 3DOF relative pose and focal length from two affine correspondences when vertical direction is known from IMU measurements.

Details

Motivation: In applications like self-driving cars, smartphones, and UAVs, cameras are often combined with IMUs. The vertical direction from IMU measurements reduces relative pose estimation complexity from 5DOF to 3DOF, enabling more efficient estimation with fewer correspondences.

Method: 1) Establish constraint equations from two affine correspondences with known vertical direction; 2) Derive four equations involving only focal length and relative rotation angle using properties of equation system with nontrivial solutions; 3) Use polynomial eigenvalue method to solve for focal length and relative rotation angle.

Result: The proposed solver outperforms existing state-of-the-art solvers on both synthetic and real-world datasets.

Conclusion: The method successfully leverages IMU-measured vertical direction to reduce degrees of freedom, enabling efficient and accurate relative pose and focal length estimation from just two affine correspondences.

Abstract: In this paper, we aim to estimate the relative pose and focal length between two views with known intrinsic parameters except for an unknown focal length from two affine correspondences (ACs). Cameras are commonly used in combination with inertial measurement units (IMUs) in applications such as self-driving cars, smartphones, and unmanned aerial vehicles. The vertical direction of camera views can be obtained by IMU measurements. The relative pose between two cameras is reduced from 5DOF to 3DOF. We propose a new solver to estimate the 3DOF relative pose and focal length. First, we establish constraint equations from two affine correspondences when the vertical direction is known. Then, based on the properties of the equation system with nontrivial solutions, four equations can be derived. These four equations only involve two parameters: the focal length and the relative rotation angle. Finally, the polynomial eigenvalue method is utilized to solve the problem of focal length and relative rotation angle. The proposed solver is evaluated using synthetic and real-world datasets. The results show that our solver performs better than the existing state-of-the-art solvers.

[205] ByteLoom: Weaving Geometry-Consistent Human-Object Interactions through Progressive Curriculum Learning

Bangya Liu, Xinyu Gong, Zelin Zhao, Ziyang Song, Yulei Lu, Suhui Wu, Jun Zhang, Suman Banerjee, Hao Zhang

Main category: cs.CV

TL;DR: ByteLoom is a Diffusion Transformer framework for generating realistic human-object interaction videos with geometrically consistent objects, using simplified human conditioning and 3D object inputs without heavy reliance on hand mesh annotations.

Details

Motivation: Existing HOI video generation methods have two critical limitations: (1) lack of effective mechanisms to inject multi-view object information, leading to poor cross-view consistency, and (2) heavy reliance on fine-grained hand mesh annotations for modeling interaction occlusions.

Method: ByteLoom uses a Diffusion Transformer framework with an RCM-cache mechanism that leverages Relative Coordinate Maps as a universal representation to maintain object geometry consistency and control 6-DoF object transformations. It also employs a progressive training curriculum to compensate for HOI dataset scarcity and reduce reliance on hand mesh annotations.

Result: Extensive experiments demonstrate that the method faithfully preserves human identity and object’s multi-view geometry while maintaining smooth motion and object manipulation.

Conclusion: ByteLoom addresses key limitations in HOI video generation by providing geometrically consistent object illustration with simplified conditioning, making it suitable for applications in digital humans, e-commerce, advertising, and robotics imitation learning.

Abstract: Human-object interaction (HOI) video generation has garnered increasing attention due to its promising applications in digital humans, e-commerce, advertising, and robotics imitation learning. However, existing methods face two critical limitations: (1) a lack of effective mechanisms to inject multi-view information of the object into the model, leading to poor cross-view consistency, and (2) heavy reliance on fine-grained hand mesh annotations for modeling interaction occlusions. To address these challenges, we introduce ByteLoom, a Diffusion Transformer (DiT)-based framework that generates realistic HOI videos with geometrically consistent object illustration, using simplified human conditioning and 3D object inputs. We first propose an RCM-cache mechanism that leverages Relative Coordinate Maps (RCM) as a universal representation to maintain object’s geometry consistency and precisely control 6-DoF object transformations in the meantime. To compensate HOI dataset scarcity and leverage existing datasets, we further design a training curriculum that enhances model capabilities in a progressive style and relaxes the demand of hand mesh. Extensive experiments demonstrate that our method faithfully preserves human identity and the object’s multi-view geometry, while maintaining smooth motion and object manipulation.

Zhuonan Liu, Xinyu Zhang, Zishuo Wang, Tomohito Kawabata, Xuesu Xiao, Ling Xiao

Main category: cs.CV

TL;DR: MUSON is a multimodal dataset for short-horizon social navigation with structured Chain-of-Thought annotations to address limitations in existing datasets lacking reasoning supervision and having imbalanced action distributions.

Details

Motivation: Existing social navigation datasets lack explicit reasoning supervision and have highly long-tailed action distributions, which limits models' ability to learn safety-critical behaviors for socially compliant navigation.

Method: Created MUSON dataset with multimodal data collected across diverse indoor/outdoor campus scenes, featuring structured five-step Chain-of-Thought annotations (perception, prediction, reasoning, action, explanation) with explicit modeling of static physical constraints and balanced discrete action space.

Result: Qwen2.5-VL-3B achieved highest decision accuracy of 0.8625 on MUSON benchmark, demonstrating the dataset’s effectiveness as a reusable benchmark for socially compliant navigation.

Conclusion: MUSON addresses critical gaps in social navigation datasets by providing structured reasoning supervision and balanced action distributions, serving as an effective benchmark for developing socially compliant navigation models.

Abstract: Socially compliant navigation requires structured reasoning over dynamic pedestrians and physical constraints to ensure safe and interpretable decisions. However, existing social navigation datasets often lack explicit reasoning supervision and exhibit highly long-tailed action distributions, limiting models’ ability to learn safety-critical behaviors. To address these issues, we introduce MUSON, a multimodal dataset for short-horizon social navigation collected across diverse indoor and outdoor campus scenes. MUSON adopts a structured five-step Chain-of-Thought annotation consisting of perception, prediction, reasoning, action, and explanation, with explicit modeling of static physical constraints and a rationally balanced discrete action space. Compared to SNEI, MUSON provides consistent reasoning, action, and explanation. Benchmarking multiple state-of-the-art Small Vision Language Models on MUSON shows that Qwen2.5-VL-3B achieves the highest decision accuracy of 0.8625, demonstrating that MUSON serves as an effective and reusable benchmark for socially compliant navigation. The dataset is publicly available at https://huggingface.co/datasets/MARSLab/MUSON

[207] SwinTF3D: A Lightweight Multimodal Fusion Approach for Text-Guided 3D Medical Image Segmentation

Hasan Faraz Khan, Noor Fatima, Muzammil Behzad

Main category: cs.CV

TL;DR: SwinTF3D is a lightweight multimodal fusion model that combines visual and linguistic representations for text-guided 3D medical image segmentation, achieving competitive performance with low computational overhead.

Details

Motivation: Existing 3D segmentation frameworks rely solely on visual learning from large annotated datasets, limiting adaptability to new domains and clinical tasks. They lack semantic understanding to address flexible, user-defined segmentation objectives.

Method: Proposes SwinTF3D with a transformer-based visual encoder to extract volumetric features, integrated with a compact text encoder via an efficient fusion mechanism. This allows understanding natural-language prompts and aligning semantic cues with spatial structures in medical volumes.

Result: Extensive experiments on BTCV dataset show competitive Dice and IoU scores across multiple organs. The model generalizes well to unseen data and offers significant efficiency gains compared to conventional transformer-based segmentation networks.

Conclusion: SwinTF3D establishes a practical and interpretable paradigm for interactive, text-driven 3D medical image segmentation, opening perspectives for more adaptive and resource-efficient solutions in clinical imaging by bridging visual perception with linguistic understanding.

Abstract: The recent integration of artificial intelligence into medical imaging has driven remarkable advances in automated organ segmentation. However, most existing 3D segmentation frameworks rely exclusively on visual learning from large annotated datasets restricting their adaptability to new domains and clinical tasks. The lack of semantic understanding in these models makes them ineffective in addressing flexible, user-defined segmentation objectives. To overcome these limitations, we propose SwinTF3D, a lightweight multimodal fusion approach that unifies visual and linguistic representations for text-guided 3D medical image segmentation. The model employs a transformer-based visual encoder to extract volumetric features and integrates them with a compact text encoder via an efficient fusion mechanism. This design allows the system to understand natural-language prompts and correctly align semantic cues with their corresponding spatial structures in medical volumes, while producing accurate, context-aware segmentation results with low computational overhead. Extensive experiments on the BTCV dataset demonstrate that SwinTF3D achieves competitive Dice and IoU scores across multiple organs, despite its compact architecture. The model generalizes well to unseen data and offers significant efficiency gains compared to conventional transformer-based segmentation networks. Bridging visual perception with linguistic understanding, SwinTF3D establishes a practical and interpretable paradigm for interactive, text-driven 3D medical image segmentation, opening perspectives for more adaptive and resource-efficient solutions in clinical imaging.

[208] Learning Anatomy from Multiple Perspectives via Self-supervision in Chest Radiographs

Ziyu Zhou, Haozhe Luo, Mohammad Reza Hosseinzadeh Taher, Jiaxuan Pang, Xiaowei Ding, Michael B. Gotway, Jianming Liang

Main category: cs.CV

TL;DR: Lamps is a self-supervised learning method for medical imaging that leverages anatomical consistency, coherence, and hierarchy as supervision signals, outperforming 10 baseline models across 10 datasets.

Details

Motivation: Existing SSL methods in medical imaging overlook the fundamental anatomical structure of human body images, limiting their ability to learn meaningful anatomical features that are essential for medical foundation models.

Method: Lamps pre-trains on large-scale chest radiographs by harmoniously utilizing three anatomical perspectives as supervision: consistency, coherence, and hierarchy of human anatomy.

Result: Extensive experiments across 10 datasets show Lamps’ superior robustness, transferability, and clinical potential compared to 10 baseline models, demonstrating better anatomical alignment.

Conclusion: By learning from multiple anatomical perspectives, Lamps enables foundation models to develop meaningful, robust representations aligned with human anatomy structure, offering unique opportunities for medical imaging.

Abstract: Foundation models have been successful in natural language processing and computer vision because they are capable of capturing the underlying structures (foundation) of natural languages. However, in medical imaging, the key foundation lies in human anatomy, as these images directly represent the internal structures of the body, reflecting the consistency, coherence, and hierarchy of human anatomy. Yet, existing self-supervised learning (SSL) methods often overlook these perspectives, limiting their ability to effectively learn anatomical features. To overcome the limitation, we built Lamps (learning anatomy from multiple perspectives via self-supervision) pre-trained on large-scale chest radiographs by harmoniously utilizing the consistency, coherence, and hierarchy of human anatomy as the supervision signal. Extensive experiments across 10 datasets evaluated through fine-tuning and emergent property analysis demonstrate Lamps’ superior robustness, transferability, and clinical potential when compared to 10 baseline models. By learning from multiple perspectives, Lamps presents a unique opportunity for foundation models to develop meaningful, robust representations that are aligned with the structure of human anatomy.

[209] Let Samples Speak: Mitigating Spurious Correlation by Exploiting the Clusterness of Samples

Weiwei Li, Junzhuo Liu, Yuanyuan Ren, Yuchen Zheng, Yahao Liu, Wen Li

Main category: cs.CV

TL;DR: Proposes a data-oriented pipeline (NSF) to mitigate spurious correlations in deep learning by identifying dispersed spurious features, neutralizing them via grouping, learning feature transformations, and updating classifiers, achieving >20% improvement in worst-group accuracy.

Details

Motivation: Deep learning models often learn spurious features that correlate with class labels but are irrelevant to the prediction task. Existing methods require manual annotation of spurious attributes or rely on empirical assumptions about bias simplicity, which may not capture the intricate nature of real-world spurious correlations.

Method: Four-step pipeline: 1) Identify spurious features by observing dispersed distribution patterns in feature space, 2) Neutralize spurious features using grouping strategy to obtain bias-invariant representations, 3) Learn feature transformations to eliminate spurious features by aligning with bias-invariant representations, 4) Update classifier with learned transformations to obtain unbiased model.

Result: Experiments on image and NLP debiasing benchmarks show improvement in worst group accuracy of more than 20% compared to standard empirical risk minimization (ERM).

Conclusion: The proposed NSF pipeline effectively mitigates spurious correlations without requiring manual annotation of spurious attributes, providing a practical data-oriented solution for debiasing deep learning models across vision and NLP domains.

Abstract: Deep learning models are known to often learn features that spuriously correlate with the class label during training but are irrelevant to the prediction task. Existing methods typically address this issue by annotating potential spurious attributes, or filtering spurious features based on some empirical assumptions (e.g., simplicity of bias). However, these methods may yield unsatisfactory performance due to the intricate and elusive nature of spurious correlations in real-world data. In this paper, we propose a data-oriented approach to mitigate the spurious correlation in deep learning models. We observe that samples that are influenced by spurious features tend to exhibit a dispersed distribution in the learned feature space. This allows us to identify the presence of spurious features. Subsequently, we obtain a bias-invariant representation by neutralizing the spurious features based on a simple grouping strategy. Then, we learn a feature transformation to eliminate the spurious features by aligning with this bias-invariant representation. Finally, we update the classifier by incorporating the learned feature transformation and obtain an unbiased model. By integrating the aforementioned identifying, neutralizing, eliminating and updating procedures, we build an effective pipeline for mitigating spurious correlation. Experiments on image and NLP debiasing benchmarks show an improvement in worst group accuracy of more than 20% compared to standard empirical risk minimization (ERM). Codes and checkpoints are available at https://github.com/davelee-uestc/nsf_debiasing .

[210] M-ErasureBench: A Comprehensive Multimodal Evaluation Benchmark for Concept Erasure in Diffusion Models

Ju-Hsuan Weng, Jia-Wei Liao, Cheng-Fu Chou, Jun-Cheng Chen

Main category: cs.CV

TL;DR: M-ErasureBench is a multimodal evaluation framework for concept erasure methods, showing existing approaches fail against learned embeddings and inverted latents. IRECE is proposed as a plug-and-play module that enhances robustness by localizing target concepts and perturbing latents during denoising.

Details

Motivation: Existing concept erasure methods focus only on text prompts, ignoring other input modalities like learned embeddings and inverted latents that are increasingly important in real-world applications. These modalities can serve as attack surfaces where erased concepts re-emerge despite defenses.

Method: Introduces M-ErasureBench, a multimodal evaluation framework that benchmarks concept erasure across three input modalities: text prompts, learned embeddings, and inverted latents (with white-box and black-box access). Proposes IRECE (Inference-time Robustness Enhancement for Concept Erasure), which localizes target concepts via cross-attention and perturbs associated latents during denoising.

Result: Existing methods achieve strong erasure against text prompts but largely fail under learned embeddings and inverted latents, with Concept Reproduction Rate (CRR) exceeding 90% in white-box setting. IRECE reduces CRR by up to 40% under the most challenging white-box latent inversion scenario while preserving visual quality.

Conclusion: M-ErasureBench provides the first comprehensive benchmark of concept erasure beyond text prompts. Together with IRECE, it offers practical safeguards for building more reliable protective generative models by addressing vulnerabilities in multimodal attack surfaces.

Abstract: Text-to-image diffusion models may generate harmful or copyrighted content, motivating research on concept erasure. However, existing approaches primarily focus on erasing concepts from text prompts, overlooking other input modalities that are increasingly critical in real-world applications such as image editing and personalized generation. These modalities can become attack surfaces, where erased concepts re-emerge despite defenses. To bridge this gap, we introduce M-ErasureBench, a novel multimodal evaluation framework that systematically benchmarks concept erasure methods across three input modalities: text prompts, learned embeddings, and inverted latents. For the latter two, we evaluate both white-box and black-box access, yielding five evaluation scenarios. Our analysis shows that existing methods achieve strong erasure performance against text prompts but largely fail under learned embeddings and inverted latents, with Concept Reproduction Rate (CRR) exceeding 90% in the white-box setting. To address these vulnerabilities, we propose IRECE (Inference-time Robustness Enhancement for Concept Erasure), a plug-and-play module that localizes target concepts via cross-attention and perturbs the associated latents during denoising. Experiments demonstrate that IRECE consistently restores robustness, reducing CRR by up to 40% under the most challenging white-box latent inversion scenario, while preserving visual quality. To the best of our knowledge, M-ErasureBench provides the first comprehensive benchmark of concept erasure beyond text prompts. Together with IRECE, our benchmark offers practical safeguards for building more reliable protective generative models.

[211] Guided Path Sampling: Steering Diffusion Models Back on Track with Principled Path Guidance

Haosen Li, Wenshuo Chen, Shaofeng Liang, Lei Wang, Haozhe Jia, Yutao Yue

Main category: cs.CV

TL;DR: GPS (Guided Path Sampling) fixes CFG’s instability in iterative refinement by replacing extrapolation with manifold-constrained interpolation, ensuring bounded error and better image quality.

Details

Motivation: Standard Classifier-Free Guidance (CFG) causes iterative refinement methods to fail by pushing sampling paths off the data manifold, leading to divergent approximation errors that undermine refinement quality.

Method: Proposes Guided Path Sampling (GPS) which replaces CFG’s unstable extrapolation with principled, manifold-constrained interpolation to keep sampling paths on the data manifold. Also includes optimal scheduling that dynamically adjusts guidance strength to align with coarse-to-fine generation.

Result: GPS outperforms existing methods on SDXL and Hunyuan-DiT, achieving ImageReward of 0.79 and HPS v2 of 0.2995 on SDXL, and improving semantic alignment accuracy on GenEval to 57.45%.

Conclusion: Path stability is essential for effective iterative refinement, and GPS provides a robust framework to achieve it by transforming error series from unbounded amplification to strictly bounded.

Abstract: Iterative refinement methods based on a denoising-inversion cycle are powerful tools for enhancing the quality and control of diffusion models. However, their effectiveness is critically limited when combined with standard Classifier-Free Guidance (CFG). We identify a fundamental limitation: CFG’s extrapolative nature systematically pushes the sampling path off the data manifold, causing the approximation error to diverge and undermining the refinement process. To address this, we propose Guided Path Sampling (GPS), a new paradigm for iterative refinement. GPS replaces unstable extrapolation with a principled, manifold-constrained interpolation, ensuring the sampling path remains on the data manifold. We theoretically prove that this correction transforms the error series from unbounded amplification to strictly bounded, guaranteeing stability. Furthermore, we devise an optimal scheduling strategy that dynamically adjusts guidance strength, aligning semantic injection with the model’s natural coarse-to-fine generation process. Extensive experiments on modern backbones like SDXL and Hunyuan-DiT show that GPS outperforms existing methods in both perceptual quality and complex prompt adherence. For instance, GPS achieves a superior ImageReward of 0.79 and HPS v2 of 0.2995 on SDXL, while improving overall semantic alignment accuracy on GenEval to 57.45%. Our work establishes that path stability is a prerequisite for effective iterative refinement, and GPS provides a robust framework to achieve it.

Kai Liu, Jungang Li, Yuchong Sun, Shengqiong Wu, Jianzhang Gao, Daoan Zhang, Wei Zhang, Sheng Jin, Sicheng Yu, Geng Zhan, Jiayi Ji, Fan Zhou, Liang Zheng, Shuicheng Yan, Hao Fei, Tat-Seng Chua

Main category: cs.CV

TL;DR: JavisGPT is the first unified multimodal LLM for joint audio-video comprehension and generation, using a SyncFusion module and three-stage training to achieve synchronized audio-video understanding and creation.

Details

Motivation: There's a need for a unified model that can jointly understand and generate synchronized audio-video content, as existing multimodal LLMs lack this capability for temporally coherent audio-video processing.

Method: Uses encoder-LLM-decoder architecture with SyncFusion module for spatio-temporal audio-video fusion and synchrony-aware queries. Trained in three stages: multimodal pretraining, audio-video fine-tuning, and large-scale instruction-tuning using JavisInst-Omni dataset (200K+ GPT-4o-curated dialogues).

Result: Outperforms existing MLLMs on JAV comprehension and generation benchmarks, especially in complex and temporally synchronized settings.

Conclusion: JavisGPT successfully demonstrates unified joint audio-video comprehension and generation capabilities, setting new standards for synchronized multimodal understanding and creation.

Abstract: This paper presents JavisGPT, the first unified multimodal large language model (MLLM) for Joint Audio-Video (JAV) comprehension and generation. JavisGPT adopts a concise encoder-LLM-decoder architecture, featuring a SyncFusion module for spatio-temporal audio-video fusion and synchrony-aware learnable queries to bridge a pretrained JAV-DiT generator. This design enables temporally coherent video-audio understanding and generation from multimodal instructions. We design an effective three-stage training pipeline consisting of multimodal pretraining, audio-video fine-tuning, and large-scale instruction-tuning, to progressively build multimodal comprehension and generation from existing vision-language models. To support this, we further construct JavisInst-Omni, a high-quality instruction dataset with over 200K GPT-4o-curated audio-video-text dialogues that span diverse and multi-level comprehension and generation scenarios. Extensive experiments on JAV comprehension and generation benchmarks show that JavisGPT outperforms existing MLLMs, particularly in complex and temporally synchronized settings.

[213] ColaVLA: Leveraging Cognitive Latent Reasoning for Hierarchical Parallel Trajectory Planning in Autonomous Driving

Qihang Peng, Xuesong Chen, Chenye Yang, Shaoshuai Shi, Hongsheng Li

Main category: cs.CV

TL;DR: ColaVLA is a vision-language-action framework for autonomous driving that transfers VLM reasoning to a latent space and uses hierarchical parallel planning to generate trajectories efficiently in real-time.

Details

Motivation: Current VLM-based autonomous driving planners face challenges: mismatch between discrete text reasoning and continuous control, high latency from autoregressive decoding, and inefficient/non-causal planners limiting real-time deployment.

Method: Two-component system: 1) Cognitive Latent Reasoner compresses scene understanding into decision-oriented meta-action embeddings using ego-adaptive selection with only two VLM forward passes; 2) Hierarchical Parallel Planner generates multi-scale, causality-consistent trajectories in a single forward pass.

Result: Achieves state-of-the-art performance on nuScenes benchmark in both open-loop and closed-loop settings with favorable efficiency and robustness.

Conclusion: ColaVLA preserves VLM generalization and interpretability while enabling efficient, accurate, and safe trajectory generation for autonomous driving.

Abstract: Autonomous driving requires generating safe and reliable trajectories from complex multimodal inputs. Traditional modular pipelines separate perception, prediction, and planning, while recent end-to-end (E2E) systems learn them jointly. Vision-language models (VLMs) further enrich this paradigm by introducing cross-modal priors and commonsense reasoning, yet current VLM-based planners face three key challenges: (i) a mismatch between discrete text reasoning and continuous control, (ii) high latency from autoregressive chain-of-thought decoding, and (iii) inefficient or non-causal planners that limit real-time deployment. We propose ColaVLA, a unified vision-language-action framework that transfers reasoning from text to a unified latent space and couples it with a hierarchical, parallel trajectory decoder. The Cognitive Latent Reasoner compresses scene understanding into compact, decision-oriented meta-action embeddings through ego-adaptive selection and only two VLM forward passes. The Hierarchical Parallel Planner then generates multi-scale, causality-consistent trajectories in a single forward pass. Together, these components preserve the generalization and interpretability of VLMs while enabling efficient, accurate and safe trajectory generation. Experiments on the nuScenes benchmark show that ColaVLA achieves state-of-the-art performance in both open-loop and closed-loop settings with favorable efficiency and robustness.

[214] OpenGround: Active Cognition-based Reasoning for Open-World 3D Visual Grounding

Wenyuan Huang, Zhao Wang, Zhou Wei, Ting Huang, Fang Zhao, Jian Yang, Zhenyu Zhang

Main category: cs.CV

TL;DR: OpenGround is a zero-shot framework for open-world 3D visual grounding that overcomes limitations of pre-defined object lookup tables through active cognition-based reasoning.

Details

Motivation: Existing 3D visual grounding methods rely on pre-defined Object Lookup Tables (OLTs) to query VLMs, which limits applications in scenarios with undefined or unforeseen targets. This restricts the ability to handle open-world scenarios where objects may not be pre-defined.

Method: OpenGround introduces an Active Cognition-based Reasoning (ACR) module that progressively augments VLM cognition through a cognitive task chain. It performs human-like perception of targets and actively reasons about contextually relevant objects using a dynamically updated OLT, enabling both pre-defined and open-world category handling.

Result: OpenGround achieves competitive performance on Nr3D, state-of-the-art on ScanRefer, and delivers a substantial 17.6% improvement on the new OpenTarget dataset containing over 7000 object-description pairs for open-world evaluation.

Conclusion: OpenGround successfully addresses the limitation of pre-defined OLTs in 3D visual grounding, enabling open-world applications through active cognition-based reasoning and dynamic OLT updates, with strong performance across multiple benchmarks.

Abstract: 3D visual grounding aims to locate objects based on natural language descriptions in 3D scenes. Existing methods rely on a pre-defined Object Lookup Table (OLT) to query Visual Language Models (VLMs) for reasoning about object locations, which limits the applications in scenarios with undefined or unforeseen targets. To address this problem, we present OpenGround, a novel zero-shot framework for open-world 3D visual grounding. Central to OpenGround is the Active Cognition-based Reasoning (ACR) module, which is designed to overcome the fundamental limitation of pre-defined OLTs by progressively augmenting the cognitive scope of VLMs. The ACR module performs human-like perception of the target via a cognitive task chain and actively reasons about contextually relevant objects, thereby extending VLM cognition through a dynamically updated OLT. This allows OpenGround to function with both pre-defined and open-world categories. We also propose a new dataset named OpenTarget, which contains over 7000 object-description pairs to evaluate our method in open-world scenarios. Extensive experiments demonstrate that OpenGround achieves competitive performance on Nr3D, state-of-the-art on ScanRefer, and delivers a substantial 17.6% improvement on OpenTarget. Project Page at this https URL.

[215] Learning Where to Focus: Density-Driven Guidance for Detecting Dense Tiny Objects

Zhicheng Zhao, Xuanang Fan, Lingma Sun, Chenglong Li, Jin Tang

Main category: cs.CV

TL;DR: DRMNet uses density maps as spatial priors to guide adaptive feature learning for detecting dense tiny objects in remote sensing imagery, outperforming state-of-the-art methods on challenging datasets.

Details

Motivation: High-resolution remote sensing imagery contains dense clusters of tiny objects that are challenging to detect due to severe mutual occlusion and limited pixel footprints. Existing methods allocate computational resources uniformly and fail to adaptively focus on density-concentrated regions, hindering feature learning effectiveness.

Method: Proposes Dense Region Mining Network (DRMNet) with three key components: 1) Density Generation Branch (DGB) to model object distribution patterns as spatial priors, 2) Dense Area Focusing Module (DAFM) that uses density maps to identify and focus computational resources on dense areas for efficient local-global feature interaction, and 3) Dual Filter Fusion Module (DFFM) that disentangles multi-scale features into high- and low-frequency components using discrete cosine transform and performs density-guided cross-attention to enhance complementarity while suppressing background interference.

Result: Extensive experiments on AI-TOD and DTOD datasets demonstrate that DRMNet surpasses state-of-the-art methods, particularly in complex scenarios with high object density and severe occlusion.

Conclusion: DRMNet effectively addresses the challenges of detecting dense tiny objects in remote sensing imagery by using density maps as explicit spatial priors to guide adaptive feature learning, enabling better focus on dense regions and overcoming limitations of uniform computational resource allocation in existing methods.

Abstract: High-resolution remote sensing imagery increasingly contains dense clusters of tiny objects, the detection of which is extremely challenging due to severe mutual occlusion and limited pixel footprints. Existing detection methods typically allocate computational resources uniformly, failing to adaptively focus on these density-concentrated regions, which hinders feature learning effectiveness. To address these limitations, we propose the Dense Region Mining Network (DRMNet), which leverages density maps as explicit spatial priors to guide adaptive feature learning. First, we design a Density Generation Branch (DGB) to model object distribution patterns, providing quantifiable priors that guide the network toward dense regions. Second, to address the computational bottleneck of global attention, our Dense Area Focusing Module (DAFM) uses these density maps to identify and focus on dense areas, enabling efficient local-global feature interaction. Finally, to mitigate feature degradation during hierarchical extraction, we introduce a Dual Filter Fusion Module (DFFM). It disentangles multi-scale features into high- and low-frequency components using a discrete cosine transform and then performs density-guided cross-attention to enhance complementarity while suppressing background interference. Extensive experiments on the AI-TOD and DTOD datasets demonstrate that DRMNet surpasses state-of-the-art methods, particularly in complex scenarios with high object density and severe occlusion.

[216] An Architecture-Led Hybrid Report on Body Language Detection Project

Thomson Tong, Diba Darooneh

Main category: cs.CV

TL;DR: Analysis of two vision-language models (Qwen2.5-VL-7B-Instruct and Llama-4-Scout-17B-16E-Instruct) and their application in a video-to-artifact pipeline for person detection with emotion attributes, highlighting architectural properties and practical implementation considerations.

Details

Motivation: To understand how modern vision-language models' architectural properties translate to practical video analysis systems, specifically for detecting visible people with emotion attributes, and to identify critical distinctions between model behavior and system requirements for robust implementation.

Method: Architecture-led analysis of two VLMs, mapping their properties to a video-to-artifact pipeline that samples frames, prompts VLMs for person detection with bounding boxes and emotion attributes, validates output structure using predefined schemas, and optionally renders annotated videos.

Result: Identified key system constraints: structured outputs can be syntactically valid but semantically incorrect, schema validation only checks structure (not geometric correctness), person identifiers are frame-local, and interactive analysis returns free-form text rather than schema-enforced JSON.

Conclusion: Understanding the distinctions between model behavior and system requirements is critical for writing defensible claims, designing robust interfaces, and planning evaluation in vision-language model applications for video analysis tasks.

Abstract: This report provides an architecture-led analysis of two modern vision-language models (VLMs), Qwen2.5-VL-7B-Instruct and Llama-4-Scout-17B-16E-Instruct, and explains how their architectural properties map to a practical video-to-artifact pipeline implemented in the BodyLanguageDetection repository [1]. The system samples video frames, prompts a VLM to detect visible people and generate pixel-space bounding boxes with prompt-conditioned attributes (emotion by default), validates output structure using a predefined schema, and optionally renders an annotated video. We first summarize the shared multimodal foundation (visual tokenization, Transformer attention, and instruction following), then describe each architecture at a level sufficient to justify engineering choices without speculative internals. Finally, we connect model behavior to system constraints: structured outputs can be syntactically valid while semantically incorrect, schema validation is structural (not geometric correctness), person identifiers are frame-local in the current prompting contract, and interactive single-frame analysis returns free-form text rather than schema-enforced JSON. These distinctions are critical for writing defensible claims, designing robust interfaces, and planning evaluation.

[217] CLIP-Joint-Detect: End-to-End Joint Training of Object Detectors with Contrastive Vision-Language Supervision

Behnam Raoufi, Hossein Sharify, Mohamad Mahdee Ramezanee, Khosrow Hajsadeghi, Saeed Bagheri Shouraki

Main category: cs.CV

TL;DR: CLIP-Joint-Detect integrates CLIP-style contrastive vision-language supervision into object detectors via joint training with learnable text embeddings, improving performance across architectures while maintaining real-time speed.

Details

Motivation: Conventional object detectors using cross-entropy classification are vulnerable to class imbalance and label noise. The authors aim to leverage CLIP's robust vision-language representations to create more resilient detectors.

Method: A detector-agnostic framework with a lightweight parallel head that projects region/grid features into CLIP embedding space and aligns them with learnable class-specific text embeddings using InfoNCE contrastive loss plus auxiliary cross-entropy, while maintaining all standard detection losses.

Result: Achieved consistent and substantial improvements on Pascal VOC 2007+2012 (with Faster R-CNN) and MS COCO 2017 (with YOLOv11), while preserving real-time inference speed. Extensive experiments show enhanced closed-set detection performance.

Conclusion: Joint optimization with learnable text embeddings significantly improves object detection performance across diverse architectures and datasets, demonstrating the value of integrating CLIP-style contrastive supervision into detection frameworks.

Abstract: Conventional object detectors rely on cross-entropy classification, which can be vulnerable to class imbalance and label noise. We propose CLIP-Joint-Detect, a simple and detector-agnostic framework that integrates CLIP-style contrastive vision-language supervision through end-to-end joint training. A lightweight parallel head projects region or grid features into the CLIP embedding space and aligns them with learnable class-specific text embeddings via InfoNCE contrastive loss and an auxiliary cross-entropy term, while all standard detection losses are optimized simultaneously. The approach applies seamlessly to both two-stage and one-stage architectures. We validate it on Pascal VOC 2007+2012 using Faster R-CNN and on the large-scale MS COCO 2017 benchmark using modern YOLO detectors (YOLOv11), achieving consistent and substantial improvements while preserving real-time inference speed. Extensive experiments and ablations demonstrate that joint optimization with learnable text embeddings markedly enhances closed-set detection performance across diverse architectures and datasets.

[218] Wavelet-based Multi-View Fusion of 4D Radar Tensor and Camera for Robust 3D Object Detection

Runwei Guan, Jianan Liu, Shaofeng Liang, Fangqiang Ding, Shanliang Yao, Xiaokai Bai, Daizong Liu, Tao Huang, Guoqiang Mao, Hui Xiong

Main category: cs.CV

TL;DR: WRCFormer: A novel 3D object detection framework that fuses raw 4D radar cubes with camera data using multi-view representations and wavelet attention, achieving state-of-the-art performance on K-Radar benchmarks.

Details

Motivation: 4D mmWave radar is cost-effective and robust for autonomous driving but suffers from sparsity and limited semantic information. Camera-radar fusion offers complementary strengths, but existing approaches either lose information through point-cloud processing or incur prohibitive computational costs with raw radar data.

Method: WRCFormer fuses raw radar cubes with camera inputs via multi-view representations of decoupled radar cubes. It uses a Wavelet Attention Module in a wavelet-based FPN to enhance sparse radar and image representations, and a two-stage query-based Geometry-guided Progressive Fusion mechanism to integrate multi-view features from both modalities.

Result: Achieves state-of-the-art performance on K-Radar benchmarks, surpassing the best model by approximately 2.4% in all scenarios and 1.6% in sleet scenarios, demonstrating robustness under adverse weather conditions.

Conclusion: WRCFormer effectively addresses the challenges of 4D radar-camera fusion by directly processing raw radar data with efficient multi-view representations and attention mechanisms, providing a robust solution for autonomous driving perception in various weather conditions.

Abstract: 4D millimeter-wave (mmWave) radar has been widely adopted in autonomous driving and robot perception due to its low cost and all-weather robustness. However, its inherent sparsity and limited semantic richness significantly constrain perception capability. Recently, fusing camera data with 4D radar has emerged as a promising cost effective solution, by exploiting the complementary strengths of the two modalities. Nevertheless, point-cloud-based radar often suffer from information loss introduced by multi-stage signal processing, while directly utilizing raw 4D radar data incurs prohibitive computational costs. To address these challenges, we propose WRCFormer, a novel 3D object detection framework that fuses raw radar cubes with camera inputs via multi-view representations of the decoupled radar cube. Specifically, we design a Wavelet Attention Module as the basic module of wavelet-based Feature Pyramid Network (FPN) to enhance the representation of sparse radar signals and image data. We further introduce a two-stage query-based, modality-agnostic fusion mechanism termed Geometry-guided Progressive Fusion to efficiently integrate multi-view features from both modalities. Extensive experiments demonstrate that WRCFormer achieves state-of-the-art performance on the K-Radar benchmarks, surpassing the best model by approximately 2.4% in all scenarios and 1.6% in the sleet scenario, highlighting its robustness under adverse weather conditions.

[219] YOLO-IOD: Towards Real Time Incremental Object Detection

Shizhou Zhang, Xueqiang Lv, Yinghui Xing, Qirui Wu, Di Xu, Chen Zhao, Yanning Zhang

Main category: cs.CV

TL;DR: YOLO-IOD is a real-time incremental object detection framework built on YOLO-World that addresses catastrophic forgetting in YOLO-based detectors through conflict-aware pseudo-label refinement, importance-based kernel selection, and cross-stage asymmetric knowledge distillation.

Details

Motivation: Current incremental object detection methods don't work with real-time YOLO frameworks, and YOLO-based detectors suffer from catastrophic forgetting due to three types of knowledge conflicts: foreground-background confusion, parameter interference, and misaligned knowledge distillation.

Method: YOLO-IOD uses a stage-wise parameter-efficient fine-tuning process with three components: 1) Conflict-Aware Pseudo-Label Refinement (CPR) to mitigate foreground-background confusion, 2) Importance-based Kernel Selection (IKS) to identify and update crucial convolution kernels, and 3) Cross-Stage Asymmetric Knowledge Distillation (CAKD) to address misaligned knowledge distillation by transmitting features through both previous and current teacher detectors.

Result: Experiments on conventional and new LoCo COCO benchmarks show YOLO-IOD achieves superior performance with minimal forgetting compared to existing methods.

Conclusion: YOLO-IOD successfully enables real-time incremental object detection on YOLO frameworks by addressing the three key knowledge conflicts that cause catastrophic forgetting, providing a practical solution for real-world applications.

Abstract: Current methods for incremental object detection (IOD) primarily rely on Faster R-CNN or DETR series detectors; however, these approaches do not accommodate the real-time YOLO detection frameworks. In this paper, we first identify three primary types of knowledge conflicts that contribute to catastrophic forgetting in YOLO-based incremental detectors: foreground-background confusion, parameter interference, and misaligned knowledge distillation. Subsequently, we introduce YOLO-IOD, a real-time Incremental Object Detection (IOD) framework that is constructed upon the pretrained YOLO-World model, facilitating incremental learning via a stage-wise parameter-efficient fine-tuning process. Specifically, YOLO-IOD encompasses three principal components: 1) Conflict-Aware Pseudo-Label Refinement (CPR), which mitigates the foreground-background confusion by leveraging the confidence levels of pseudo labels and identifying potential objects relevant to future tasks. 2) Importancebased Kernel Selection (IKS), which identifies and updates the pivotal convolution kernels pertinent to the current task during the current learning stage. 3) Cross-Stage Asymmetric Knowledge Distillation (CAKD), which addresses the misaligned knowledge distillation conflict by transmitting the features of the student target detector through the detection heads of both the previous and current teacher detectors, thereby facilitating asymmetric distillation between existing and newly introduced categories. We further introduce LoCo COCO, a more realistic benchmark that eliminates data leakage across stages. Experiments on both conventional and LoCo COCO benchmarks show that YOLO-IOD achieves superior performance with minimal forgetting.

[220] RealCamo: Boosting Real Camouflage Synthesis with Layout Controls and Textual-Visual Guidance

Chunyuan Chen, Yunuo Cai, Shujuan Li, Weiyun Liang, Bin Wang, Jing Xu

Main category: cs.CV

TL;DR: ReamCamo: A unified out-painting framework for realistic camouflaged image generation with layout controls and multi-modal guidance to address visual similarity and semantic coherence issues in existing methods.

Details

Motivation: Existing camouflaged image generation methods suffer from two main limitations: 1) generated images lack sufficient camouflage due to weak visual similarity, and 2) they exhibit cluttered backgrounds that are semantically inconsistent with foreground targets, creating a substantial gap from real camouflaged imagery.

Method: Proposes ReamCamo, a unified out-painting based framework that introduces: 1) additional layout controls to regulate global image structure and improve semantic coherence, and 2) multi-modal textual-visual conditions combining unified fine-grained textual task descriptions with texture-oriented background retrieval to enhance visual fidelity and realism.

Result: Extensive experiments and visualizations demonstrate the effectiveness of the proposed framework. The paper also introduces a quantitative background-foreground distribution divergence metric to measure camouflage effectiveness in generated images.

Conclusion: ReamCamo addresses key limitations in existing camouflaged image generation methods by providing better semantic coherence and visual realism through layout controls and multi-modal guidance, offering an improved approach for generating high-quality training data for camouflaged object detection.

Abstract: Camouflaged image generation (CIG) has recently emerged as an efficient alternative for acquiring high-quality training data for camouflaged object detection (COD). However, existing CIG methods still suffer from a substantial gap to real camouflaged imagery: generated images either lack sufficient camouflage due to weak visual similarity, or exhibit cluttered backgrounds that are semantically inconsistent with foreground targets. To address these limitations, we propose ReamCamo, a unified out-painting based framework for realistic camouflaged image generation. ReamCamo explicitly introduces additional layout controls to regulate global image structure, thereby improving semantic coherence between foreground objects and generated backgrounds. Moreover, we construct a multi-modal textual-visual condition by combining a unified fine-grained textual task description with texture-oriented background retrieval, which jointly guides the generation process to enhance visual fidelity and realism. To quantitatively assess camouflage quality, we further introduce a background-foreground distribution divergence metric that measures the effectiveness of camouflage in generated images. Extensive experiments and visualizations demonstrate the effectiveness of our proposed framework.

Huiming Yang, Linglin Liao, Fei Ding, Sibo Wang, Zijian Zeng

Main category: cs.CV

TL;DR: PoseStreamer is a robust multi-modal 6DoF pose estimation framework for high-speed moving objects, using event cameras to overcome motion blur limitations of RGB cameras in low-light scenarios.

Details

Motivation: Standard RGB cameras suffer from motion blur in high-speed and low-light scenarios, making 6DoF pose estimation challenging. Event cameras offer high temporal resolution but current methods still perform poorly in high-speed object movement scenarios.

Method: Three core components: 1) Adaptive Pose Memory Queue for temporal consistency using historical orientation cues, 2) Object-centric 2D Tracker providing strong 2D priors to boost 3D center recall, and 3) Ray Pose Filter for geometric refinement along camera rays. Also introduces MoCapCube6D dataset for benchmarking rapid motion performance.

Result: PoseStreamer achieves superior accuracy in high-speed moving scenarios and exhibits strong generalizability as a template-free framework for unseen moving objects.

Conclusion: The proposed framework effectively addresses the challenges of 6DoF pose estimation in high-speed scenarios by leveraging event cameras and novel architectural components, demonstrating both accuracy and generalizability.

Abstract: Six degree of freedom (6DoF) pose estimation for novel objects is a critical task in computer vision, yet it faces significant challenges in high-speed and low-light scenarios where standard RGB cameras suffer from motion blur. While event cameras offer a promising solution due to their high temporal resolution, current 6DoF pose estimation methods typically yield suboptimal performance in high-speed object moving scenarios. To address this gap, we propose PoseStreamer, a robust multi-modal 6DoF pose estimation framework designed specifically on high-speed moving scenarios. Our approach integrates three core components: an Adaptive Pose Memory Queue that utilizes historical orientation cues for temporal consistency, an Object-centric 2D Tracker that provides strong 2D priors to boost 3D center recall, and a Ray Pose Filter for geometric refinement along camera rays. Furthermore, we introduce MoCapCube6D, a novel multi-modal dataset constructed to benchmark performance under rapid motion. Extensive experiments demonstrate that PoseStreamer not only achieves superior accuracy in high-speed moving scenarios, but also exhibits strong generalizability as a template-free framework for unseen moving objects.

[222] Spatial-aware Symmetric Alignment for Text-guided Medical Image Segmentation

Linglin Liao, Qichuan Geng, Yu Liu

Main category: cs.CV

TL;DR: SSA framework enhances medical image segmentation by aligning image regions with hybrid medical texts containing locational, descriptive, and diagnostic information, using symmetric optimal transport and spatial guidance.

Details

Motivation: Current text-guided medical image segmentation methods struggle with processing both diagnostic and descriptive texts simultaneously, and fail to capture positional constraints, leading to inaccurate segmentation (e.g., confusing left/right lung locations).

Method: Proposes Spatial-aware Symmetric Alignment (SSA) framework with: 1) symmetric optimal transport alignment mechanism for bi-directional fine-grained multimodal correspondences, and 2) composite directional guidance strategy using region-level guidance masks to introduce explicit spatial constraints.

Result: Extensive experiments on public benchmarks show SSA achieves state-of-the-art performance, particularly in accurately segmenting lesions with spatial relational constraints.

Conclusion: SSA effectively addresses limitations of existing methods by enabling simultaneous processing of hybrid medical texts and incorporating spatial constraints, leading to more accurate medical image segmentation.

Abstract: Text-guided Medical Image Segmentation has shown considerable promise for medical image segmentation, with rich clinical text serving as an effective supplement for scarce data. However, current methods have two key bottlenecks. On one hand, they struggle to process diagnostic and descriptive texts simultaneously, making it difficult to identify lesions and establish associations with image regions. On the other hand, existing approaches focus on lesions description and fail to capture positional constraints, leading to critical deviations. Specifically, with the text “in the left lower lung”, the segmentation results may incorrectly cover both sides of the lung. To address the limitations, we propose the Spatial-aware Symmetric Alignment (SSA) framework to enhance the capacity of referring hybrid medical texts consisting of locational, descriptive, and diagnostic information. Specifically, we propose symmetric optimal transport alignment mechanism to strengthen the associations between image regions and multiple relevant expressions, which establishes bi-directional fine-grained multimodal correspondences. In addition, we devise a composite directional guidance strategy that explicitly introduces spatial constraints in the text by constructing region-level guidance masks. Extensive experiments on public benchmarks demonstrate that SSA achieves state-of-the-art (SOTA) performance, particularly in accurately segmenting lesions characterized by spatial relational constraints.

[223] MedSAM-based lung masking for multi-label chest X-ray classification

Brayden Miao, Zain Rehman, Xin Miao, Siming Liu, Jianjie Wang

Main category: cs.CV

TL;DR: Segmentation-guided CXR classification using MedSAM for lung extraction improves normal case screening but shows task/architecture-dependent effects on abnormality detection.

Details

Motivation: Automated CXR interpretation faces challenges due to weak disease signals, dataset bias, and limited spatial supervision. Foundation models like MedSAM offer anatomically grounded priors to improve robustness and interpretability in CXR analysis.

Method: Proposed segmentation-guided CXR classification pipeline: fine-tuned MedSAM for lung region extraction, applied to NIH CXR dataset subset, trained/evaluated deep CNNs for multi-label prediction of 5 abnormalities (Mass, Nodule, Pneumonia, Edema, Fibrosis) with normal case scoring.

Result: MedSAM produces anatomically plausible lung masks. Masking effects are task/architecture-dependent: ResNet50 on original images best for abnormality discrimination; loose masking yields comparable macro AUROC but significantly improves No Finding discrimination; tight masking reduces abnormality performance but improves training efficiency.

Conclusion: Lung masking should be treated as a controllable spatial prior selected to match backbone architecture and clinical objective, rather than applied uniformly. Loose masking preserves perihilar/peripheral context and partially mitigates performance degradation.

Abstract: Chest X-ray (CXR) imaging is widely used for screening and diagnosing pulmonary abnormalities, yet automated interpretation remains challenging due to weak disease signals, dataset bias, and limited spatial supervision. Foundation models for medical image segmentation (MedSAM) provide an opportunity to introduce anatomically grounded priors that may improve robustness and interpretability in CXR analysis. We propose a segmentation-guided CXR classification pipeline that integrates MedSAM as a lung region extraction module prior to multi-label abnormality classification. MedSAM is fine-tuned using a public image-mask dataset from Airlangga University Hospital. We then apply it to a curated subset of the public NIH CXR dataset to train and evaluate deep convolutional neural networks for multi-label prediction of five abnormalities (Mass, Nodule, Pneumonia, Edema, and Fibrosis), with the normal case (No Finding) evaluated via a derived score. Experiments show that MedSAM produces anatomically plausible lung masks across diverse imaging conditions. We find that masking effects are both task-dependent and architecture-dependent. ResNet50 trained on original images achieves the strongest overall abnormality discrimination, while loose lung masking yields comparable macro AUROC but significantly improves No Finding discrimination, indicating a trade-off between abnormality-specific classification and normal case screening. Tight masking consistently reduces abnormality level performance but improves training efficiency. Loose masking partially mitigates this degradation by preserving perihilar and peripheral context. These results suggest that lung masking should be treated as a controllable spatial prior selected to match the backbone and clinical objective, rather than applied uniformly.

[224] Reverse Personalization

Han-Wei Kung, Tuomas Varanka, Nicu Sebe

Main category: cs.CV

TL;DR: A reverse personalization framework for face anonymization using conditional diffusion inversion without text prompts, enabling attribute-controllable identity removal while preserving image quality.

Details

Motivation: Existing prompt-based methods for identity removal/modification either require subjects to be well-represented in pre-trained models or need fine-tuning for specific identities, lacking control over facial attributes during anonymization.

Method: Uses conditional diffusion inversion for direct image manipulation without text prompts, incorporates identity-guided conditioning branch to generalize beyond training data subjects, and enables attribute-controllable anonymization.

Result: Achieves state-of-the-art balance between identity removal, attribute preservation, and image quality, outperforming prior anonymization methods.

Conclusion: The reverse personalization framework provides an effective solution for face anonymization with attribute control, generalizing to unseen identities without requiring fine-tuning or text prompts.

Abstract: Recent text-to-image diffusion models have demonstrated remarkable generation of realistic facial images conditioned on textual prompts and human identities, enabling creating personalized facial imagery. However, existing prompt-based methods for removing or modifying identity-specific features rely either on the subject being well-represented in the pre-trained model or require model fine-tuning for specific identities. In this work, we analyze the identity generation process and introduce a reverse personalization framework for face anonymization. Our approach leverages conditional diffusion inversion, allowing direct manipulation of images without using text prompts. To generalize beyond subjects in the model’s training data, we incorporate an identity-guided conditioning branch. Unlike prior anonymization methods, which lack control over facial attributes, our framework supports attribute-controllable anonymization. We demonstrate that our method achieves a state-of-the-art balance between identity removal, attribute preservation, and image quality. Source code and data are available at https://github.com/hanweikung/reverse-personalization .

[225] A Low-Cost UAV Deep Learning Pipeline for Integrated Apple Disease Diagnosis,Freshness Assessment, and Fruit Detection

Soham Dutta, Soham Banerjee, Sneha Mahata, Anindya Sen, Sayantani Datta

Main category: cs.CV

TL;DR: A unified RGB-only UAV pipeline for apple orchards that performs leaf disease detection, fruit freshness assessment, and yield estimation using deep learning models on low-cost hardware.

Details

Motivation: Apple orchards need integrated solutions for disease detection, quality assessment, and yield estimation, but existing UAV systems are fragmented, rely on expensive multispectral sensors, and often require cloud connectivity.

Method: Uses ResNet50 for leaf disease detection, VGG16 for apple freshness determination, and YOLOv8 for real-time apple detection/localization. Runs on ESP32-CAM and Raspberry Pi for fully offline on-site inference without cloud support.

Result: Achieved 98.9% accuracy for leaf disease classification, 97.4% accuracy for freshness classification, and 0.857 F1 score for apple detection. Provides accessible alternative to multispectral UAV solutions.

Conclusion: The unified RGB-only pipeline offers a practical, scalable, and affordable precision agriculture solution that integrates multiple orchard management tasks on low-cost hardware without cloud dependency.

Abstract: Apple orchards require timely disease detection, fruit quality assessment, and yield estimation, yet existing UAV-based systems address such tasks in isolation and often rely on costly multispectral sensors. This paper presents a unified, low-cost RGB-only UAV-based orchard intelligent pipeline integrating ResNet50 for leaf disease detection, VGG 16 for apple freshness determination, and YOLOv8 for real-time apple detection and localization. The system runs on an ESP32-CAM and Raspberry Pi, providing fully offline on-site inference without cloud support. Experiments demonstrate 98.9% accuracy for leaf disease classification, 97.4% accuracy for freshness classification, and 0.857 F1 score for apple detection. The framework provides an accessible and scalable alternative to multispectral UAV solutions, supporting practical precision agriculture on affordable hardware.

[226] PathoSyn: Imaging-Pathology MRI Synthesis via Disentangled Deviation Diffusion

Jian Wang, Sixing Rong, Jiarui Xing, Yuling Xu, Weide Liu

Main category: cs.CV

TL;DR: PathoSyn is a unified generative framework for MRI synthesis that disentangles anatomical structure from pathological deviations using a deviation-space diffusion model to produce high-fidelity synthetic medical images.

Details

Motivation: Current generative models for MRI synthesis suffer from feature entanglement when operating in global pixel domains or using binary masks, leading to corrupted anatomical structures and structural discontinuities in synthetic images.

Method: Decomposes synthesis into deterministic anatomical reconstruction and stochastic deviation modeling using a Deviation-Space Diffusion Model that learns conditional distributions of pathological residuals. Includes seam-aware fusion strategy and inference-time stabilization module for spatial coherence.

Result: Quantitatively and qualitatively outperforms holistic diffusion and mask-conditioned baselines on tumor imaging benchmarks in both perceptual realism and anatomical fidelity.

Conclusion: Provides a mathematically principled pipeline for generating patient-specific synthetic datasets, enabling robust diagnostic algorithm development, interpretable counterfactual disease progression modeling, and precision intervention planning.

Abstract: We present PathoSyn, a unified generative framework for Magnetic Resonance Imaging (MRI) image synthesis that reformulates imaging-pathology as a disentangled additive deviation on a stable anatomical manifold. Current generative models typically operate in the global pixel domain or rely on binary masks, these paradigms often suffer from feature entanglement, leading to corrupted anatomical substrates or structural discontinuities. PathoSyn addresses these limitations by decomposing the synthesis task into deterministic anatomical reconstruction and stochastic deviation modeling. Central to our framework is a Deviation-Space Diffusion Model designed to learn the conditional distribution of pathological residuals, thereby capturing localized intensity variations while preserving global structural integrity by construction. To ensure spatial coherence, the diffusion process is coupled with a seam-aware fusion strategy and an inference-time stabilization module, which collectively suppress boundary artifacts and produce high-fidelity internal lesion heterogeneity. PathoSyn provides a mathematically principled pipeline for generating high-fidelity patient-specific synthetic datasets, facilitating the development of robust diagnostic algorithms in low-data regimes. By allowing interpretable counterfactual disease progression modeling, the framework supports precision intervention planning and provides a controlled environment for benchmarking clinical decision-support systems. Quantitative and qualitative evaluations on tumor imaging benchmarks demonstrate that PathoSyn significantly outperforms holistic diffusion and mask-conditioned baselines in both perceptual realism and anatomical fidelity. The source code of this work will be made publicly available.

[227] With Great Context Comes Great Prediction Power: Classifying Objects via Geo-Semantic Scene Graphs

Ciprian Constantinescu, Marius Leordeanu

Main category: cs.CV

TL;DR: A novel contextual object classification framework using Geo-Semantic Contextual Graphs (GSCG) that integrates depth estimation with panoptic/material segmentation to explicitly model scene context, achieving 73.4% accuracy and outperforming context-agnostic and LLM-based approaches.

Details

Motivation: Humans use rich scene context (spatial relationships, material properties, object co-occurrence) for object recognition, but most computational systems operate on isolated image regions, ignoring this vital contextual information.

Method: Constructs Geo-Semantic Contextual Graph (GSCG) from monocular images by integrating metric depth estimation with unified panoptic and material segmentation. Objects become nodes with geometric/chromatic/material attributes, spatial relationships become edges. Uses graph-based classifier that aggregates features from target object, immediate neighbors, and global scene context.

Result: Achieves 73.4% classification accuracy on COCO 2017, dramatically outperforming context-agnostic versions (38.4%), fine-tuned ResNet models (max 53.5%), and state-of-the-art multimodal LLM Llama 4 Scout (42.3%).

Conclusion: Explicitly structured and interpretable context modeling through GSCG significantly improves object recognition, demonstrating superiority over both traditional deep learning and modern LLM approaches.

Abstract: Humans effortlessly identify objects by leveraging a rich understanding of the surrounding scene, including spatial relationships, material properties, and the co-occurrence of other objects. In contrast, most computational object recognition systems operate on isolated image regions, devoid of meaning in isolation, thus ignoring this vital contextual information. This paper argues for the critical role of context and introduces a novel framework for contextual object classification. We first construct a Geo-Semantic Contextual Graph (GSCG) from a single monocular image. This rich, structured representation is built by integrating a metric depth estimator with a unified panoptic and material segmentation model. The GSCG encodes objects as nodes with detailed geometric, chromatic, and material attributes, and their spatial relationships as edges. This explicit graph structure makes the model’s reasoning process inherently interpretable. We then propose a specialized graph-based classifier that aggregates features from a target object, its immediate neighbors, and the global scene context to predict its class. Through extensive ablation studies, we demonstrate that our context-aware model achieves a classification accuracy of 73.4%, dramatically outperforming context-agnostic versions (as low as 38.4%). Furthermore, our GSCG-based approach significantly surpasses strong baselines, including fine-tuned ResNet models (max 53.5%) and a state-of-the-art multimodal Large Language Model (LLM), Llama 4 Scout, which, even when given the full image alongside a detailed description of objects, maxes out at 42.3%. These results on COCO 2017 train/val splits highlight the superiority of explicitly structured and interpretable context for object recognition tasks.

[228] Toward Stable Semi-Supervised Remote Sensing Segmentation via Co-Guidance and Co-Fusion

Yi Zhou, Xuechao Zou, Shun Zhang, Kai Li, Shiying Wang, Jingming Chen, Congyan Lang, Tengfei Cao, Pin Tao, Yuanchun Shi

Main category: cs.CV

TL;DR: Co2S is a stable semi-supervised remote sensing image segmentation framework that fuses priors from CLIP and DINOv3 vision foundation models to mitigate pseudo-label drift through heterogeneous dual-student architecture with semantic co-guidance and feature fusion.

Details

Motivation: Semi-supervised remote sensing image segmentation suffers from pseudo-label drift, where confirmation bias leads to error accumulation during training, limiting its effectiveness in reducing annotation burden.

Method: Proposes Co2S with heterogeneous dual-student architecture using ViT-based models initialized with CLIP and DINOv3 priors. Introduces explicit-implicit semantic co-guidance mechanism using text embeddings and learnable queries, plus global-local feature collaborative fusion strategy to combine CLIP’s global context with DINOv3’s local details.

Result: Extensive experiments on six popular datasets show superior performance, consistently achieving leading results across various partition protocols and diverse scenarios.

Conclusion: Co2S effectively mitigates pseudo-label drift in semi-supervised remote sensing segmentation by synergistically fusing vision-language and self-supervised priors through innovative architectural and guidance mechanisms.

Abstract: Semi-supervised remote sensing (RS) image semantic segmentation offers a promising solution to alleviate the burden of exhaustive annotation, yet it fundamentally struggles with pseudo-label drift, a phenomenon where confirmation bias leads to the accumulation of errors during training. In this work, we propose Co2S, a stable semi-supervised RS segmentation framework that synergistically fuses priors from vision-language models and self-supervised models. Specifically, we construct a heterogeneous dual-student architecture comprising two distinct ViT-based vision foundation models initialized with pretrained CLIP and DINOv3 to mitigate error accumulation and pseudo-label drift. To effectively incorporate these distinct priors, an explicit-implicit semantic co-guidance mechanism is introduced that utilizes text embeddings and learnable queries to provide explicit and implicit class-level guidance, respectively, thereby jointly enhancing semantic consistency. Furthermore, a global-local feature collaborative fusion strategy is developed to effectively fuse the global contextual information captured by CLIP with the local details produced by DINOv3, enabling the model to generate highly precise segmentation results. Extensive experiments on six popular datasets demonstrate the superiority of the proposed method, which consistently achieves leading performance across various partition protocols and diverse scenarios. Project page is available at https://xavierjiezou.github.io/Co2S/.

[229] 3D sans 3D Scans: Scalable Pre-training from Video-Generated Point Clouds

Ryousuke Yamada, Kohsuke Ide, Yoshihiro Fukuhara, Hirokatsu Kataoka, Gilles Puy, Andrei Bursuc, Yuki M. Asano

Main category: cs.CV

TL;DR: LAM3C learns 3D representations from unlabeled videos without real 3D sensors, achieving state-of-the-art performance on indoor segmentation tasks using video-generated point clouds.

Details

Motivation: Collecting large-scale 3D scene scans is expensive and labor-intensive, so the paper explores whether 3D representations can be learned from unlabeled videos recorded without real 3D sensors.

Method: Proposes LAM3C framework with: 1) RoomTours dataset of 49,219 video-generated point clouds from web videos, 2) noise-regularized loss enforcing local geometric smoothness and feature stability under noisy point clouds, 3) Laplacian-aware multi-level 3D clustering with Sinkhorn-Knopp algorithm.

Result: Without using any real 3D scans, LAM3C achieves higher performance than previous self-supervised methods on indoor semantic and instance segmentation tasks.

Conclusion: Unlabeled videos represent an abundant source of data for 3D self-supervised learning, enabling effective 3D representation learning without expensive 3D sensor data collection.

Abstract: Despite recent progress in 3D self-supervised learning, collecting large-scale 3D scene scans remains expensive and labor-intensive. In this work, we investigate whether 3D representations can be learned from unlabeled videos recorded without any real 3D sensors. We present Laplacian-Aware Multi-level 3D Clustering with Sinkhorn-Knopp (LAM3C), a self-supervised framework that learns from video-generated point clouds from unlabeled videos. We first introduce RoomTours, a video-generated point cloud dataset constructed by collecting room-walkthrough videos from the web (e.g., real-estate tours) and generating 49,219 scenes using an off-the-shelf feed-forward reconstruction model. We also propose a noise-regularized loss that stabilizes representation learning by enforcing local geometric smoothness and ensuring feature stability under noisy point clouds. Remarkably, without using any real 3D scans, LAM3C achieves higher performance than the previous self-supervised methods on indoor semantic and instance segmentation. These results suggest that unlabeled videos represent an abundant source of data for 3D self-supervised learning.

[230] Video-BrowseComp: Benchmarking Agentic Video Research on Open Web

Zhengyang Liang, Yan Shu, Xiangrui Liu, Minghao Qin, Kaixin Liang, Paolo Rota, Nicu Sebe, Zheng Liu, Lizi Liao

Main category: cs.CV

TL;DR: Video-BrowseComp is a benchmark for agentic video reasoning requiring active web research and temporal video evidence verification, revealing current models’ heavy reliance on text proxies and poor performance in visual-grounding tasks.

Details

Motivation: Current video benchmarks focus on passive perception rather than active agentic research, creating a modality gap for dynamic web video content that requires temporal navigation and cross-referencing.

Method: Created Video-BrowseComp with 210 questions that enforce mandatory dependency on temporal visual evidence, requiring navigation of video timelines to verify external claims rather than relying on text search alone.

Result: State-of-the-art models like GPT-5.1 (with Search) achieve only 15.24% accuracy, showing heavy reliance on textual proxies and collapsing in metadata-sparse domains like sports and gameplay where visual grounding is essential.

Conclusion: Video-BrowseComp advances video reasoning beyond passive perception toward proactive agentic research, highlighting critical bottlenecks in current models’ ability to handle dynamic visual evidence verification.

Abstract: The evolution of autonomous agents is redefining information seeking, transitioning from passive retrieval to proactive, open-ended web research. However, while textual and static multimodal agents have seen rapid progress, a significant modality gap remains in processing the web’s most dynamic modality: video. Existing video benchmarks predominantly focus on passive perception, feeding curated clips to models without requiring external retrieval. They fail to evaluate agentic video research, which necessitates actively interrogating video timelines, cross-referencing dispersed evidence, and verifying claims against the open web. To bridge this gap, we present \textbf{Video-BrowseComp}, a challenging benchmark comprising 210 questions tailored for open-web agentic video reasoning. Unlike prior benchmarks, Video-BrowseComp enforces a mandatory dependency on temporal visual evidence, ensuring that answers cannot be derived solely through text search but require navigating video timelines to verify external claims. Our evaluation of state-of-the-art models reveals a critical bottleneck: even advanced search-augmented models like GPT-5.1 (w/ Search) achieve only 15.24% accuracy. Our analysis reveals that these models largely rely on textual proxies, excelling in metadata-rich domains (e.g., TV shows with plot summaries) but collapsing in metadata-sparse, dynamic environments (e.g., sports, gameplay) where visual grounding is essential. As the first open-web video research benchmark, Video-BrowseComp advances the field beyond passive perception toward proactive video reasoning.

[231] ForCM: Forest Cover Mapping from Multispectral Sentinel-2 Image by Integrating Deep Learning with Object-Based Image Analysis

Maisha Haque, Israt Jahan Ayshi, Sadaf M. Anis, Nahian Tasnim, Mithila Moontaha, Md. Sabbir Ahmed, Muhammad Iqbal Hossain, Mohammad Zavid Parvez, Subrata Chakraborty, Biswajeet Pradhan, Biswajit Banik

Main category: cs.CV

TL;DR: ForCM combines Object-Based Image Analysis with Deep Learning using Sentinel-2 imagery for improved forest cover mapping in the Amazon, achieving up to 95.64% accuracy.

Details

Motivation: To enhance forest cover mapping accuracy by integrating OBIA with deep learning models, overcoming limitations of traditional OBIA methods for better environmental monitoring and conservation.

Method: Proposes ForCM approach combining OBIA with DL models (UNet, UNet++, ResUNet, AttentionUNet, ResNet50-Segnet) using multispectral Sentinel-2 imagery of Amazon Rainforest, evaluated on three datasets with different band combinations.

Result: ForCM significantly improves mapping accuracy: ResUNet-OBIA achieves 94.54% overall accuracy, AttentionUNet-OBIA achieves 95.64%, compared to 92.91% with traditional OBIA alone.

Conclusion: Integration of OBIA with deep learning models (especially ResUNet and AttentionUNet) substantially enhances forest cover mapping accuracy, demonstrating the potential of free tools like QGIS for environmental monitoring despite their limitations.

Abstract: This research proposes “ForCM”, a novel approach to forest cover mapping that combines Object-Based Image Analysis (OBIA) with Deep Learning (DL) using multispectral Sentinel-2 imagery. The study explores several DL models, including UNet, UNet++, ResUNet, AttentionUNet, and ResNet50-Segnet, applied to high-resolution Sentinel-2 Level 2A satellite images of the Amazon Rainforest. The datasets comprise three collections: two sets of three-band imagery and one set of four-band imagery. After evaluation, the most effective DL models are individually integrated with the OBIA technique to enhance mapping accuracy. The originality of this work lies in evaluating different deep learning models combined with OBIA and comparing them with traditional OBIA methods. The results show that the proposed ForCM method improves forest cover mapping, achieving overall accuracies of 94.54 percent with ResUNet-OBIA and 95.64 percent with AttentionUNet-OBIA, compared to 92.91 percent using traditional OBIA. This research also demonstrates the potential of free and user-friendly tools such as QGIS for accurate mapping within their limitations, supporting global environmental monitoring and conservation efforts.

[232] Domain-Shift Immunity in Deep Deformable Registration via Local Feature Representations

Mingzhen Shao, Sarang Joshi

Main category: cs.CV

TL;DR: Deep deformable registration models are inherently robust to domain shift due to their reliance on local features, not global appearance, as demonstrated by UniReg framework.

Details

Motivation: To understand why learning-based deformable registration models show robustness to domain shift, challenging the common belief that they require large diverse datasets for robustness.

Method: Introduces UniReg, a universal registration framework that decouples feature extraction from deformation estimation using fixed pre-trained feature extractors and a UNet-based deformation network, trained on a single dataset.

Result: UniReg achieves robust cross-domain and multi-modal performance comparable to optimization-based methods despite single-dataset training. Analysis reveals conventional CNN failures under modality shift stem from dataset-induced biases in early convolutional layers.

Conclusion: Local feature consistency is the key driver of robustness in learning-based deformable registration, motivating backbone designs that preserve domain-invariant local features rather than requiring large diverse training datasets.

Abstract: Deep learning has advanced deformable image registration, surpassing traditional optimization-based methods in both accuracy and efficiency. However, learning-based models are widely believed to be sensitive to domain shift, with robustness typically pursued through large and diverse training datasets, without explaining the underlying mechanisms. In this work, we show that domain-shift immunity is an inherent property of deep deformable registration models, arising from their reliance on local feature representations rather than global appearance for deformation estimation. To isolate and validate this mechanism, we introduce UniReg, a universal registration framework that decouples feature extraction from deformation estimation using fixed, pre-trained feature extractors and a UNet-based deformation network. Despite training on a single dataset, UniReg exhibits robust cross-domain and multi-modal performance comparable to optimization-based methods. Our analysis further reveals that failures of conventional CNN-based models under modality shift originate from dataset-induced biases in early convolutional layers. These findings identify local feature consistency as the key driver of robustness in learning-based deformable registration and motivate backbone designs that preserve domain-invariant local features.

[233] Exploring Syn-to-Real Domain Adaptation for Military Target Detection

Jongoh Jeong, Youngjin Oh, Gyeongrae Nam, Jeongeun Lee, Kuk-Jin Yoon

Main category: cs.CV

TL;DR: Researchers propose using Unreal Engine to generate synthetic RGB data for military target detection, addressing the high cost of SAR data and lack of military datasets, and benchmark domain adaptation methods on synthetic-to-real transfer tasks.

Details

Motivation: Military object detection faces challenges with domain adaptation across varied environments, high costs of SAR data acquisition/processing, and lack of military target datasets. RGB cameras offer affordable alternatives but need synthetic data generation.

Method: Generate synthetic RGB data using Unreal Engine photorealistic visual tool for military target detection. Conduct synthetic-to-real transfer experiments by training on synthetic dataset and validating on web-collected real military target datasets. Benchmark state-of-the-art domain adaptation methods with varying supervision levels.

Result: Current domain adaptation methods using minimal image hints (e.g., object class) achieve substantial improvement over unsupervised or semi-supervised methods. The study identifies remaining challenges in cross-domain military target detection.

Conclusion: Synthetic data generation using Unreal Engine offers a viable approach for military target detection, but current domain adaptation methods still face challenges in handling the complexity of military domains with multiple varying target environments.

Abstract: Object detection is one of the key target tasks of interest in the context of civil and military applications. In particular, the real-world deployment of target detection methods is pivotal in the decision-making process during military command and reconnaissance. However, current domain adaptive object detection algorithms consider adapting one domain to another similar one only within the scope of natural or autonomous driving scenes. Since military domains often deal with a mixed variety of environments, detecting objects from multiple varying target domains poses a greater challenge. Several studies for armored military target detection have made use of synthetic aperture radar (SAR) data due to its robustness to all weather, long range, and high-resolution characteristics. Nevertheless, the costs of SAR data acquisition and processing are still much higher than those of the conventional RGB camera, which is a more affordable alternative with significantly lower data processing time. Furthermore, the lack of military target detection datasets limits the use of such a low-cost approach. To mitigate these issues, we propose to generate RGB-based synthetic data using a photorealistic visual tool, Unreal Engine, for military target detection in a cross-domain setting. To this end, we conducted synthetic-to-real transfer experiments by training our synthetic dataset and validating on our web-collected real military target datasets. We benchmark the state-of-the-art domain adaptation methods distinguished by the degree of supervision on our proposed train-val dataset pair, and find that current methods using minimal hints on the image (e.g., object class) achieve a substantial improvement over unsupervised or semi-supervised DA methods. From these observations, we recognize the current challenges that remain to be overcome.

[234] GeoTeacher: Geometry-Guided Semi-Supervised 3D Object Detection

Jingyu Li, Xiaolong Zhao, Zhe Liu, Wenxiao Wu, Li Zhang

Main category: cs.CV

TL;DR: GeoTeacher is a semi-supervised 3D object detection method that enhances geometric relation understanding through keypoint-based supervision and voxel-wise augmentation with distance-decay, achieving SOTA results on ONCE and Waymo datasets.

Details

Motivation: Previous semi-supervised 3D object detection methods focus on pseudo-label quality or feature consistency but overlook the model's low sensitivity to object geometries with limited labeled data. This geometric information is crucial for object perception and localization, especially when leveraging unlabeled data.

Method: 1) Keypoint-based geometric relation supervision module transfers teacher model’s object geometry knowledge to student. 2) Voxel-wise data augmentation strategy increases object geometry diversity with distance-decay mechanism to preserve distant object integrity. 3) Framework can be combined with existing SS3D methods.

Result: Extensive experiments on ONCE and Waymo datasets show effectiveness and generalization. Achieves new state-of-the-art results. Method can be combined with different SS3D methods to further improve their performance.

Conclusion: GeoTeacher successfully addresses the geometric relation learning challenge in semi-supervised 3D object detection by transferring geometric knowledge from teacher to student and enhancing geometric diversity through smart augmentation, leading to improved object perception and localization capabilities.

Abstract: Semi-supervised 3D object detection, aiming to explore unlabeled data for boosting 3D object detectors, has emerged as an active research area in recent years. Some previous methods have shown substantial improvements by either employing heterogeneous teacher models to provide high-quality pseudo labels or enforcing feature-perspective consistency between the teacher and student networks. However, these methods overlook the fact that the model usually tends to exhibit low sensitivity to object geometries with limited labeled data, making it difficult to capture geometric information, which is crucial for enhancing the student model’s ability in object perception and localization. In this paper, we propose GeoTeacher to enhance the student model’s ability to capture geometric relations of objects with limited training data, especially unlabeled data. We design a keypoint-based geometric relation supervision module that transfers the teacher model’s knowledge of object geometry to the student, thereby improving the student’s capability in understanding geometric relations. Furthermore, we introduce a voxel-wise data augmentation strategy that increases the diversity of object geometries, thereby further improving the student model’s ability to comprehend geometric structures. To preserve the integrity of distant objects during augmentation, we incorporate a distance-decay mechanism into this strategy. Moreover, GeoTeacher can be combined with different SS3D methods to further improve their performance. Extensive experiments on the ONCE and Waymo datasets indicate the effectiveness and generalization of our method and we achieve the new state-of-the-art results. Code will be available at https://github.com/SII-Whaleice/GeoTeacher

[235] Holi-DETR: Holistic Fashion Item Detection Leveraging Contextual Information

Youngchae Kwon, Jinyoung Choi, Injung Kim

Main category: cs.CV

TL;DR: Holi-DETR: A transformer-based fashion item detector that uses three types of contextual information (co-occurrence, spatial arrangements, and body keypoints) to improve detection accuracy by addressing ambiguities in fashion items.

Details

Motivation: Fashion item detection is challenging due to diverse appearances and similarities among subcategories. Conventional detectors treat items independently, missing important contextual relationships that could help resolve ambiguities.

Method: Proposes Holi-DETR, a novel Detection Transformer architecture that integrates three types of contextual information: (1) co-occurrence relationships between fashion items, (2) relative position and size based on inter-item spatial arrangements, and (3) spatial relationships between items and human body keypoints.

Result: The proposed method improved vanilla DETR by 3.6 percentage points (pp) and Co-DETR by 1.1 pp in terms of average precision (AP).

Conclusion: Holi-DETR successfully addresses fashion item detection challenges by holistically leveraging contextual information, demonstrating significant performance improvements over existing transformer-based detectors.

Abstract: Fashion item detection is challenging due to the ambiguities introduced by the highly diverse appearances of fashion items and the similarities among item subcategories. To address this challenge, we propose a novel Holistic Detection Transformer (Holi-DETR) that detects fashion items in outfit images holistically, by leveraging contextual information. Fashion items often have meaningful relationships as they are combined to create specific styles. Unlike conventional detectors that detect each item independently, Holi-DETR detects multiple items while reducing ambiguities by leveraging three distinct types of contextual information: (1) the co-occurrence relationship between fashion items, (2) the relative position and size based on inter-item spatial arrangements, and (3) the spatial relationships between items and human body key-points. %Holi-DETR explicitly incorporates three types of contextual information: (1) the co-occurrence probability between fashion items, (2) the relative position and size based on inter-item spatial arrangements, and (3) the spatial relationships between items and human body key-points. To this end, we propose a novel architecture that integrates these three types of heterogeneous contextual information into the Detection Transformer (DETR) and its subsequent models. In experiments, the proposed methods improved the performance of the vanilla DETR and the more recently developed Co-DETR by 3.6 percent points (pp) and 1.1 pp, respectively, in terms of average precision (AP).

[236] REVEALER: Reinforcement-Guided Visual Reasoning for Element-Level Text-Image Alignment Evaluation

Fulin Shi, Wenyi Xiao, Bin Chen, Liang Din, Leilei Gan

Main category: cs.CV

TL;DR: REVEALER is a unified framework for fine-grained element-level alignment evaluation between text prompts and generated images using reinforcement-guided visual reasoning with MLLMs.

Details

Motivation: Existing evaluation methods for text-to-image alignment rely on coarse-grained metrics or static QA pipelines that lack fine-grained interpretability and struggle to reflect human preferences.

Method: Adopts a structured “grounding-reasoning-conclusion” paradigm using Multimodal Large Language Models to explicitly localize semantic elements and derive interpretable alignment judgments. Optimized via Group Relative Policy Optimization with composite reward function incorporating structural format, grounding accuracy, and alignment fidelity.

Result: Achieves state-of-the-art performance across four benchmarks (EvalMuse-40K, RichHF, MHaluBench, GenAI-Bench), consistently outperforming both strong proprietary models and supervised baselines while demonstrating superior inference efficiency compared to existing iterative visual reasoning methods.

Conclusion: REVEALER provides a unified, interpretable framework for fine-grained element-level alignment evaluation that effectively addresses limitations of existing methods and demonstrates superior performance and efficiency.

Abstract: Evaluating the alignment between textual prompts and generated images is critical for ensuring the reliability and usability of text-to-image (T2I) models. However, most existing evaluation methods rely on coarse-grained metrics or static QA pipelines, which lack fine-grained interpretability and struggle to reflect human preferences. To address this, we propose REVEALER, a unified framework for element-level alignment evaluation based on reinforcement-guided visual reasoning. Adopting a structured “grounding-reasoning-conclusion” paradigm, our method enables Multimodal Large Language Models (MLLMs) to explicitly localize semantic elements and derive interpretable alignment judgments. We optimize the model via Group Relative Policy Optimization(GRPO) using a composite reward function that incorporates structural format, grounding accuracy, and alignment fidelity. Extensive experiments across four benchmarks-EvalMuse-40K, RichHF, MHaluBench, and GenAI-Bench-demonstrate that REVEALER achieves state-of-the-art performance. Our approach consistently outperforms both strong proprietary models and supervised baselines while demonstrating superior inference efficiency compared to existing iterative visual reasoning methods.

[237] Anomaly Detection by Effectively Leveraging Synthetic Images

Sungho Kang, Hyunkyu Park, Yeonho Lee, Hanbyul Lee, Mijoo Jeong, YeongHyeon Park, Injae Lee, Juneho Yi

Main category: cs.CV

TL;DR: Proposes a framework using pre-trained text-guided image-to-image translation and image retrieval to efficiently generate realistic synthetic defect images for anomaly detection, with a two-stage training strategy to reduce costs while improving performance.

Details

Motivation: Anomaly detection in industrial manufacturing suffers from scarcity of real defect images. Existing synthesis approaches present a trade-off: rule-based methods are cost-effective but unrealistic, while generative models are high-quality but expensive. Need an efficient way to generate realistic synthetic defect images.

Method: Uses pre-trained text-guided image-to-image translation model to generate defect images, combined with image retrieval model to filter irrelevant outputs and enhance quality. Introduces two-stage training: pre-training on large volume of rule-based synthetic images, then fine-tuning on smaller set of high-quality generated images.

Result: Experiments on MVTec AD dataset demonstrate effectiveness of the approach. The method significantly reduces data collection costs while improving anomaly detection performance compared to previous synthesis strategies.

Conclusion: Proposed framework efficiently generates high-quality synthetic defect images by leveraging pre-trained models and image retrieval filtering, with two-stage training strategy that balances cost and performance for industrial anomaly detection.

Abstract: Anomaly detection plays a vital role in industrial manufacturing. Due to the scarcity of real defect images, unsupervised approaches that rely solely on normal images have been extensively studied. Recently, diffusion-based generative models brought attention to training data synthesis as an alternative solution. In this work, we focus on a strategy to effectively leverage synthetic images to maximize the anomaly detection performance. Previous synthesis strategies are broadly categorized into two groups, presenting a clear trade-off. Rule-based synthesis, such as injecting noise or pasting patches, is cost-effective but often fails to produce realistic defect images. On the other hand, generative model-based synthesis can create high-quality defect images but requires substantial cost. To address this problem, we propose a novel framework that leverages a pre-trained text-guided image-to-image translation model and image retrieval model to efficiently generate synthetic defect images. Specifically, the image retrieval model assesses the similarity of the generated images to real normal images and filters out irrelevant outputs, thereby enhancing the quality and relevance of the generated defect images. To effectively leverage synthetic images, we also introduce a two stage training strategy. In this strategy, the model is first pre-trained on a large volume of images from rule-based synthesis and then fine-tuned on a smaller set of high-quality images. This method significantly reduces the cost for data collection while improving the anomaly detection performance. Experiments on the MVTec AD dataset demonstrate the effectiveness of our approach.

[238] GVSynergy-Det: Synergistic Gaussian-Voxel Representations for Multi-View 3D Object Detection

Yi Zhang, Yi Wang, Lei Yao, Lap-Pui Chau

Main category: cs.CV

TL;DR: GVSynergy-Det is a novel image-based 3D object detection framework that synergistically combines Gaussian and voxel representations to achieve state-of-the-art performance without requiring depth sensors or dense 3D supervision.

Details

Motivation: Image-based 3D detection faces a trade-off: methods needing dense 3D supervision achieve high accuracy but are expensive, while unsupervised methods struggle with accurate geometry extraction from images alone. The authors aim to bridge this gap by leveraging complementary geometric representations.

Method: The framework uses a dual-representation architecture: 1) adapts generalizable Gaussian Splatting to extract fine-grained geometric features, and 2) develops a cross-representation enhancement mechanism that enriches voxel features with geometric details from Gaussian fields. Unlike previous approaches, it directly leverages features from both representations through learnable integration.

Result: GVSynergy-Det achieves state-of-the-art results on challenging indoor benchmarks, significantly outperforming existing methods on both ScanNetV2 and ARKitScenes datasets, all without requiring any depth or dense 3D geometry supervision.

Conclusion: The synergistic combination of continuous Gaussian and discrete voxel representations enables more accurate 3D object localization from RGB images alone, overcoming limitations of previous image-based approaches that either required dense supervision or struggled with geometry extraction.

Abstract: Image-based 3D object detection aims to identify and localize objects in 3D space using only RGB images, eliminating the need for expensive depth sensors required by point cloud-based methods. Existing image-based approaches face two critical challenges: methods achieving high accuracy typically require dense 3D supervision, while those operating without such supervision struggle to extract accurate geometry from images alone. In this paper, we present GVSynergy-Det, a novel framework that enhances 3D detection through synergistic Gaussian-Voxel representation learning. Our key insight is that continuous Gaussian and discrete voxel representations capture complementary geometric information: Gaussians excel at modeling fine-grained surface details while voxels provide structured spatial context. We introduce a dual-representation architecture that: 1) adapts generalizable Gaussian Splatting to extract complementary geometric features for detection tasks, and 2) develops a cross-representation enhancement mechanism that enriches voxel features with geometric details from Gaussian fields. Unlike previous methods that either rely on time-consuming per-scene optimization or utilize Gaussian representations solely for depth regularization, our synergistic strategy directly leverages features from both representations through learnable integration, enabling more accurate object localization. Extensive experiments demonstrate that GVSynergy-Det achieves state-of-the-art results on challenging indoor benchmarks, significantly outperforming existing methods on both ScanNetV2 and ARKitScenes datasets, all without requiring any depth or dense 3D geometry supervision (e.g., point clouds or TSDF).

[239] Physics-Inspired Modeling and Content Adaptive Routing in an Infrared Gas Leak Detection Network

Dongsheng Li, Chaobo Chen, Siling Wang, Song Gao

Main category: cs.CV

TL;DR: PEG-DRNet is a physics-edge hybrid network for infrared gas leak detection that combines gas transport modeling, edge-aware feature extraction, and adaptive routing to improve detection of faint, small gas plumes with weak boundaries.

Details

Motivation: Infrared gas leak detection is challenging because gas plumes are faint, small, semitransparent, and have weak, diffuse boundaries, making traditional detection methods ineffective.

Method: Three key components: 1) Gas Block with diffusion-convection modeling for local and global gas transport, 2) AGPEO edge operator with MSEPM for hierarchical edge features, 3) CASR-PAN with adaptive routing for selective feature propagation across scales based on edge and content cues.

Result: Achieves 29.8% overall AP, 84.3% AP50, and 25.3% small-object AP on IIG dataset, surpassing RT-DETR-R18 baseline by 3.0%, 6.5%, and 5.3% respectively, with only 43.7 Gflops and 14.9M parameters.

Conclusion: PEG-DRNet achieves superior performance with the best balance of accuracy and computational efficiency, outperforming existing CNN and Transformer detectors on both IIG and LangGas datasets.

Abstract: Detecting infrared gas leaks is critical for environmental monitoring and industrial safety, yet remains difficult because plumes are faint, small, semitransparent, and have weak, diffuse boundaries. We present physics-edge hybrid gas dynamic routing network (PEG-DRNet). First, we introduce the Gas Block, a diffusion-convection unit modeling gas transport: a local branch captures short-range variations, while a large-kernel branch captures long-range propagation. An edge-gated learnable fusion module balances local detail and global context, strengthening weak-contrast plume and contour cues. Second, we propose the adaptive gradient and phase edge operator (AGPEO), computing reliable edge priors from multi-directional gradients and phase-consistent responses. These are transformed by a multi-scale edge perception module (MSEPM) into hierarchical edge features that reinforce boundaries. Finally, the content-adaptive sparse routing path aggregation network (CASR-PAN), with adaptive information modulation modules for fusion and self, selectively propagates informative features across scales based on edge and content cues, improving cross-scale discriminability while reducing redundancy. Experiments on the IIG dataset show that PEG-DRNet achieves an overall AP of 29.8%, an AP${50}$ of 84.3%, and a small-object AP of 25.3%, surpassing the RT-DETR-R18 baseline by 3.0%, 6.5%, and 5.3%, respectively, while requiring only 43.7 Gflops and 14.9 M parameters. The proposed PEG-DRNet achieves superior overall performance with the best balance of accuracy and computational efficiency, outperforming existing CNN and Transformer detectors in AP and AP${50}$ on the IIG and LangGas dataset.

Tianchen Deng, Xuefeng Chen, Yi Chen, Qu Chen, Yuyao Xu, Lijin Yang, Le Xu, Yu Zhang, Bo Zhang, Wuxiong Huang, Hesheng Wang

Main category: cs.CV

TL;DR: A unified Driving World Model framework using 3D Gaussian scene representation that enables both 3D scene understanding and multi-modal generation through early modality alignment and language-guided sampling.

Details

Motivation: Existing Driving World Models lack 3D scene understanding capabilities and cannot interpret or reason about driving environments. Current approaches using point clouds or BEV features fail to accurately align textual information with 3D scenes, limiting their ability to understand and generate content meaningfully.

Method: Proposes a unified DWM framework based on 3D Gaussian scene representation that embeds linguistic features into each Gaussian primitive for early modality alignment. Introduces task-aware language-guided sampling to remove redundant Gaussians and inject compact 3D tokens into LLMs. Also designs a dual-condition multi-modal generation model combining high-level language conditions from vision-language models with low-level image conditions.

Result: Achieves state-of-the-art performance on nuScenes and NuInteract datasets, validating the framework’s effectiveness in 3D scene understanding and multi-modal generation tasks.

Conclusion: The proposed 3D Gaussian-based framework successfully addresses limitations of existing DWMs by enabling both scene understanding and generation through proper alignment of textual information with 3D scenes, offering a more comprehensive approach to driving environment modeling.

Abstract: Driving World Models (DWMs) have been developing rapidly with the advances of generative models. However, existing DWMs lack 3D scene understanding capabilities and can only generate content conditioned on input data, without the ability to interpret or reason about the driving environment. Moreover, current approaches represent 3D spatial information with point cloud or BEV features do not accurately align textual information with the underlying 3D scene. To address these limitations, we propose a novel unified DWM framework based on 3D Gaussian scene representation, which enables both 3D scene understanding and multi-modal scene generation, while also enabling contextual enrichment for understanding and generation tasks. Our approach directly aligns textual information with the 3D scene by embedding rich linguistic features into each Gaussian primitive, thereby achieving early modality alignment. In addition, we design a novel task-aware language-guided sampling strategy that removes redundant 3D Gaussians and injects accurate and compact 3D tokens into LLM. Furthermore, we design a dual-condition multi-modal generation model, where the information captured by our vision-language model is leveraged as a high-level language condition in combination with a low-level image condition, jointly guiding the multi-modal generation process. We conduct comprehensive studies on the nuScenes, and NuInteract datasets to validate the effectiveness of our framework. Our method achieves state-of-the-art performance. We will release the code publicly on GitHub https://github.com/dtc111111/GaussianDWM.

[241] ViLaCD-R1: A Vision-Language Framework for Semantic Change Detection in Remote Sensing

Xingwei Ma, Shiyang Feng, Bo Zhang, Bin Wang

Main category: cs.CV

TL;DR: ViLaCD-R1 is a two-stage VLM-based framework for remote sensing change detection that uses a Multi-Image Reasoner and Mask-Guided Decoder to improve semantic change recognition and localization while suppressing non-semantic variations.

Details

Motivation: Traditional pixel-based and encoder-decoder methods inadequately capture high-level semantics and are vulnerable to non-semantic perturbations. Recent multimodal/VLM approaches still suffer from inaccurate spatial localization, imprecise boundary delineation, and limited interpretability.

Method: Two-stage framework: 1) Multi-Image Reasoner (VLM trained via SFT and RL on block-level dual-temporal inference tasks) takes image patches and outputs coarse change mask; 2) Mask-Guided Decoder integrates dual-temporal features with coarse mask to predict precise binary change map.

Result: Comprehensive evaluations on multiple RSCD benchmarks show ViLaCD-R1 substantially improves true semantic change recognition and localization, robustly suppresses non-semantic variations, and achieves state-of-the-art accuracy in complex real-world scenarios.

Conclusion: ViLaCD-R1 effectively addresses limitations of existing methods by combining VLM-based reasoning with mask-guided refinement, providing superior performance in remote sensing change detection tasks.

Abstract: Remote sensing change detection (RSCD), a complex multi-image inference task, traditionally uses pixel-based operators or encoder-decoder networks that inadequately capture high-level semantics and are vulnerable to non-semantic perturbations. Although recent multimodal and vision-language model (VLM)-based approaches enhance semantic understanding of change regions by incorporating textual descriptions, they still suffer from challenges such as inaccurate spatial localization, imprecise pixel-level boundary delineation, and limited interpretability. To address these issues, we propose ViLaCD-R1, a two-stage framework comprising a Multi-Image Reasoner (MIR) and a Mask-Guided Decoder (MGD). Specifically, the VLM is trained through supervised fine-tuning (SFT) and reinforcement learning (RL) on block-level dual-temporal inference tasks, taking dual-temporal image patches as input and outputting a coarse change mask. Then, the decoder integrates dual-temporal image features with this coarse mask to predict a precise binary change map. Comprehensive evaluations on multiple RSCD benchmarks demonstrate that ViLaCD-R1 substantially improves true semantic change recognition and localization, robustly suppresses non-semantic variations, and achieves state-of-the-art accuracy in complex real-world scenarios.

[242] Task-oriented Learnable Diffusion Timesteps for Universal Few-shot Learning of Dense Tasks

Changgyoon Oh, Jongoh Jeong, Jegyeong Cho, Kuk-Jin Yoon

Main category: cs.CV

TL;DR: The paper proposes a method to adaptively select and consolidate diffusion timestep features for few-shot dense prediction tasks, addressing the suboptimal performance from heuristic timestep selection.

Details

Motivation: Current diffusion models use heuristic selection of diffusion timestep features for single-task prediction, which relies on empirical intuition and leads to sub-optimal performance biased toward certain tasks. The paper aims to address this constraint by investigating versatile diffusion timestep features.

Method: Proposes two modules: 1) Task-aware Timestep Selection (TTS) to select ideal diffusion timesteps based on timestep-wise losses and similarity scores, and 2) Timestep Feature Consolidation (TFC) to consolidate selected timestep features. Uses parameter-efficient fine-tuning adapter for few-shot dense prediction.

Result: The framework achieves superiority in dense prediction performance given only a few support queries. Empirically validated on the large-scale challenging Taskonomy dataset for dense prediction in practical universal and few-shot learning scenarios.

Conclusion: The proposed learnable timestep consolidation method effectively addresses the limitations of heuristic timestep selection in diffusion models for few-shot dense prediction tasks, demonstrating improved performance through adaptive timestep feature selection and consolidation.

Abstract: Denoising diffusion probabilistic models have brought tremendous advances in generative tasks, achieving state-of-the-art performance thus far. Current diffusion model-based applications exploit the power of learned visual representations from multistep forward-backward Markovian processes for single-task prediction tasks by attaching a task-specific decoder. However, the heuristic selection of diffusion timestep features still heavily relies on empirical intuition, often leading to sub-optimal performance biased towards certain tasks. To alleviate this constraint, we investigate the significance of versatile diffusion timestep features by adaptively selecting timesteps best suited for the few-shot dense prediction task, evaluated on an arbitrary unseen task. To this end, we propose two modules: Task-aware Timestep Selection (TTS) to select ideal diffusion timesteps based on timestep-wise losses and similarity scores, and Timestep Feature Consolidation (TFC) to consolidate the selected timestep features to improve the dense predictive performance in a few-shot setting. Accompanied by our parameter-efficient fine-tuning adapter, our framework effectively achieves superiority in dense prediction performance given only a few support queries. We empirically validate our learnable timestep consolidation method on the large-scale challenging Taskonomy dataset for dense prediction, particularly for practical universal and few-shot learning scenarios.

[243] MedGemma vs GPT-4: Open-Source and Proprietary Zero-shot Medical Disease Classification from Images

Md. Sazzadul Islam Prottasha, Nabil Walid Rafi

Main category: cs.CV

TL;DR: MedGemma fine-tuned with LoRA outperforms GPT-4 in medical image diagnosis, achieving 80.37% accuracy vs 69.58%, with better sensitivity for critical conditions like cancer and pneumonia.

Details

Motivation: To compare specialized open-source medical AI (MedGemma) vs proprietary multimodal LLMs (GPT-4) for medical image diagnosis, evaluating which approach better handles clinical implementation challenges like hallucinations.

Method: Comparative study between MedGemma-4b-it (fine-tuned using Low-Rank Adaptation) and GPT-4 for diagnosing six diseases. Used confusion matrices and classification reports for quantitative analysis across all disease categories.

Result: MedGemma achieved significantly higher mean test accuracy (80.37%) than GPT-4 (69.58%), with notably better sensitivity for high-stakes conditions like cancer and pneumonia detection.

Conclusion: Domain-specific fine-tuning is essential for minimizing hallucinations in clinical AI, positioning specialized models like MedGemma as superior tools for evidence-based medical reasoning compared to general multimodal LLMs.

Abstract: Multimodal Large Language Models (LLMs) introduce an emerging paradigm for medical imaging by interpreting scans through the lens of extensive clinical knowledge, offering a transformative approach to disease classification. This study presents a critical comparison between two fundamentally different AI architectures: the specialized open-source agent MedGemma and the proprietary large multimodal model GPT-4 for diagnosing six different diseases. The MedGemma-4b-it model, fine-tuned using Low-Rank Adaptation (LoRA), demonstrated superior diagnostic capability by achieving a mean test accuracy of 80.37% compared to 69.58% for the untuned GPT-4. Furthermore, MedGemma exhibited notably higher sensitivity in high-stakes clinical tasks, such as cancer and pneumonia detection. Quantitative analysis via confusion matrices and classification reports provides comprehensive insights into model performance across all categories. These results emphasize that domain-specific fine-tuning is essential for minimizing hallucinations in clinical implementation, positioning MedGemma as a sophisticated tool for complex, evidence-based medical reasoning.

[244] AVOID: The Adverse Visual Conditions Dataset with Obstacles for Driving Scene Understanding

Jongoh Jeong, Taek-Jin Song, Jong-Hwan Kim, Kuk-Jin Yoon

Main category: cs.CV

TL;DR: AVOID dataset for real-time obstacle detection under adverse conditions with semantic/depth maps, LiDAR data, and waypoints.

Details

Motivation: Existing datasets lack large-scale images with road obstacles captured under varying adverse conditions (weather, daylight) in the same visual domain as other classes, making it difficult to reliably detect unexpected small road hazards in real-time for self-driving cars.

Method: Introduce AVOID dataset collected in simulated environment with: 1) large set of unexpected road obstacles under various weather/time conditions, 2) semantic and depth maps, 3) raw and semantic LiDAR data, 4) waypoints. Benchmark on real-time networks and propose multi-task network for semantic segmentation, depth and waypoint prediction.

Result: New comprehensive dataset (AVOID) supports most visual perception tasks with multiple data modalities. Benchmarking results provided for obstacle detection, and ablation studies conducted for multi-task network.

Conclusion: AVOID dataset addresses limitations of existing datasets by providing diverse adverse condition data with multiple annotations, enabling better development of real-time obstacle detection systems for autonomous vehicles.

Abstract: Understanding road scenes for visual perception remains crucial for intelligent self-driving cars. In particular, it is desirable to detect unexpected small road hazards reliably in real-time, especially under varying adverse conditions (e.g., weather and daylight). However, existing road driving datasets provide large-scale images acquired in either normal or adverse scenarios only, and often do not contain the road obstacles captured in the same visual domain as for the other classes. To address this, we introduce a new dataset called AVOID, the Adverse Visual Conditions Dataset, for real-time obstacle detection collected in a simulated environment. AVOID consists of a large set of unexpected road obstacles located along each path captured under various weather and time conditions. Each image is coupled with the corresponding semantic and depth maps, raw and semantic LiDAR data, and waypoints, thereby supporting most visual perception tasks. We benchmark the results on high-performing real-time networks for the obstacle detection task, and also propose and conduct ablation studies using a comprehensive multi-task network for semantic segmentation, depth and waypoint prediction tasks.

[245] MM-UAVBench: How Well Do Multimodal Large Language Models See, Think, and Plan in Low-Altitude UAV Scenarios?

Shiqi Dai, Zizhi Ma, Zhicong Luo, Xuesong Yang, Yibin Huang, Wanyue Zhang, Chi Chen, Zonghao Guo, Wang Xu, Yufei Sun, Maosong Sun

Main category: cs.CV

TL;DR: MM-UAVBench: A comprehensive benchmark for evaluating Multimodal Large Language Models (MLLMs) in low-altitude UAV scenarios across perception, cognition, and planning capabilities.

Details

Motivation: Current MLLM benchmarks don't adequately cover low-altitude UAV scenarios, and existing UAV evaluations focus on specific tasks rather than assessing MLLMs' general intelligence in these complex environments.

Method: Created MM-UAVBench with 19 sub-tasks and over 5.7K manually annotated questions derived from real-world UAV data from public datasets. Evaluated 16 open-source and proprietary MLLMs across three core capability dimensions.

Result: Current MLLMs struggle to adapt to complex visual and cognitive demands of low-altitude scenarios. Analysis revealed critical bottlenecks including spatial bias and multi-view understanding limitations.

Conclusion: MM-UAVBench fills an important gap in evaluating MLLMs for UAV applications and will help foster research on robust and reliable MLLMs for real-world UAV intelligence.

Abstract: While Multimodal Large Language Models (MLLMs) have exhibited remarkable general intelligence across diverse domains, their potential in low-altitude applications dominated by Unmanned Aerial Vehicles (UAVs) remains largely underexplored. Existing MLLM benchmarks rarely cover the unique challenges of low-altitude scenarios, while UAV-related evaluations mainly focus on specific tasks such as localization or navigation, without a unified evaluation of MLLMs’general intelligence. To bridge this gap, we present MM-UAVBench, a comprehensive benchmark that systematically evaluates MLLMs across three core capability dimensions-perception, cognition, and planning-in low-altitude UAV scenarios. MM-UAVBench comprises 19 sub-tasks with over 5.7K manually annotated questions, all derived from real-world UAV data collected from public datasets. Extensive experiments on 16 open-source and proprietary MLLMs reveal that current models struggle to adapt to the complex visual and cognitive demands of low-altitude scenarios. Our analyses further uncover critical bottlenecks such as spatial bias and multi-view understanding that hinder the effective deployment of MLLMs in UAV scenarios. We hope MM-UAVBench will foster future research on robust and reliable MLLMs for real-world UAV intelligence.

[246] SURE Guided Posterior Sampling: Trajectory Correction for Diffusion-Based Inverse Problems

Minwoo Kim, Hongki Lim

Main category: cs.CV

TL;DR: SGPS uses SURE gradient updates and PCA noise estimation to correct diffusion sampling errors, enabling high-quality inverse problem reconstruction with <100 NFEs instead of hundreds/thousands.

Details

Motivation: Current diffusion-based inverse problem solving methods require hundreds/thousands of steps due to error accumulation from alternating diffusion sampling and data consistency steps, limiting practical efficiency.

Method: SURE Guided Posterior Sampling (SGPS) corrects sampling trajectory deviations using Stein’s Unbiased Risk Estimate (SURE) gradient updates and PCA-based noise estimation to mitigate noise-induced errors during early/middle sampling stages.

Result: SGPS maintains high reconstruction quality with fewer than 100 Neural Function Evaluations (NFEs), consistently outperforming existing methods at low NFE counts across diverse inverse problems.

Conclusion: SGPS enables more accurate posterior sampling with reduced error accumulation, making diffusion-based inverse problem solving more efficient and practical by significantly reducing computational requirements.

Abstract: Diffusion models have emerged as powerful learned priors for solving inverse problems. However, current iterative solving approaches which alternate between diffusion sampling and data consistency steps typically require hundreds or thousands of steps to achieve high quality reconstruction due to accumulated errors. We address this challenge with SURE Guided Posterior Sampling (SGPS), a method that corrects sampling trajectory deviations using Stein’s Unbiased Risk Estimate (SURE) gradient updates and PCA based noise estimation. By mitigating noise induced errors during the critical early and middle sampling stages, SGPS enables more accurate posterior sampling and reduces error accumulation. This allows our method to maintain high reconstruction quality with fewer than 100 Neural Function Evaluations (NFEs). Our extensive evaluation across diverse inverse problems demonstrates that SGPS consistently outperforms existing methods at low NFE counts.

[247] RS-Prune: Training-Free Data Pruning at High Ratios for Efficient Remote Sensing Diffusion Foundation Models

Fan Wei, Runmin Dong, Yushan Lai, Yixiang Yang, Zhaoyang Luo, Jinxiao Zhang, Miao Yang, Shuai Yuan, Jiyao Zhao, Bin Luo, Haohuan Fu

Main category: cs.CV

TL;DR: A training-free two-stage data pruning method for remote sensing diffusion foundation models that selects high-quality subsets under high pruning ratios, improving convergence and generation quality while maintaining diversity.

Details

Motivation: Current RS diffusion foundation models rely on large datasets with redundancy, noise, and class imbalance, which reduce training efficiency and prevent convergence. Existing approaches either aggregate multiple datasets or use simplistic deduplication, ignoring generation modeling requirements and RS imagery heterogeneity.

Method: Two-stage training-free data pruning: 1) Entropy-based criterion removes low-information samples, 2) Scene-aware clustering with stratified sampling using RS scene classification datasets as reference, balancing cluster-level uniformity and sample representativeness for fine-grained selection under high pruning ratios.

Result: Even after pruning 85% of training data, the method significantly improves convergence and generation quality. Models trained with this approach achieve state-of-the-art performance in downstream tasks like super-resolution and semantic image synthesis.

Conclusion: The proposed data pruning paradigm provides practical guidance for developing efficient RS generative foundation models that converge rapidly while maintaining versatility for generation, fine-tuning, and downstream applications.

Abstract: Diffusion-based remote sensing (RS) generative foundation models are cruial for downstream tasks. However, these models rely on large amounts of globally representative data, which often contain redundancy, noise, and class imbalance, reducing training efficiency and preventing convergence. Existing RS diffusion foundation models typically aggregate multiple classification datasets or apply simplistic deduplication, overlooking the distributional requirements of generation modeling and the heterogeneity of RS imagery. To address these limitations, we propose a training-free, two-stage data pruning approach that quickly select a high-quality subset under high pruning ratios, enabling a preliminary foundation model to converge rapidly and serve as a versatile backbone for generation, downstream fine-tuning, and other applications. Our method jointly considers local information content with global scene-level diversity and representativeness. First, an entropy-based criterion efficiently removes low-information samples. Next, leveraging RS scene classification datasets as reference benchmarks, we perform scene-aware clustering with stratified sampling to improve clustering effectiveness while reducing computational costs on large-scale unlabeled data. Finally, by balancing cluster-level uniformity and sample representativeness, the method enables fine-grained selection under high pruning ratios while preserving overall diversity and representativeness. Experiments show that, even after pruning 85% of the training data, our method significantly improves convergence and generation quality. Furthermore, diffusion foundation models trained with our method consistently achieve state-of-the-art performance across downstream tasks, including super-resolution and semantic image synthesis. This data pruning paradigm offers practical guidance for developing RS generative foundation models.

[248] Multimodal Interpretation of Remote Sensing Images: Dynamic Resolution Input Strategy and Multi-scale Vision-Language Alignment Mechanism

Siyu Zhang, Ying Chen, Lianlei Shan, Runhe Qiu

Main category: cs.CV

TL;DR: Proposes a Vision-language Model framework with Dynamic Resolution Input Strategy and Multi-scale Vision-language Alignment Mechanism for multimodal remote sensing fusion, improving both accuracy and computational efficiency.

Details

Motivation: To overcome limitations of single-source remote sensing data and address deficiencies in existing methods: fixed resolutions failing to balance efficiency/detail, and lack of semantic hierarchy in single-scale alignment.

Method: VLM framework with two innovations: 1) Dynamic Resolution Input Strategy (DRIS) using coarse-to-fine approach to adaptively allocate computational resources, 2) Multi-scale Vision-language Alignment Mechanism (MS-VLAM) with three-tier alignment covering object, local-region, and global levels.

Result: Significantly improves accuracy of semantic understanding and computational efficiency on RS-GPT4V dataset. Achieves superior performance in BLEU-4, CIDEr for image captioning, and R@10 for cross-modal retrieval compared to conventional methods.

Conclusion: Provides novel approach for constructing efficient and robust multimodal remote sensing systems, laying theoretical foundation and offering technical guidance for intelligent remote sensing interpretation engineering applications.

Abstract: Multimodal fusion of remote sensing images serves as a core technology for overcoming the limitations of single-source data and improving the accuracy of surface information extraction, which exhibits significant application value in fields such as environmental monitoring and urban planning. To address the deficiencies of existing methods, including the failure of fixed resolutions to balance efficiency and detail, as well as the lack of semantic hierarchy in single-scale alignment, this study proposes a Vision-language Model (VLM) framework integrated with two key innovations: the Dynamic Resolution Input Strategy (DRIS) and the Multi-scale Vision-language Alignment Mechanism (MS-VLAM).Specifically, the DRIS adopts a coarse-to-fine approach to adaptively allocate computational resources according to the complexity of image content, thereby preserving key fine-grained features while reducing redundant computational overhead. The MS-VLAM constructs a three-tier alignment mechanism covering object, local-region and global levels, which systematically captures cross-modal semantic consistency and alleviates issues of semantic misalignment and granularity imbalance.Experimental results on the RS-GPT4V dataset demonstrate that the proposed framework significantly improves the accuracy of semantic understanding and computational efficiency in tasks including image captioning and cross-modal retrieval. Compared with conventional methods, it achieves superior performance in evaluation metrics such as BLEU-4 and CIDEr for image captioning, as well as R@10 for cross-modal retrieval. This technical framework provides a novel approach for constructing efficient and robust multimodal remote sensing systems, laying a theoretical foundation and offering technical guidance for the engineering application of intelligent remote sensing interpretation.

[249] RAVEL: Rare Concept Generation and Editing via Graph-driven Relational Guidance

Kavana Venkatesh, Yusuf Dalva, Ismini Lourentzou, Pinar Yanardag

Main category: cs.CV

TL;DR: RAVEL is a training-free framework that improves text-to-image generation for rare/complex concepts using graph-based retrieval-augmented generation and self-correction, outperforming SOTA methods across multiple benchmarks.

Details

Motivation: Current T2I diffusion models struggle with rare, complex, or culturally nuanced concepts due to training data limitations, creating a need for better methods to handle long-tail domains.

Method: RAVEL integrates graph-based retrieval-augmented generation (RAG) into diffusion pipelines to retrieve compositional, symbolic, and relational context from knowledge graphs. It also includes SRD, a self-correction module that iteratively updates prompts via multi-aspect alignment feedback.

Result: RAVEL consistently outperforms state-of-the-art methods across three new benchmarks (MythoBench, Rare-Concept-1K, NovelBench) in perceptual, alignment, and LLM-as-a-Judge metrics.

Conclusion: RAVEL establishes a robust paradigm for controllable and interpretable T2I generation in long-tail domains, offering a model-agnostic framework compatible with leading diffusion models like Stable Diffusion XL, Flux, and DALL-E 3.

Abstract: Despite impressive visual fidelity, current text-to-image (T2I) diffusion models struggle to depict rare, complex, or culturally nuanced concepts due to training data limitations. We introduce RAVEL, a training-free framework that significantly improves rare concept generation, context-driven image editing, and self-correction by integrating graph-based retrieval-augmented generation (RAG) into diffusion pipelines. Unlike prior RAG and LLM-enhanced methods reliant on visual exemplars, static captions or pre-trained knowledge of models, RAVEL leverages structured knowledge graphs to retrieve compositional, symbolic, and relational context, enabling nuanced grounding even in the absence of visual priors. To further refine generation quality, we propose SRD, a novel self-correction module that iteratively updates prompts via multi-aspect alignment feedback, enhancing attribute accuracy, narrative coherence, and semantic fidelity. Our framework is model-agnostic and compatible with leading diffusion models including Stable Diffusion XL, Flux, and DALL-E 3. We conduct extensive evaluations across three newly proposed benchmarks - MythoBench, Rare-Concept-1K, and NovelBench. RAVEL also consistently outperforms SOTA methods across perceptual, alignment, and LLM-as-a-Judge metrics. These results position RAVEL as a robust paradigm for controllable and interpretable T2I generation in long-tail domains.

[250] ASemConsist: Adaptive Semantic Feature Control for Training-Free Identity-Consistent Generation

Shin seong Kim, Minjung Shin, Hyunin Cho, Youngjung Uh

Main category: cs.CV

TL;DR: ASemconsist is a novel framework for generating image sequences with consistent character identity while maintaining per-image prompt alignment, using selective text embedding modification and semantic control strategies.

Details

Motivation: Existing text-to-image diffusion models struggle to maintain consistent character identity across diverse scene descriptions while preserving per-image prompt alignment, creating a challenging trade-off.

Method: The framework uses selective text embedding modification for explicit semantic control over character identity, repurposes padding embeddings as semantic containers in FLUX, and employs adaptive feature-sharing that applies constraints only to ambiguous identity prompts.

Result: ASemconsist achieves state-of-the-art performance, effectively overcoming prior trade-offs between identity consistency and prompt alignment.

Conclusion: The paper introduces a comprehensive solution for character-consistent image sequence generation with a unified evaluation protocol (CQS) that captures performance imbalances between identity preservation and text alignment.

Abstract: Recent text-to-image diffusion models have significantly improved visual quality and text alignment. However, generating a sequence of images while preserving consistent character identity across diverse scene descriptions remains a challenging task. Existing methods often struggle with a trade-off between maintaining identity consistency and ensuring per-image prompt alignment. In this paper, we introduce a novel framework, ASemconsist, that addresses this challenge through selective text embedding modification, enabling explicit semantic control over character identity without sacrificing prompt alignment. Furthermore, based on our analysis of padding embeddings in FLUX, we propose a semantic control strategy that repurposes padding embeddings as semantic containers. Additionally, we introduce an adaptive feature-sharing strategy that automatically evaluates textual ambiguity and applies constraints only to the ambiguous identity prompt. Finally, we propose a unified evaluation protocol, the Consistency Quality Score (CQS), which integrates identity preservation and per-image text alignment into a single comprehensive metric, explicitly capturing performance imbalances between the two metrics. Our framework achieves state-of-the-art performance, effectively overcoming prior trade-offs. Project page: https://minjung-s.github.io/asemconsist

[251] SoulX-LiveTalk Technical Report

Le Shen, Qiao Qian, Tan Yu, Ke Zhou, Tianhang Yu, Yu Zhan, Zhenjie Wang, Ming Tao, Shunshun Yin, Siyuan Liu

Main category: cs.CV

TL;DR: SoulX-LiveTalk is a 14B-parameter framework for real-time, infinite-duration audio-driven avatar generation that achieves sub-second latency (0.87s) and 32 FPS throughput through bidirectional attention distillation and self-correction mechanisms.

Details

Motivation: Existing approaches for real-time audio-driven avatar generation compromise visual fidelity due to computational load vs. latency constraints, often using unidirectional attention or reduced model capacity.

Method: Uses Self-correcting Bidirectional Distillation to retain bidirectional attention within video chunks, Multi-step Retrospective Self-Correction Mechanism for stability during infinite generation, and full-stack inference acceleration with hybrid sequence parallelism, Parallel VAE, and kernel-level optimizations.

Result: First 14B-scale system to achieve sub-second start-up latency (0.87s) with real-time throughput of 32 FPS, setting new standard for high-fidelity interactive digital human synthesis.

Conclusion: SoulX-LiveTalk successfully addresses the engineering challenge of massive diffusion models for real-time avatar generation by balancing computational efficiency with visual fidelity through innovative bidirectional attention and self-correction mechanisms.

Abstract: Deploying massive diffusion models for real-time, infinite-duration, audio-driven avatar generation presents a significant engineering challenge, primarily due to the conflict between computational load and strict latency constraints. Existing approaches often compromise visual fidelity by enforcing strictly unidirectional attention mechanisms or reducing model capacity. To address this problem, we introduce \textbf{SoulX-LiveTalk}, a 14B-parameter framework optimized for high-fidelity real-time streaming. Diverging from conventional unidirectional paradigms, we use a \textbf{Self-correcting Bidirectional Distillation} strategy that retains bidirectional attention within video chunks. This design preserves critical spatiotemporal correlations, significantly enhancing motion coherence and visual detail. To ensure stability during infinite generation, we incorporate a \textbf{Multi-step Retrospective Self-Correction Mechanism}, enabling the model to autonomously recover from accumulated errors and preventing collapse. Furthermore, we engineered a full-stack inference acceleration suite incorporating hybrid sequence parallelism, Parallel VAE, and kernel-level optimizations. Extensive evaluations confirm that SoulX-LiveTalk is the first 14B-scale system to achieve a \textbf{sub-second start-up latency (0.87s)} while reaching a real-time throughput of \textbf{32 FPS}, setting a new standard for high-fidelity interactive digital human synthesis.

[252] Contour Information Aware 2D Gaussian Splatting for Image Representation

Masaya Takabe, Hiroshi Watanabe, Sujun Hong, Tomohiro Ikai, Zheming Fan, Ryo Ishimoto, Kakeru Sugimoto, Ruri Imichi

Main category: cs.CV

TL;DR: Contour-aware 2D Gaussian Splatting framework that uses object segmentation priors to preserve edge structures under high compression, preventing blurry boundaries when using few Gaussians.

Details

Motivation: Existing 2D Gaussian Splatting methods produce blurry or indistinct boundaries when using few Gaussians due to lack of contour awareness, limiting their effectiveness for compact image representation.

Method: Proposes a Contour Information-Aware 2D Gaussian Splatting framework that incorporates object segmentation priors, constrains each Gaussian to specific segmentation regions during rasterization to prevent cross-boundary blending, and introduces a warm-up scheme for training stability.

Result: Achieves higher reconstruction quality around object edges compared to existing 2DGS methods, particularly evident with very few Gaussians, while maintaining fast rendering and low memory usage on synthetic color charts and DAVIS dataset.

Conclusion: The proposed contour-aware framework effectively addresses boundary blurring in 2D Gaussian Splatting, enabling high-quality edge preservation under extreme compression while retaining the efficiency advantages of Gaussian-based image representation.

Abstract: Image representation is a fundamental task in computer vision. Recently, Gaussian Splatting has emerged as an efficient representation framework, and its extension to 2D image representation enables lightweight, yet expressive modeling of visual content. While recent 2D Gaussian Splatting (2DGS) approaches provide compact storage and real-time decoding, they often produce blurry or indistinct boundaries when the number of Gaussians is small due to the lack of contour awareness. In this work, we propose a Contour Information-Aware 2D Gaussian Splatting framework that incorporates object segmentation priors into Gaussian-based image representation. By constraining each Gaussian to a specific segmentation region during rasterization, our method prevents cross-boundary blending and preserves edge structures under high compression. We also introduce a warm-up scheme to stabilize training and improve convergence. Experiments on synthetic color charts and the DAVIS dataset demonstrate that our approach achieves higher reconstruction quality around object edges compared to existing 2DGS methods. The improvement is particularly evident in scenarios with very few Gaussians, while our method still maintains fast rendering and low memory usage.

[253] ICONS: Influence Consensus for Vision-Language Data Selection

Xindi Wu, Mengzhou Xia, Rulin Shao, Zhiwei Deng, Pang Wei Koh, Olga Russakovsky

Main category: cs.CV

TL;DR: ICONS is a gradient-based data selection method for vision-language instruction tuning that identifies consistently valuable examples across tasks using influence consensus, achieving near-full performance with only 20% of data.

Details

Motivation: Current vision-language instruction tuning uses large data mixtures with redundant information, increasing computational costs without proportional gains. Existing task-agnostic heuristics for data selection are ineffective across diverse tasks.

Method: ICONS uses first-order training dynamics to estimate each example’s influence on validation performance, then aggregates these influence estimates across tasks via majority voting to identify consistently valuable data points while mitigating score calibration and outlier sensitivity.

Result: Models trained on ICONS-selected 20% subsets achieve 98.6% (LLaVA-665K), 98.8% (CAMBRIAN-7M), and 99.8% (VISION-FLAN-186K) of full-dataset performance. The method generalizes to unseen tasks and model architectures.

Conclusion: ICONS enables robust and scalable data selection for diverse multitask mixtures, releasing three compact subsets (LLaVA-ICONS-133K, CAMBRIAN-ICONS-1.4M, VISION-FLAN-ICONS-37K) for efficient vision-language model development.

Abstract: Training vision-language models via instruction tuning relies on large data mixtures spanning diverse tasks and domains, yet these mixtures frequently include redundant information that increases computational costs without proportional gains. Existing methods typically rely on task-agnostic heuristics to estimate data importance, limiting their effectiveness across tasks. We introduce ICONS, a gradient-based Influence CONsensus approach for vision-language data Selection. Our method leverages first-order training dynamics to estimate each example’s influence on validation performance, then aggregates these estimates across tasks via majority voting. This cross-task consensus identifies consistently valuable data points while mitigating score calibration and outlier sensitivity, enabling robust and scalable data selection for diverse multitask mixtures. Models trained on our selected 20% data subset from LLAVA-665K (respectively: from CAMBRIAN-7M, from VISION-FLAN-186K) retain 98.6% (respectively: 98.8%, 99.8%) of full-dataset performance. We demonstrate that our selected data generalizes to unseen tasks and model architectures, and release three compact subsets LLAVA-ICONS-133K, CAMBRIAN-ICONS-1.4M, and VISION-FLAN-ICONS-37K for efficient vision-language model development.

[254] Plug-and-Play Fidelity Optimization for Diffusion Transformer Acceleration via Cumulative Error Minimization

Tong Shao, Yusen Fu, Guoying Sun, Jingde Kong, Zhuotao Tian, Jingyong Su

Main category: cs.CV

TL;DR: CEM is a fidelity-optimization plugin that uses cumulative error minimization to dynamically optimize caching strategies for Diffusion Transformers, improving generation fidelity without extra computational cost.

Details

Motivation: Diffusion Transformers (DiT) have slow inference due to iterative denoising. Existing caching-based acceleration methods suffer from computational errors, and their fixed caching strategies can't adapt to complex error variations during denoising.

Method: Proposes CEM (cumulative error minimization) plugin that: 1) predefines error to characterize model sensitivity to acceleration based on timesteps and cache intervals, 2) uses dynamic programming with cumulative error approximation to optimize caching strategies, minimizing caching error.

Result: CEM significantly improves generation fidelity of existing acceleration models across 9 generation models and quantized methods in 3 tasks. Outperforms original generation performance on FLUX.1-dev, PixArt-α, StableDiffusion1.5 and Hunyuan.

Conclusion: CEM is a model-agnostic, training-free plugin that dynamically optimizes caching strategies via cumulative error minimization, achieving better fidelity than fixed caching approaches without additional computational overhead.

Abstract: Although Diffusion Transformer (DiT) has emerged as a predominant architecture for image and video generation, its iterative denoising process results in slow inference, which hinders broader applicability and development. Caching-based methods achieve training-free acceleration, while suffering from considerable computational error. Existing methods typically incorporate error correction strategies such as pruning or prediction to mitigate it. However, their fixed caching strategy fails to adapt to the complex error variations during denoising, which limits the full potential of error correction. To tackle this challenge, we propose a novel fidelity-optimization plugin for existing error correction methods via cumulative error minimization, named CEM. CEM predefines the error to characterize the sensitivity of model to acceleration jointly influenced by timesteps and cache intervals. Guided by this prior, we formulate a dynamic programming algorithm with cumulative error approximation for strategy optimization, which achieves the caching error minimization, resulting in a substantial improvement in generation fidelity. CEM is model-agnostic and exhibits strong generalization, which is adaptable to arbitrary acceleration budgets. It can be seamlessly integrated into existing error correction frameworks and quantized models without introducing any additional computational overhead. Extensive experiments conducted on nine generation models and quantized methods across three tasks demonstrate that CEM significantly improves generation fidelity of existing acceleration models, and outperforms the original generation performance on FLUX.1-dev, PixArt-$α$, StableDiffusion1.5 and Hunyuan. The code will be made publicly available.

[255] YOLO-Master: MOE-Accelerated with Specialized Transformers for Enhanced Real-time Detection

Xu Lin, Jinlong Peng, Zhenye Gan, Jiawen Zhu, Jun Liu

Main category: cs.CV

TL;DR: YOLO-Master introduces instance-conditional adaptive computation for real-time object detection using Efficient Sparse Mixture-of-Experts to dynamically allocate resources based on scene complexity.

Details

Motivation: Current YOLO-like RTOD models use static dense computation that applies uniform processing to all inputs, causing computational redundancy on trivial scenes and suboptimal performance on complex ones due to misallocated resources.

Method: Proposes YOLO-Master framework with Efficient Sparse Mixture-of-Experts (ES-MoE) block that dynamically allocates computational resources per input. Uses lightweight dynamic routing network with diversity-enhancing objective to encourage complementary expert specialization, activating only relevant experts during inference.

Result: Achieves 42.4% AP with 1.62ms latency on MS COCO, outperforming YOLOv13-N by +0.8% mAP with 17.8% faster inference. Gains are most significant on challenging dense scenes while maintaining efficiency on typical inputs and real-time speed.

Conclusion: YOLO-Master successfully addresses the limitations of static computation in RTOD by introducing adaptive resource allocation, achieving superior accuracy-speed trade-off, especially for complex scenes, while preserving real-time performance.

Abstract: Existing Real-Time Object Detection (RTOD) methods commonly adopt YOLO-like architectures for their favorable trade-off between accuracy and speed. However, these models rely on static dense computation that applies uniform processing to all inputs, misallocating representational capacity and computational resources such as over-allocating on trivial scenes while under-serving complex ones. This mismatch results in both computational redundancy and suboptimal detection performance. To overcome this limitation, we propose YOLO-Master, a novel YOLO-like framework that introduces instance-conditional adaptive computation for RTOD. This is achieved through a Efficient Sparse Mixture-of-Experts (ES-MoE) block that dynamically allocates computational resources to each input according to its scene complexity. At its core, a lightweight dynamic routing network guides expert specialization during training through a diversity enhancing objective, encouraging complementary expertise among experts. Additionally, the routing network adaptively learns to activate only the most relevant experts, thereby improving detection performance while minimizing computational overhead during inference. Comprehensive experiments on five large-scale benchmarks demonstrate the superiority of YOLO-Master. On MS COCO, our model achieves 42.4% AP with 1.62ms latency, outperforming YOLOv13-N by +0.8% mAP and 17.8% faster inference. Notably, the gains are most pronounced on challenging dense scenes, while the model preserves efficiency on typical inputs and maintains real-time inference speed. Code will be available.

[256] Multi-Track Multimodal Learning on iMiGUE: Micro-Gesture and Emotion Recognition

Arman Martirosyan, Shahane Tigranyan, Maria Razzhivina, Artak Aslanyan, Nazgul Salikhova, Ilya Makarov, Andrey Savchenko, Aram Avetisyan

Main category: cs.CV

TL;DR: This paper presents multimodal frameworks for micro-gesture recognition and behavior-based emotion prediction using RGB video and skeletal pose data, achieving 2nd place in the MiGA 2025 Challenge for emotion prediction.

Details

Motivation: Micro-gesture recognition and behavior-based emotion prediction are challenging tasks requiring modeling of subtle, fine-grained human behaviors from video and skeletal data. The iMiGUE dataset provides a benchmark for these tasks.

Method: Two multimodal frameworks: 1) For micro-gesture classification, uses MViTv2-S for video embeddings and 2s-AGCN for skeletal embeddings, fused via Cross-Modal Token Fusion. 2) For emotion recognition, uses SwinFace for facial embeddings and MViTv2-S for contextual embeddings, fused via InterFusion module.

Result: The approach demonstrated robust performance on the iMiGUE dataset, achieving 2nd place in the behavior-based emotion prediction task of the MiGA 2025 Challenge.

Conclusion: Multimodal fusion of complementary visual modalities (RGB video, skeletal pose, facial features) effectively captures subtle human behaviors for both micro-gesture recognition and emotion prediction tasks.

Abstract: Micro-gesture recognition and behavior-based emotion prediction are both highly challenging tasks that require modeling subtle, fine-grained human behaviors, primarily leveraging video and skeletal pose data. In this work, we present two multimodal frameworks designed to tackle both problems on the iMiGUE dataset. For micro-gesture classification, we explore the complementary strengths of RGB and 3D pose-based representations to capture nuanced spatio-temporal patterns. To comprehensively represent gestures, video, and skeletal embeddings are extracted using MViTv2-S and 2s-AGCN, respectively. Then, they are integrated through a Cross-Modal Token Fusion module to combine spatial and pose information. For emotion recognition, our framework extends to behavior-based emotion prediction, a binary classification task identifying emotional states based on visual cues. We leverage facial and contextual embeddings extracted using SwinFace and MViTv2-S models and fuse them through an InterFusion module designed to capture emotional expressions and body gestures. Experiments conducted on the iMiGUE dataset, within the scope of the MiGA 2025 Challenge, demonstrate the robust performance and accuracy of our method in the behavior-based emotion prediction task, where our approach secured 2nd place.

[257] RefAV: Towards Planning-Centric Scenario Mining

Cainan Davidson, Deva Ramanan, Neehar Peri

Main category: cs.CV

TL;DR: RefAV introduces a vision-language approach for mining safety-critical driving scenarios from AV logs using natural language queries, with a dataset of 10K queries from Argoverse 2.

Details

Motivation: Traditional scenario mining from uncurated AV driving logs is error-prone and time-consuming, relying on hand-crafted queries. There's a need for more efficient methods to identify interesting and safety-critical scenarios using natural language descriptions.

Method: Revisits spatio-temporal scenario mining using vision-language models (VLMs) to detect described scenarios and localize them in time and space. Introduces RefAV dataset with 10,000 diverse natural language queries describing complex multi-agent interactions from Argoverse 2 Sensor dataset.

Result: Naively repurposing off-the-shelf VLMs yields poor performance, suggesting scenario mining presents unique challenges. The paper evaluates several referential multi-object trackers and presents empirical analysis of baselines.

Conclusion: Vision-language models show promise for scenario mining but require specialized approaches. The RefAV dataset and competition provide benchmarks for future research in natural language-based scenario detection for autonomous vehicles.

Abstract: Autonomous Vehicles (AVs) collect and pseudo-label terabytes of multi-modal data localized to HD maps during normal fleet testing. However, identifying interesting and safety-critical scenarios from uncurated driving logs remains a significant challenge. Traditional scenario mining techniques are error-prone and prohibitively time-consuming, often relying on hand-crafted structured queries. In this work, we revisit spatio-temporal scenario mining through the lens of recent vision-language models (VLMs) to detect whether a described scenario occurs in a driving log and, if so, precisely localize it in both time and space. To address this problem, we introduce RefAV, a large-scale dataset of 10,000 diverse natural language queries that describe complex multi-agent interactions relevant to motion planning derived from 1000 driving logs in the Argoverse 2 Sensor dataset. We evaluate several referential multi-object trackers and present an empirical analysis of our baselines. Notably, we find that naively repurposing off-the-shelf VLMs yields poor performance, suggesting that scenario mining presents unique challenges. Lastly, we discuss our recently held competition and share insights from the community. Our code and dataset are available at https://github.com/CainanD/RefAV/ and https://argoverse.github.io/user-guide/tasks/scenario_mining.html

[258] Fuzzy-Logic and Deep Learning for Environmental Condition-Aware Road Surface Classification

Mustafa Demetgul, Sanja Lazarova Molnar

Main category: cs.CV

TL;DR: Real-time road surface monitoring system using mobile phone camera data and deep learning achieves over 95% accuracy in classifying 5 road condition types, with additional fuzzy logic for weather/time classification.

Details

Motivation: Classical road monitoring methods are expensive and unsystematic, requiring time for measurements. There's a need for real-time, cost-effective road surface monitoring to support vehicle planning and active control systems.

Method: Collected data using mobile phone camera on roads around KIT campus. Tested multiple deep learning algorithms (AlexNet, LeNet, VGG, ResNet) for road classification from images. Also used road acceleration data converted to images. Compared acceleration-based vs camera image-based approaches. Proposed fuzzy logic for weather/time classification.

Result: Achieved over 95% accuracy for classifying 5 road condition classes: asphalt, damaged asphalt, gravel road, damaged gravel road, pavement road. Compared performance of different deep learning architectures.

Conclusion: Proposed real-time system effectively monitors road surfaces using mobile phone data and deep learning, with high accuracy. Additional fuzzy logic approach can classify road surfaces based on weather and time of day.

Abstract: Monitoring states of road surfaces provides valuable information for the planning and controlling vehicles and active vehicle control systems. Classical road monitoring methods are expensive and unsystematic because they require time for measurements. This article proposes an real time system based on weather conditional data and road surface condition data. For this purpose, we collected data with a mobile phone camera on the roads around the campus of the Karlsruhe Institute of Technology. We tested a large number of different image-based deep learning algorithms for road classification. In addition, we used road acceleration data along with road image data for training by using them as images. We compared the performances of acceleration-based and camera image-based approaches. The performances of the simple Alexnet, LeNet, VGG, and Resnet algorithms were compared as deep learning algorithms. For road condition classification, 5 classes were considered: asphalt, damaged asphalt, gravel road, damaged gravel road, pavement road and over 95% accuracy performance was achieved. It is also proposed to use the acceleration or the camera image to classify the road surface according to the weather and the time of day using fuzzy logic.

[259] CME-CAD: Heterogeneous Collaborative Multi-Expert Reinforcement Learning for CAD Code Generation

Ke Niu, Haiyang Yu, Zhuofan Chen, Zhengtao Yao, Weitao Jia, Xiaodong Ge, Jingqun Tang, Benlei Cui, Bin Li, Xiangyang Xue

Main category: cs.CV

TL;DR: CME-CAD is a novel training paradigm using heterogeneous collaborative multi-expert reinforcement learning to generate high-precision, editable CAD models from orthographic projections, addressing limitations of existing sketch-to-CAD methods.

Details

Motivation: Traditional CAD modeling is complex and hard to automate. Existing sketch-to-CAD methods produce non-editable, approximate models that lack the precision and editability required for industrial design. Text/image-based methods need manual annotation, limiting scalability in industrial settings.

Method: Proposes Heterogeneous Collaborative Multi-Expert Reinforcement Learning (CME-CAD) paradigm with two-stage training: 1) Multi-Expert Fine-Tuning (MEFT), and 2) Multi-Expert Reinforcement Learning (MERL). Integrates complementary strengths of different models for collaborative learning. Also introduces CADExpert benchmark with 17,299 instances including orthographic projections, dimension annotations, CoT processes, CADQuery code, and 3D models.

Result: The approach improves generation of accurate, constraint-compatible, and fully editable CAD models. The CADExpert benchmark provides comprehensive resources for training and evaluation.

Conclusion: CME-CAD addresses key challenges in automated CAD generation by combining collaborative learning with reinforcement learning, enabling production of industrial-grade editable CAD models while reducing manual annotation requirements.

Abstract: Computer-Aided Design (CAD) is essential in industrial design, but the complexity of traditional CAD modeling and workflows presents significant challenges for automating the generation of high-precision, editable CAD models. Existing methods that reconstruct 3D models from sketches often produce non-editable and approximate models that fall short of meeting the stringent requirements for precision and editability in industrial design. Moreover, the reliance on text or image-based inputs often requires significant manual annotation, limiting their scalability and applicability in industrial settings. To overcome these challenges, we propose the Heterogeneous Collaborative Multi-Expert Reinforcement Learning (CME-CAD) paradigm, a novel training paradigm for CAD code generation. Our approach integrates the complementary strengths of these models, facilitating collaborative learning and improving the model’s ability to generate accurate, constraint-compatible, and fully editable CAD models. We introduce a two-stage training process: Multi-Expert Fine-Tuning (MEFT), and Multi-Expert Reinforcement Learning (MERL). Additionally, we present CADExpert, an open-source benchmark consisting of 17,299 instances, including orthographic projections with precise dimension annotations, expert-generated Chain-of-Thought (CoT) processes, executable CADQuery code, and rendered 3D models.

[260] CoFi-Dec: Hallucination-Resistant Decoding via Coarse-to-Fine Generative Feedback in Large Vision-Language Models

Zongsheng Cao, Yangfan He, Anran Liu, Jun Xie, Feng Chen, Zepeng Wang

Main category: cs.CV

TL;DR: CoFi-Dec is a training-free decoding framework that reduces hallucinations in Large Vision-Language Models by using coarse-to-fine visual conditioning and generative self-feedback with Wasserstein-based fusion.

Details

Motivation: Large Vision-Language Models tend to produce hallucinated content inconsistent with visual inputs, limiting their reliability in real-world applications.

Method: Generates two intermediate textual responses conditioned on coarse- and fine-grained views of images, transforms them into synthetic images using text-to-image models, then uses Wasserstein-based fusion to align predictive distributions into consistent decoding trajectories.

Result: Substantially reduces both entity-level and semantic-level hallucinations across six hallucination-focused benchmarks, outperforming existing decoding strategies.

Conclusion: CoFi-Dec is an effective, model-agnostic framework that requires no additional training and can be seamlessly applied to various LVLMs to improve reliability.

Abstract: Large Vision-Language Models (LVLMs) have achieved impressive progress in multi-modal understanding and generation. However, they still tend to produce hallucinated content that is inconsistent with the visual input, which limits their reliability in real-world applications. We propose \textbf{CoFi-Dec}, a training-free decoding framework that mitigates hallucinations by integrating generative self-feedback with coarse-to-fine visual conditioning. Inspired by the human visual process from global scene perception to detailed inspection, CoFi-Dec first generates two intermediate textual responses conditioned on coarse- and fine-grained views of the original image. These responses are then transformed into synthetic images using a text-to-image model, forming multi-level visual hypotheses that enrich grounding cues. To unify the predictions from these multiple visual conditions, we introduce a Wasserstein-based fusion mechanism that aligns their predictive distributions into a geometrically consistent decoding trajectory. This principled fusion reconciles high-level semantic consistency with fine-grained visual grounding, leading to more robust and faithful outputs. Extensive experiments on six hallucination-focused benchmarks show that CoFi-Dec substantially reduces both entity-level and semantic-level hallucinations, outperforming existing decoding strategies. The framework is model-agnostic, requires no additional training, and can be seamlessly applied to a wide range of LVLMs. The implementation is available at https://github.com/AI-Researcher-Team/CoFi-Dec.

[261] Visual Language Hypothesis

Xiu Li

Main category: cs.CV

TL;DR: Visual representation learning requires semantic abstraction via non-homeomorphic transformations, not just smooth deformations, with models needing “expand-and-snap” topology changes to form discrete semantic states.

Details

Motivation: To understand visual representation learning from a structural/topological perspective, starting from the hypothesis that visual understanding requires a semantic language where many perceptual observations map to few discrete semantic states.

Method: Theoretical analysis using fiber bundle structures: visual space organized as fibers (nuisance variation) over quotient base space (semantics). Derives two consequences: 1) semantic quotient requires non-homeomorphic discriminative targets (labels, cross-instance identification, multimodal alignment), 2) models need “expand-and-snap” mechanism for topology change.

Result: Shows semantic invariance cannot be achieved through smooth deformation alone; requires explicit semantic equivalence. Model architectures must support topology changes to approximate semantic quotient structure.

Conclusion: Provides topological framework aligning with empirical regularities in large-scale discriminative/multimodal models and classical statistical learning principles. Results are interpretive, offering structural understanding of why certain learning approaches work.

Abstract: We study visual representation learning from a structural and topological perspective. We begin from a single hypothesis: that visual understanding presupposes a semantic language for vision, in which many perceptual observations correspond to a small number of discrete semantic states. Together with widely assumed premises on transferability and abstraction in representation learning, this hypothesis implies that the visual observation space must be organized in a fiber bundle like structure, where nuisance variation populates fibers and semantics correspond to a quotient base space. From this structure we derive two theoretical consequences. First, the semantic quotient $X/G$ is not a submanifold of $X$ and cannot be obtained through smooth deformation alone, semantic invariance requires a non-homeomorphic, discriminative target, for example, supervision via labels, cross instance identification, or multimodal alignment that supplies explicit semantic equivalence. Second, we show that approximating the quotient also places structural demands on the model architecture. Semantic abstraction requires not only an external semantic target, but a representation mechanism capable of supporting topology change: an expand-and-snap process in which the manifold is first geometrically expanded to separate structure and then collapsed to form discrete semantic regions. We emphasize that these results are interpretive rather than prescriptive: the framework provides a topological lens that aligns with empirical regularities observed in large-scale discriminative and multimodal models, and with classical principles in statistical learning theory.

[262] HY-Motion 1.0: Scaling Flow Matching Models for Text-To-Motion Generation

Yuxin Wen, Qing Shuai, Di Kang, Jing Li, Cheng Wen, Yue Qian, Ningxin Jiao, Changhai Chen, Weijie Chen, Yiran Wang, Jinkun Guo, Dongyue An, Han Liu, Yanyu Tong, Chao Zhang, Qing Guo, Juan Chen, Qiao Zhang, Youyi Zhang, Zihao Yao, Cheng Zhang, Hong Duan, Xiaoping Wu, Qi Chen, Fei Cheng, Liang Dong, Peng He, Hao Zhang, Jiaxin Lin, Chao Zhang, Zhongyi Fan, Yifan Li, Zhichao Hu, Yuhong Liu, Linus, Jie Jiang, Xiaolong Li, Linchao Bao

Main category: cs.CV

TL;DR: HY-Motion 1.0 is a billion-parameter Diffusion Transformer model for generating 3D human motions from text descriptions, achieving state-of-the-art performance through large-scale pretraining, fine-tuning, and reinforcement learning.

Details

Motivation: To advance 3D human motion generation by scaling up Diffusion Transformer models to billion-parameter scale and achieving superior instruction-following capabilities compared to existing open-source benchmarks.

Method: Full-stage training paradigm: 1) Large-scale pretraining on 3,000+ hours of motion data, 2) High-quality fine-tuning on 400 hours of curated data, 3) Reinforcement learning from human feedback and reward models, supported by rigorous data processing pipeline for motion cleaning and captioning.

Result: Achieves state-of-the-art performance with extensive coverage of over 200 motion categories across 6 major classes, significantly outperforming current open-source benchmarks in instruction-following capabilities.

Conclusion: HY-Motion 1.0 represents a major advancement in 3D human motion generation, successfully scaling DiT-based models to billion parameters and establishing a comprehensive training framework that enables precise text-motion alignment and high-quality motion generation.

Abstract: We present HY-Motion 1.0, a series of state-of-the-art, large-scale, motion generation models capable of generating 3D human motions from textual descriptions. HY-Motion 1.0 represents the first successful attempt to scale up Diffusion Transformer (DiT)-based flow matching models to the billion-parameter scale within the motion generation domain, delivering instruction-following capabilities that significantly outperform current open-source benchmarks. Uniquely, we introduce a comprehensive, full-stage training paradigm – including large-scale pretraining on over 3,000 hours of motion data, high-quality fine-tuning on 400 hours of curated data, and reinforcement learning from both human feedback and reward models – to ensure precise alignment with the text instruction and high motion quality. This framework is supported by our meticulous data processing pipeline, which performs rigorous motion cleaning and captioning. Consequently, our model achieves the most extensive coverage, spanning over 200 motion categories across 6 major classes. We release HY-Motion 1.0 to the open-source community to foster future research and accelerate the transition of 3D human motion generation models towards commercial maturity.

[263] CountGD++: Generalized Prompting for Open-World Counting

Niki Amini-Naieni, Andrew Zisserman

Main category: cs.CV

TL;DR: CountGD++ introduces novel prompt flexibility for object counting by enabling negative examples, automating visual annotations, and using synthetic images, leading to improved accuracy and generalization.

Details

Motivation: Existing object counting methods have limitations: they require manual annotation of visual examples, cannot specify what NOT to count, and lack flexibility in how target objects can be specified.

Method: Extends counting prompts to include negative examples via text/visual descriptions, introduces ‘pseudo-exemplars’ for automated annotation, accepts visual examples from both natural and synthetic images, and integrates CountGD++ as a vision expert agent for LLMs.

Result: Significant improvements in accuracy, efficiency, and generalization across multiple datasets, with expanded prompt flexibility for multi-modal open-world counting.

Conclusion: The proposed CountGD++ framework addresses key limitations in object counting by enabling more flexible prompt specifications, automated annotations, and integration with LLMs, advancing multi-modal open-world counting capabilities.

Abstract: The flexibility and accuracy of methods for automatically counting objects in images and videos are limited by the way the object can be specified. While existing methods allow users to describe the target object with text and visual examples, the visual examples must be manually annotated inside the image, and there is no way to specify what not to count. To address these gaps, we introduce novel capabilities that expand how the target object can be specified. Specifically, we extend the prompt to enable what not to count to be described with text and/or visual examples, introduce the concept of `pseudo-exemplars’ that automate the annotation of visual examples at inference, and extend counting models to accept visual examples from both natural and synthetic external images. We also use our new counting model, CountGD++, as a vision expert agent for an LLM. Together, these contributions expand the prompt flexibility of multi-modal open-world counting and lead to significant improvements in accuracy, efficiency, and generalization across multiple datasets. Code is available at https://github.com/niki-amini-naieni/CountGDPlusPlus.

[264] SpatialMosaic: A Multiview VLM Dataset for Partial Visibility

Kanghee Lee, Injae Lee, Minseok Kwak, Kwonyoung Ryu, Jungi Hong, Jaesik Park

Main category: cs.CV

TL;DR: SpatialMosaic: A scalable pipeline for generating 2M multi-view spatial reasoning QA pairs and a benchmark for evaluating VLMs on challenging real-world 3D scene understanding tasks.

Details

Motivation: Existing MLLM approaches for 3D scene understanding rely on pre-constructed 3D representations or reconstruction pipelines, limiting scalability and real-world applicability. Current methods learning spatial reasoning directly from multi-view images don't adequately address real-world challenges like partial visibility, occlusion, and low-overlap conditions that require reasoning from fragmented visual cues.

Method: 1) A scalable multi-view data generation and annotation pipeline that constructs realistic spatial reasoning QA pairs, creating SpatialMosaic dataset (2M QA pairs). 2) SpatialMosaic-Bench benchmark with 1M QA pairs across 6 tasks for evaluating multi-view spatial reasoning under realistic challenging scenarios. 3) SpatialMosaicVLM, a hybrid framework integrating 3D reconstruction models as geometry encoders within VLMs for robust spatial reasoning.

Result: Extensive experiments demonstrate that the proposed dataset and VQA tasks effectively enhance spatial reasoning under challenging multi-view conditions. The data generation pipeline successfully constructs realistic and diverse QA pairs that improve model performance on spatial reasoning tasks in challenging real-world scenarios.

Conclusion: The work addresses key limitations in current 3D scene understanding approaches by providing a scalable data generation pipeline, comprehensive benchmark, and hybrid VLM framework that enables robust spatial reasoning from fragmented visual cues in real-world environments with partial visibility and occlusion challenges.

Abstract: The rapid progress of Multimodal Large Language Models (MLLMs) has unlocked the potential for enhanced 3D scene understanding and spatial reasoning. However, existing approaches often rely on pre-constructed 3D representations or off-the-shelf reconstruction pipelines, which constrain scalability and real-world applicability. A recent line of work explores learning spatial reasoning directly from multi-view images, enabling Vision-Language Models (VLMs) to understand 3D scenes without explicit 3D reconstructions. Nevertheless, key challenges that frequently arise in real-world environments, such as partial visibility, occlusion, and low-overlap conditions that require spatial reasoning from fragmented visual cues, remain under-explored. To address these limitations, we propose a scalable multi-view data generation and annotation pipeline that constructs realistic spatial reasoning QAs, resulting in SpatialMosaic, a comprehensive instruction-tuning dataset featuring 2M QA pairs. We further introduce SpatialMosaic-Bench, a challenging benchmark for evaluating multi-view spatial reasoning under realistic and challenging scenarios, consisting of 1M QA pairs across 6 tasks. In addition, we present SpatialMosaicVLM, a hybrid framework that integrates 3D reconstruction models as geometry encoders within VLMs for robust spatial reasoning. Extensive experiments demonstrate that our proposed dataset and VQA tasks effectively enhance spatial reasoning under challenging multi-view conditions, validating the effectiveness of our data generation pipeline in constructing realistic and diverse QA pairs. Code and dataset will be available soon.

[265] Training-Free Diffusion Priors for Text-to-Image Generation via Optimization-based Visual Inversion

Samuele Dell’Erba, Andrew D. Bagdanov

Main category: cs.CV

TL;DR: The paper proposes Optimization-based Visual Inversion (OVI) as a training-free, zero-shot alternative to computationally expensive diffusion priors for text-to-image generation, with novel constraints that improve visual fidelity.

Details

Motivation: Current diffusion models rely on computationally expensive prior networks that require massive training datasets. The authors challenge whether trained priors are necessary at all, seeking a training-free alternative.

Method: Proposes OVI which initializes latent visual representations from random pseudo-tokens and optimizes them to maximize cosine similarity with text embeddings. Introduces two novel constraints: Mahalanobis-based loss and Nearest-Neighbor loss to regularize optimization toward realistic image distributions.

Result: OVI serves as viable alternative to traditional priors. Analysis reveals critical flaw in current benchmarks where text embedding alone achieves high scores despite poor perceptual quality. Constrained OVI methods improve visual fidelity, with Nearest-Neighbor approach achieving scores comparable to state-of-the-art data-efficient prior.

Conclusion: Optimization-based strategies like OVI with proper constraints can serve as effective, training-free alternatives to traditional diffusion priors, challenging the necessity of computationally expensive trained priors in text-to-image generation.

Abstract: Diffusion models have established the state-of-the-art in text-to-image generation, but their performance often relies on a diffusion prior network to translate text embeddings into the visual manifold for easier decoding. These priors are computationally expensive and require extensive training on massive datasets. In this work, we challenge the necessity of a trained prior at all by employing Optimization-based Visual Inversion (OVI), a training-free and zero-shot alternative, to replace the need for a prior. OVI initializes a latent visual representation from random pseudo-tokens and iteratively optimizes it to maximize the cosine similarity with the input textual prompt embedding. We further propose two novel constraints, a Mahalanobis-based and a Nearest-Neighbor loss, to regularize the OVI optimization process toward the distribution of realistic images. Our experiments, conducted on Kandinsky 2.2, show that OVI can serve as an alternative to traditional priors. More importantly, our analysis reveals a critical flaw in current evaluation benchmarks like T2I-CompBench++, where simply using the text embedding as a prior achieves surprisingly high scores, despite lower perceptual quality. Our constrained OVI methods improve visual fidelity over this baseline, with the Nearest-Neighbor approach proving particularly effective. It achieves quantitative scores comparable to or higher than the state-of-the-art data-efficient prior, underscoring the potential of optimization-based strategies as viable, training-free alternatives to traditional priors. The code will be publicly available upon acceptance.

[266] MGCA-Net: Multi-Graph Contextual Attention Network for Two-View Correspondence Learning

Shuyuan Lin, Mengtin Lo, Haosheng Chen, Yanjie Liang, Qiangqiang Wu

Main category: cs.CV

TL;DR: MGCA-Net improves two-view correspondence learning with contextual geometric attention and cross-stage multi-graph consensus modules, achieving state-of-the-art performance on outlier rejection and camera pose estimation.

Details

Motivation: Existing two-view correspondence learning methods have limitations in local geometric modeling and cross-stage information optimization, making it difficult to accurately capture geometric constraints and reducing model robustness.

Method: Proposes Multi-Graph Contextual Attention Network (MGCA-Net) with two key modules: Contextual Geometric Attention (CGA) that dynamically integrates spatial position and feature information via adaptive attention, and Cross-Stage Multi-Graph Consensus (CSMGC) that establishes geometric consensus via cross-stage sparse graph network.

Result: Experimental results on YFCC100M and SUN3D datasets show MGCA-Net significantly outperforms existing SOTA methods in outlier rejection and camera pose estimation tasks.

Conclusion: MGCA-Net effectively addresses limitations in geometric modeling and cross-stage optimization for two-view correspondence learning, providing improved robustness and performance for computer vision applications.

Abstract: Two-view correspondence learning is a key task in computer vision, which aims to establish reliable matching relationships for applications such as camera pose estimation and 3D reconstruction. However, existing methods have limitations in local geometric modeling and cross-stage information optimization, which make it difficult to accurately capture the geometric constraints of matched pairs and thus reduce the robustness of the model. To address these challenges, we propose a Multi-Graph Contextual Attention Network (MGCA-Net), which consists of a Contextual Geometric Attention (CGA) module and a Cross-Stage Multi-Graph Consensus (CSMGC) module. Specifically, CGA dynamically integrates spatial position and feature information via an adaptive attention mechanism and enhances the capability to capture both local and global geometric relationships. Meanwhile, CSMGC establishes geometric consensus via a cross-stage sparse graph network, ensuring the consistency of geometric information across different stages. Experimental results on two representative YFCC100M and SUN3D datasets show that MGCA-Net significantly outperforms existing SOTA methods in the outlier rejection and camera pose estimation tasks. Source code is available at http://www.linshuyuan.com.

[267] NeXT-IMDL: Build Benchmark for NeXT-Generation Image Manipulation Detection & Localization

Yifei Li, Haoyuan He, Yu Zheng, Bingyao Yu, Wenzhao Zheng, Lei Chen, Jie Zhou, Jiwen Lu

Main category: cs.CV

TL;DR: NeXT-IMDL is a diagnostic benchmark that exposes the fragility of current image manipulation detection models by systematically testing them across diverse AI-generated content scenarios, revealing significant performance degradation in real-world generalization settings.

Details

Motivation: The paper addresses the urgent need for robust image manipulation detection methods due to increasing accessibility and abuse risks of user-friendly image editing models. Current cross-dataset evaluation approaches create misleading impressions of progress by concealing model fragility when handling diverse AI-generated content.

Method: The authors propose NeXT-IMDL, a large-scale diagnostic benchmark that categorizes AI-generated content manipulations along four axes: editing models, manipulation types, content semantics, and forgery granularity. They implement five rigorous cross-dimension evaluation protocols to systematically probe generalization boundaries.

Result: Experiments on 11 representative models reveal that while models perform well in original settings, they exhibit systemic failures and significant performance degradation when evaluated under protocols simulating real-world generalization scenarios.

Conclusion: The paper provides a diagnostic toolkit and new findings to advance development of truly robust, next-generation image manipulation detection and localization models by exposing current limitations and establishing more realistic evaluation standards.

Abstract: The accessibility surge and abuse risks of user-friendly image editing models have created an urgent need for generalizable, up-to-date methods for Image Manipulation Detection and Localization (IMDL). Current IMDL research typically uses cross-dataset evaluation, where models trained on one benchmark are tested on others. However, this simplified evaluation approach conceals the fragility of existing methods when handling diverse AI-generated content, leading to misleading impressions of progress. This paper challenges this illusion by proposing NeXT-IMDL, a large-scale diagnostic benchmark designed not just to collect data, but to probe the generalization boundaries of current detectors systematically. Specifically, NeXT-IMDL categorizes AIGC-based manipulations along four fundamental axes: editing models, manipulation types, content semantics, and forgery granularity. Built upon this, NeXT-IMDL implements five rigorous cross-dimension evaluation protocols. Our extensive experiments on 11 representative models reveal a critical insight: while these models perform well in their original settings, they exhibit systemic failures and significant performance degradation when evaluated under our designed protocols that simulate real-world, various generalization scenarios. By providing this diagnostic toolkit and the new findings, we aim to advance the development towards building truly robust, next-generation IMDL models.

[268] AnyMS: Bottom-up Attention Decoupling for Layout-guided and Training-free Multi-subject Customization

Binhe Yu, Zhen Wang, Kexin Li, Yuqian Yuan, Wenqiao Zhang, Long Chen, Juncheng Li, Jun Xiao, Yueting Zhuang

Main category: cs.CV

TL;DR: AnyMS is a training-free framework for multi-subject image customization that uses layout guidance and attention decoupling to balance text alignment, subject identity preservation, and layout control without additional training.

Details

Motivation: Existing multi-subject customization methods struggle to balance text alignment, subject identity preservation, and layout control, while requiring additional training that limits scalability and efficiency.

Method: AnyMS uses a bottom-up dual-level attention decoupling mechanism: global decoupling separates text-visual attention for text alignment, and local decoupling confines each subject’s attention to its designated layout area to prevent conflicts. It employs pre-trained image adapters for feature extraction without subject learning.

Result: Extensive experiments show AnyMS achieves state-of-the-art performance, supports complex compositions, and scales to larger numbers of subjects while being training-free.

Conclusion: AnyMS provides an effective training-free solution for layout-guided multi-subject customization that successfully balances the three critical objectives and offers better scalability.

Abstract: Multi-subject customization aims to synthesize multiple user-specified subjects into a coherent image. To address issues such as subjects missing or conflicts, recent works incorporate layout guidance to provide explicit spatial constraints. However, existing methods still struggle to balance three critical objectives: text alignment, subject identity preservation, and layout control, while the reliance on additional training further limits their scalability and efficiency. In this paper, we present AnyMS, a novel training-free framework for layout-guided multi-subject customization. AnyMS leverages three input conditions: text prompt, subject images, and layout constraints, and introduces a bottom-up dual-level attention decoupling mechanism to harmonize their integration during generation. Specifically, global decoupling separates cross-attention between textual and visual conditions to ensure text alignment. Local decoupling confines each subject’s attention to its designated area, which prevents subject conflicts and thus guarantees identity preservation and layout control. Moreover, AnyMS employs pre-trained image adapters to extract subject-specific features aligned with the diffusion model, removing the need for subject learning or adapter tuning. Extensive experiments demonstrate that AnyMS achieves state-of-the-art performance, supporting complex compositions and scaling to a larger number of subjects.

[269] SOFTooth: Semantics-Enhanced Order-Aware Fusion for Tooth Instance Segmentation

Xiaolan Li, Wanquan Liu, Pengcheng Li, Pengyu Jie, Chenqiang Gao

Main category: cs.CV

TL;DR: SOFTooth is a 2D-3D fusion framework for 3D tooth instance segmentation that leverages frozen 2D SAM embeddings without 2D mask supervision, addressing challenges like boundary leakage, center drift, and inconsistent tooth identities.

Details

Motivation: 3D tooth instance segmentation faces challenges including crowded arches, ambiguous tooth-gingiva boundaries, missing teeth, and rare third molars. Native 3D methods suffer from boundary leakage and center drift, while 2D foundation models like SAM provide strong semantics but are impractical for direct 3D clinical workflows.

Method: SOFTooth uses three key components: 1) Point-wise residual gating module injects occlusal-view SAM embeddings into 3D point features to refine boundaries; 2) Center-guided mask refinement regularizes consistency between instance masks and geometric centroids; 3) Order-aware Hungarian matching integrates anatomical tooth order and center distance for coherent labeling under missing or crowded dentitions.

Result: On 3DTeethSeg'22 dataset, SOFTooth achieves state-of-the-art overall accuracy and mean IoU, with clear gains on cases involving third molars, demonstrating effective transfer of 2D semantics to 3D tooth instance segmentation without 2D fine-tuning.

Conclusion: Rich 2D semantics from foundation models like SAM can be effectively transferred to 3D tooth instance segmentation without 2D fine-tuning through the proposed SOFTooth framework, which addresses key challenges in dental segmentation through semantics-enhanced, order-aware 2D-3D fusion.

Abstract: Three-dimensional (3D) tooth instance segmentation remains challenging due to crowded arches, ambiguous tooth-gingiva boundaries, missing teeth, and rare yet clinically important third molars. Native 3D methods relying on geometric cues often suffer from boundary leakage, center drift, and inconsistent tooth identities, especially for minority classes and complex anatomies. Meanwhile, 2D foundation models such as the Segment Anything Model (SAM) provide strong boundary-aware semantics, but directly applying them in 3D is impractical in clinical workflows. To address these issues, we propose SOFTooth, a semantics-enhanced, order-aware 2D-3D fusion framework that leverages frozen 2D semantics without explicit 2D mask supervision. First, a point-wise residual gating module injects occlusal-view SAM embeddings into 3D point features to refine tooth-gingiva and inter-tooth boundaries. Second, a center-guided mask refinement regularizes consistency between instance masks and geometric centroids, reducing center drift. Furthermore, an order-aware Hungarian matching strategy integrates anatomical tooth order and center distance into similarity-based assignment, ensuring coherent labeling even under missing or crowded dentitions. On 3DTeethSeg'22, SOFTooth achieves state-of-the-art overall accuracy and mean IoU, with clear gains on cases involving third molars, demonstrating that rich 2D semantics can be effectively transferred to 3D tooth instance segmentation without 2D fine-tuning.

[270] Bridging Cognitive Gap: Hierarchical Description Learning for Artistic Image Aesthetics Assessment

Henglin Liu, Nisha Huang, Chang Liu, Jiangpeng Yan, Huijuan Huang, Jixuan Ying, Tong-Yee Lee, Pengfei Wan, Xiangyang Ji

Main category: cs.CV

TL;DR: ArtQuant framework for aesthetic quality assessment of artistic images using RAD dataset and LLM decoders to address data scarcity and model fragmentation challenges.

Details

Motivation: Aesthetic quality assessment is crucial for human-aligned AIGC evaluation, but faces challenges: (1) data scarcity/imbalance in existing datasets focusing only on visual perception, and (2) model fragmentation where current methods isolate aesthetic attributes or struggle with long-form textual descriptions.

Method: Proposes ArtQuant framework with two components: (1) RAD dataset - large-scale (70k) multi-dimensional structured dataset generated via iterative pipeline without heavy annotation, and (2) framework that couples aesthetic dimensions through joint description generation and uses LLM decoders to better model long-text semantics.

Result: Achieves state-of-the-art performance on several datasets while requiring only 33% of conventional training epochs. Theoretical analysis shows RAD’s semantic adequacy and generation paradigm collectively minimize prediction entropy, providing mathematical grounding.

Conclusion: The approach narrows the cognitive gap between artistic images and aesthetic judgment, with code and dataset to be released for future research. The symbiosis between data (RAD) and model (ArtQuant) effectively addresses fundamental challenges in aesthetic assessment.

Abstract: The aesthetic quality assessment task is crucial for developing a human-aligned quantitative evaluation system for AIGC. However, its inherently complex nature, spanning visual perception, cognition, and emotion, poses fundamental challenges. Although aesthetic descriptions offer a viable representation of this complexity, two critical challenges persist: (1) data scarcity and imbalance: existing dataset overly focuses on visual perception and neglects deeper dimensions due to the expensive manual annotation; and (2) model fragmentation: current visual networks isolate aesthetic attributes with multi-branch encoder, while multimodal methods represented by contrastive learning struggle to effectively process long-form textual descriptions. To resolve challenge (1), we first present the Refined Aesthetic Description (RAD) dataset, a large-scale (70k), multi-dimensional structured dataset, generated via an iterative pipeline without heavy annotation costs and easy to scale. To address challenge (2), we propose ArtQuant, an aesthetics assessment framework for artistic images which not only couples isolated aesthetic dimensions through joint description generation, but also better models long-text semantics with the help of LLM decoders. Besides, theoretical analysis confirms this symbiosis: RAD’s semantic adequacy (data) and generation paradigm (model) collectively minimize prediction entropy, providing mathematical grounding for the framework. Our approach achieves state-of-the-art performance on several datasets while requiring only 33% of conventional training epochs, narrowing the cognitive gap between artistic images and aesthetic judgment. We will release both code and dataset to support future research.

[271] PathFound: An Agentic Multimodal Model Activating Evidence-seeking Pathological Diagnosis

Shengyi Hua, Jianfeng Wu, Tianle Shen, Kangzhe Hu, Zhongzhen Huang, Shujuan Ni, Zhihong Zhang, Yuan Li, Zhe Wang, Xiaofan Zhang

Main category: cs.CV

TL;DR: PathFound is an agentic multimodal model for pathology that uses evidence-seeking inference to improve diagnostic accuracy by actively acquiring additional information when diagnoses are ambiguous.

Details

Motivation: Current pathological foundation models use static inference where whole-slide images are processed once without reassessment, unlike clinical workflows where pathologists refine hypotheses through repeated observations and additional examinations when diagnoses are unclear.

Method: PathFound integrates pathological visual foundation models, vision-language models, and reasoning models trained with reinforcement learning. It performs proactive information acquisition through three stages: initial diagnosis, evidence-seeking, and final decision.

Result: The evidence-seeking strategy consistently improves diagnostic accuracy across several large multimodal models. PathFound achieves state-of-the-art diagnostic performance across diverse clinical scenarios and demonstrates strong potential to discover subtle pathological details like nuclear features and local invasions.

Conclusion: Evidence-seeking workflows are effective in computational pathology, and PathFound represents a significant advancement by mimicking clinical diagnostic workflows through active information acquisition and hypothesis refinement.

Abstract: Recent pathological foundation models have substantially advanced visual representation learning and multimodal interaction. However, most models still rely on a static inference paradigm in which whole-slide images are processed once to produce predictions, without reassessment or targeted evidence acquisition under ambiguous diagnoses. This contrasts with clinical diagnostic workflows that refine hypotheses through repeated slide observations and further examination requests. We propose PathFound, an agentic multimodal model designed to support evidence-seeking inference in pathological diagnosis. PathFound integrates the power of pathological visual foundation models, vision-language models, and reasoning models trained with reinforcement learning to perform proactive information acquisition and diagnosis refinement by progressing through the initial diagnosis, evidence-seeking, and final decision stages. Across several large multimodal models, adopting this strategy consistently improves diagnostic accuracy, indicating the effectiveness of evidence-seeking workflows in computational pathology. Among these models, PathFound achieves state-of-the-art diagnostic performance across diverse clinical scenarios and demonstrates strong potential to discover subtle details, such as nuclear features and local invasions.

[272] DriveLaW:Unifying Planning and Video Generation in a Latent Driving World

Tianze Xia, Yongkang Li, Lijun Zhou, Jingfeng Yao, Kaixin Xiong, Haiyang Sun, Bing Wang, Kun Ma, Hangjun Ye, Wenyu Liu, Xinggang Wang

Main category: cs.CV

TL;DR: DriveLaW is a unified paradigm that integrates video generation (world modeling) and motion planning for autonomous driving, achieving state-of-the-art performance in both tasks through latent representation sharing and progressive training.

Details

Motivation: Current autonomous driving approaches treat world modeling and motion planning as decoupled processes, limiting their effectiveness. The authors aim to bridge this gap by creating a truly unified architecture that ensures consistency between future scene generation and trajectory planning.

Method: DriveLaW consists of two core components: DriveLaW-Video (a world model for high-fidelity forecasting with expressive latent representations) and DriveLaW-Act (a diffusion planner that generates trajectories from DriveLaW-Video’s latent representations). Both components are optimized using a three-stage progressive training strategy.

Result: DriveLaW achieves state-of-the-art results in both video prediction and motion planning. It surpasses the best-performing work by 33.3% in FID and 1.8% in FVD for video prediction, and sets a new record on the NAVSIM planning benchmark.

Conclusion: The unified paradigm of DriveLaW demonstrates that directly integrating world modeling and motion planning through shared latent representations leads to superior performance in autonomous driving, addressing the limitations of decoupled approaches and advancing both video generation and planning capabilities.

Abstract: World models have become crucial for autonomous driving, as they learn how scenarios evolve over time to address the long-tail challenges of the real world. However, current approaches relegate world models to limited roles: they operate within ostensibly unified architectures that still keep world prediction and motion planning as decoupled processes. To bridge this gap, we propose DriveLaW, a novel paradigm that unifies video generation and motion planning. By directly injecting the latent representation from its video generator into the planner, DriveLaW ensures inherent consistency between high-fidelity future generation and reliable trajectory planning. Specifically, DriveLaW consists of two core components: DriveLaW-Video, our powerful world model that generates high-fidelity forecasting with expressive latent representations, and DriveLaW-Act, a diffusion planner that generates consistent and reliable trajectories from the latent of DriveLaW-Video, with both components optimized by a three-stage progressive training strategy. The power of our unified paradigm is demonstrated by new state-of-the-art results across both tasks. DriveLaW not only advances video prediction significantly, surpassing best-performing work by 33.3% in FID and 1.8% in FVD, but also achieves a new record on the NAVSIM planning benchmark.

[273] Direct Diffusion Score Preference Optimization via Stepwise Contrastive Policy-Pair Supervision

Dohyun Kim, Seungwoo Lyu, Seung Wook Kim, Paul Hongsuck Seo

Main category: cs.CV

TL;DR: DDSPO is a new preference optimization method for diffusion models that provides per-timestep supervision using automatically generated preference signals from pretrained models, improving text-image alignment without costly human annotations.

Details

Motivation: Diffusion models struggle with nuanced user intent alignment and consistent aesthetic quality. Existing preference-based methods like DDPO require expensive human-labeled datasets which are costly and potentially noisy.

Method: Direct Diffusion Score Preference Optimization (DDSPO) directly derives per-timestep supervision from winning and losing policies when available. It avoids labeled data by automatically generating preference signals using a pretrained reference model, contrasting outputs conditioned on original prompts versus semantically degraded variants.

Result: DDSPO improves text-image alignment and visual quality, outperforming or matching existing preference-based methods while requiring significantly less supervision.

Conclusion: DDSPO provides an effective score-space preference supervision approach without explicit reward modeling or manual annotations, offering a practical strategy for improving diffusion model alignment.

Abstract: Diffusion models have achieved impressive results in generative tasks such as text-to-image synthesis, yet they often struggle to fully align outputs with nuanced user intent and maintain consistent aesthetic quality. Existing preference-based training methods like Diffusion Direct Preference Optimization help address these issues but rely on costly and potentially noisy human-labeled datasets. In this work, we introduce Direct Diffusion Score Preference Optimization (DDSPO), which directly derives per-timestep supervision from winning and losing policies when such policies are available. Unlike prior methods that operate solely on final samples, DDSPO provides dense, transition-level signals across the denoising trajectory. In practice, we avoid reliance on labeled data by automatically generating preference signals using a pretrained reference model: we contrast its outputs when conditioned on original prompts versus semantically degraded variants. This practical strategy enables effective score-space preference supervision without explicit reward modeling or manual annotations. Empirical results demonstrate that DDSPO improves text-image alignment and visual quality, outperforming or matching existing preference-based methods while requiring significantly less supervision. Our implementation is available at: https://dohyun-as.github.io/DDSPO

[274] RxnBench: A Multimodal Benchmark for Evaluating Large Language Models on Chemical Reaction Understanding from Scientific Literature

Hanzheng Li, Xi Fang, Yixuan Li, Chaozheng Huang, Junjie Wang, Xi Wang, Hongzhe Bai, Bojun Hao, Shenyu Lin, Huiqi Liang, Linfeng Zhang, Guolin Ke

Main category: cs.CV

TL;DR: RxnBench is a new benchmark for evaluating Multimodal LLMs on chemical reaction understanding from scientific PDFs, revealing significant gaps in models’ ability to comprehend graphical chemical language and perform deep chemical reasoning.

Details

Motivation: Current MLLMs lack rigorous evaluation on their ability to understand the dense, graphical language of chemical reactions in authentic scientific literature, which is crucial for revolutionizing scientific discovery in chemistry.

Method: Created RxnBench with two tasks: Single-Figure QA (1,525 questions from 305 reaction schemes testing visual perception and mechanistic reasoning) and Full-Document QA (synthesizing information from 108 articles requiring cross-modal integration of text, schemes, and tables).

Result: MLLMs show critical capability gaps - they excel at text extraction but struggle with deep chemical logic and precise structural recognition. Models with inference-time reasoning outperform standard architectures, but none achieve 50% accuracy on FD-QA.

Conclusion: There’s an urgent need for domain-specific visual encoders and stronger reasoning engines to advance autonomous AI chemists, as current MLLMs cannot adequately comprehend chemical reaction language in scientific literature.

Abstract: The integration of Multimodal Large Language Models (MLLMs) into chemistry promises to revolutionize scientific discovery, yet their ability to comprehend the dense, graphical language of reactions within authentic literature remains underexplored. Here, we introduce RxnBench, a multi-tiered benchmark designed to rigorously evaluate MLLMs on chemical reaction understanding from scientific PDFs. RxnBench comprises two tasks: Single-Figure QA (SF-QA), which tests fine-grained visual perception and mechanistic reasoning using 1,525 questions derived from 305 curated reaction schemes, and Full-Document QA (FD-QA), which challenges models to synthesize information from 108 articles, requiring cross-modal integration of text, schemes, and tables. Our evaluation of MLLMs reveals a critical capability gap: while models excel at extracting explicit text, they struggle with deep chemical logic and precise structural recognition. Notably, models with inference-time reasoning significantly outperform standard architectures, yet none achieve 50% accuracy on FD-QA. These findings underscore the urgent need for domain-specific visual encoders and stronger reasoning engines to advance autonomous AI chemists.

[275] Towards Integrating Uncertainty for Domain-Agnostic Segmentation

Jesse Brouwers, Xiaoyan Xing, Alexander Timans

Main category: cs.CV

TL;DR: UncertSAM benchmark evaluates uncertainty quantification methods for SAM segmentation models to improve robustness in challenging domains like shadows, transparency, and camouflage.

Details

Motivation: SAM models show strong zero-shot performance but remain vulnerable in shifted or limited-knowledge domains. The paper investigates whether uncertainty quantification can mitigate these challenges and enhance model generalizability in a domain-agnostic manner.

Method: 1) Created UncertSAM benchmark with eight datasets designed to stress-test SAM under challenging segmentation conditions; 2) Evaluated lightweight, post-hoc uncertainty estimation methods; 3) Assessed preliminary uncertainty-guided prediction refinement.

Result: Last-layer Laplace approximation yields uncertainty estimates that correlate well with segmentation errors, indicating meaningful signal. While refinement benefits are preliminary, uncertainty shows potential for improving robust, domain-agnostic performance.

Conclusion: Uncertainty quantification can enhance segmentation model robustness in challenging domains. The UncertSAM benchmark and code are publicly available to support further research in this direction.

Abstract: Foundation models for segmentation such as the Segment Anything Model (SAM) family exhibit strong zero-shot performance, but remain vulnerable in shifted or limited-knowledge domains. This work investigates whether uncertainty quantification can mitigate such challenges and enhance model generalisability in a domain-agnostic manner. To this end, we (1) curate UncertSAM, a benchmark comprising eight datasets designed to stress-test SAM under challenging segmentation conditions including shadows, transparency, and camouflage; (2) evaluate a suite of lightweight, post-hoc uncertainty estimation methods; and (3) assess a preliminary uncertainty-guided prediction refinement step. Among evaluated approaches, a last-layer Laplace approximation yields uncertainty estimates that correlate well with segmentation errors, indicating a meaningful signal. While refinement benefits are preliminary, our findings underscore the potential of incorporating uncertainty into segmentation models to support robust, domain-agnostic performance. Our benchmark and code are made publicly available.

[276] Automated river gauge plate reading using a hybrid object detection and generative AI framework in the Limpopo River Basin

Kayathri Vigneswaran, Hugo Retief, Jai Clifford Holmes, Mariangel Garcia Andarcia, Hansaka Tennakoon

Main category: cs.CV

TL;DR: A hybrid framework combining vision-based waterline detection, YOLOv8 scale extraction, and multimodal LLMs (GPT-4o/Gemini) for automated river gauge reading achieves high accuracy with MAE of 5.43 cm.

Details

Motivation: Traditional hydrological monitoring methods suffer from manual measurement errors and environmental limitations, creating need for automated, accurate river water level monitoring for flood forecasting and water resource management.

Method: Sequential framework: image preprocessing → annotation → waterline detection → YOLOv8 pose scale extraction → LLM-based numeric reading extraction. Combines computer vision for geometric calibration with multimodal LLMs for reading interpretation.

Result: Waterline detection: 94.24% precision, 83.64% F1 score. Gemini Stage 2 achieved best performance: MAE=5.43 cm, RMSE=8.58 cm, R²=0.84. LLMs sensitive to image quality - degraded images increased errors. Scale gap metadata significantly improved LLM predictions.

Conclusion: Hybrid approach combining geometric metadata with multimodal AI provides scalable, efficient solution for automated hydrological monitoring. Demonstrates potential for real-time river gauge digitization and improved water resource management.

Abstract: Accurate and continuous monitoring of river water levels is essential for flood forecasting, water resource management, and ecological protection. Traditional hydrological observation methods are often limited by manual measurement errors and environmental constraints. This study presents a hybrid framework integrating vision based waterline detection, YOLOv8 pose scale extraction, and large multimodal language models (GPT 4o and Gemini 2.0 Flash) for automated river gauge plate reading. The methodology involves sequential stages of image preprocessing, annotation, waterline detection, scale gap estimation, and numeric reading extraction. Experiments demonstrate that waterline detection achieved high precision of 94.24 percent and an F1 score of 83.64 percent, while scale gap detection provided accurate geometric calibration for subsequent reading extraction. Incorporating scale gap metadata substantially improved the predictive performance of LLMs, with Gemini Stage 2 achieving the highest accuracy, with a mean absolute error of 5.43 cm, root mean square error of 8.58 cm, and R squared of 0.84 under optimal image conditions. Results highlight the sensitivity of LLMs to image quality, with degraded images producing higher errors, and underscore the importance of combining geometric metadata with multimodal artificial intelligence for robust water level estimation. Overall, the proposed approach offers a scalable, efficient, and reliable solution for automated hydrological monitoring, demonstrating potential for real time river gauge digitization and improved water resource management.

[277] Deterministic Image-to-Image Translation via Denoising Brownian Bridge Models with Dual Approximators

Bohan Xiao, Peiyong Wang, Qisheng He, Ming Dong

Main category: cs.CV

TL;DR: A novel denoising Brownian bridge model with dual approximators (Dual-approx Bridge) for deterministic image-to-image translation that produces consistent, high-fidelity outputs with negligible variance.

Details

Motivation: To address the need for deterministic I2I translation that guarantees consistent, predictable outputs closely matching ground truth, particularly in applications like image super-resolution where high fidelity and faithfulness are crucial.

Method: Proposes a denoising Brownian bridge model with dual neural network approximators - one for forward process and one for reverse process - leveraging Brownian bridge dynamics to achieve deterministic translation with minimal variance.

Result: Extensive experiments on benchmark datasets for image generation and super-resolution show consistent superior performance in image quality and faithfulness to ground truth compared to both stochastic and deterministic baselines.

Conclusion: The Dual-approx Bridge model effectively addresses deterministic I2I translation challenges, producing high-quality, faithful outputs with negligible variance, demonstrating advantages over existing approaches.

Abstract: Image-to-Image (I2I) translation involves converting an image from one domain to another. Deterministic I2I translation, such as in image super-resolution, extends this concept by guaranteeing that each input generates a consistent and predictable output, closely matching the ground truth (GT) with high fidelity. In this paper, we propose a denoising Brownian bridge model with dual approximators (Dual-approx Bridge), a novel generative model that exploits the Brownian bridge dynamics and two neural network-based approximators (one for forward and one for reverse process) to produce faithful output with negligible variance and high image quality in I2I translations. Our extensive experiments on benchmark datasets including image generation and super-resolution demonstrate the consistent and superior performance of Dual-approx Bridge in terms of image quality and faithfulness to GT when compared to both stochastic and deterministic baselines. Project page and code: https://github.com/bohan95/dual-app-bridge

[278] MCI-Net: A Robust Multi-Domain Context Integration Network for Point Cloud Registration

Shuyuan Lin, Wenwu Peng, Junjie Huang, Qiang Qi, Miaohui Wang, Jian Weng

Main category: cs.CV

TL;DR: MCI-Net improves point cloud registration by aggregating contextual cues from diverse domains using graph neighborhood aggregation, progressive context interaction, and dynamic inlier selection.

Details

Motivation: Existing deep learning methods for point cloud registration rely on Euclidean neighborhood strategies that fail to capture implicit semantics and structural consistency in point clouds, limiting feature representation quality.

Method: Proposes a multi-domain context integration network with three key components: 1) Graph neighborhood aggregation module for global structural relationships, 2) Progressive context interaction module for intra-domain feature decoupling and inter-domain context interaction, 3) Dynamic inlier selection method using residual information from multiple pose estimation iterations.

Result: Achieves state-of-the-art performance with 96.4% registration recall on 3DMatch benchmark, outperforming existing methods on both indoor RGB-D and outdoor LiDAR datasets.

Conclusion: MCI-Net effectively addresses limitations of Euclidean neighborhood-based approaches by integrating multi-domain contextual information, resulting in more robust and discriminative feature learning for high-quality point cloud registration.

Abstract: Robust and discriminative feature learning is critical for high-quality point cloud registration. However, existing deep learning-based methods typically rely on Euclidean neighborhood-based strategies for feature extraction, which struggle to effectively capture the implicit semantics and structural consistency in point clouds. To address these issues, we propose a multi-domain context integration network (MCI-Net) that improves feature representation and registration performance by aggregating contextual cues from diverse domains. Specifically, we propose a graph neighborhood aggregation module, which constructs a global graph to capture the overall structural relationships within point clouds. We then propose a progressive context interaction module to enhance feature discriminability by performing intra-domain feature decoupling and inter-domain context interaction. Finally, we design a dynamic inlier selection method that optimizes inlier weights using residual information from multiple iterations of pose estimation, thereby improving the accuracy and robustness of registration. Extensive experiments on indoor RGB-D and outdoor LiDAR datasets show that the proposed MCI-Net significantly outperforms existing state-of-the-art methods, achieving the highest registration recall of 96.4% on 3DMatch. Source code is available at http://www.linshuyuan.com.

[279] SC-Net: Robust Correspondence Learning via Spatial and Cross-Channel Context

Shuyuan Lin, Hailiang Liao, Qiang Qi, Junjie Huang, Taotao Lai, Jian Weng

Main category: cs.CV

TL;DR: SC-Net: A novel network for two-view correspondence learning that integrates bilateral context from spatial and channel perspectives to address limitations of CNN backbones in handling large disparity scenes.

Details

Motivation: CNN backbones in two-view correspondence learning often fail to effectively aggregate global context and oversmooth dense motion fields in scenes with large disparity, requiring more tailored solutions.

Method: Proposes SC-Net with three key modules: 1) Adaptive Focused Regularization (AFR) for position-awareness and robustness against spurious motion, 2) Bilateral Field Adjustment (BFA) to refine motion fields by modeling long-range relationships across spatial/channel dimensions, and 3) Position-Aware Recovery (PAR) to ensure consistency and precision in motion vector recovery.

Result: Outperforms state-of-the-art methods in relative pose estimation and outlier removal tasks on YFCC100M and SUN3D datasets.

Conclusion: SC-Net effectively addresses CNN backbone limitations in two-view correspondence learning by integrating bilateral context, demonstrating superior performance in challenging scenes with large disparity.

Abstract: Recent research has focused on using convolutional neural networks (CNNs) as the backbones in two-view correspondence learning, demonstrating significant superiority over methods based on multilayer perceptrons. However, CNN backbones that are not tailored to specific tasks may fail to effectively aggregate global context and oversmooth dense motion fields in scenes with large disparity. To address these problems, we propose a novel network named SC-Net, which effectively integrates bilateral context from both spatial and channel perspectives. Specifically, we design an adaptive focused regularization module (AFR) to enhance the model’s position-awareness and robustness against spurious motion samples, thereby facilitating the generation of a more accurate motion field. We then propose a bilateral field adjustment module (BFA) to refine the motion field by simultaneously modeling long-range relationships and facilitating interaction across spatial and channel dimensions. Finally, we recover the motion vectors from the refined field using a position-aware recovery module (PAR) that ensures consistency and precision. Extensive experiments demonstrate that SC-Net outperforms state-of-the-art methods in relative pose estimation and outlier removal tasks on YFCC100M and SUN3D datasets. Source code is available at http://www.linshuyuan.com.

[280] TV-RAG: A Temporal-aware and Semantic Entropy-Weighted Framework for Long Video Retrieval and Understanding

Zongsheng Cao, Yangfan He, Anran Liu, Feng Chen, Zepeng Wang, Jun Xie

Main category: cs.CV

TL;DR: TV-RAG is a training-free architecture that improves long-video reasoning for Large Video Language Models by combining temporal alignment with entropy-guided semantics, outperforming existing baselines on major benchmarks.

Details

Motivation: Current LVLMs struggle with long videos due to narrow temporal windows and inability to detect fine-grained semantic shifts over extended durations. Traditional text-based retrieval pipelines also fail to capture temporal interdependence among visual, audio, and subtitle channels.

Method: TV-RAG introduces two main mechanisms: 1) a time-decay retrieval module that injects explicit temporal offsets into similarity computation to rank text queries by their true multimedia context, and 2) an entropy-weighted key-frame sampler that selects evenly spaced, information-dense frames to reduce redundancy while preserving representativeness.

Result: The system consistently surpasses most leading baselines across established long-video benchmarks including Video-MME, MLVU, and LongVideoBench, confirming its effectiveness.

Conclusion: TV-RAG provides a lightweight, budget-friendly upgrade path that can be grafted onto any LVLM without re-training or fine-tuning, offering improved long-video reasoning through temporal and semantic signal integration.

Abstract: Large Video Language Models (LVLMs) have rapidly emerged as the focus of multimedia AI research. Nonetheless, when confronted with lengthy videos, these models struggle: their temporal windows are narrow, and they fail to notice fine-grained semantic shifts that unfold over extended durations. Moreover, mainstream text-based retrieval pipelines, which rely chiefly on surface-level lexical overlap, ignore the rich temporal interdependence among visual, audio, and subtitle channels. To mitigate these limitations, we propose TV-RAG, a training-free architecture that couples temporal alignment with entropy-guided semantics to improve long-video reasoning. The framework contributes two main mechanisms: \emph{(i)} a time-decay retrieval module that injects explicit temporal offsets into the similarity computation, thereby ranking text queries according to their true multimedia context; and \emph{(ii)} an entropy-weighted key-frame sampler that selects evenly spaced, information-dense frames, reducing redundancy while preserving representativeness. By weaving these temporal and semantic signals together, TV-RAG realises a dual-level reasoning routine that can be grafted onto any LVLM without re-training or fine-tuning. The resulting system offers a lightweight, budget-friendly upgrade path and consistently surpasses most leading baselines across established long-video benchmarks such as Video-MME, MLVU, and LongVideoBench, confirming the effectiveness of our model. The code can be found at https://github.com/AI-Researcher-Team/TV-RAG.

[281] Multi-label Classification with Panoptic Context Aggregation Networks

Mingyuan Jiu, Hailong Zhu, Wenchuan Wei, Hichem Sahbi, Rongrong Ji, Mingliang Xu

Main category: cs.CV

TL;DR: PanCAN is a novel network that hierarchically integrates multi-order geometric contexts through cross-scale feature aggregation in Hilbert space, improving multi-label image classification by modeling cross-scale contextual interactions.

Details

Motivation: Current context modeling approaches focus on basic geometric relationships or localized features, neglecting cross-scale contextual interactions between objects, which limits their ability to capture complex scene understanding.

Method: Deep Panoptic Context Aggregation Network (PanCAN) learns multi-order neighborhood relationships at each scale by combining random walks with attention mechanism, cascades modules from different scales, selects salient anchors at finer scales, and dynamically fuses neighborhood features via attention in high-dimensional Hilbert space.

Result: PanCAN consistently achieves competitive results on NUS-WIDE, PASCAL VOC2007, and MS-COCO benchmarks, outperforming state-of-the-art techniques in both quantitative and qualitative evaluations, substantially improving multi-label classification performance.

Conclusion: PanCAN effectively enhances complex scene understanding by combining multi-order and cross-scale context-aware features through hierarchical integration of geometric contexts, demonstrating superior performance in multi-label image classification tasks.

Abstract: Context modeling is crucial for visual recognition, enabling highly discriminative image representations by integrating both intrinsic and extrinsic relationships between objects and labels in images. A limitation in current approaches is their focus on basic geometric relationships or localized features, often neglecting cross-scale contextual interactions between objects. This paper introduces the Deep Panoptic Context Aggregation Network (PanCAN), a novel approach that hierarchically integrates multi-order geometric contexts through cross-scale feature aggregation in a high-dimensional Hilbert space. Specifically, PanCAN learns multi-order neighborhood relationships at each scale by combining random walks with an attention mechanism. Modules from different scales are cascaded, where salient anchors at a finer scale are selected and their neighborhood features are dynamically fused via attention. This enables effective cross-scale modeling that significantly enhances complex scene understanding by combining multi-order and cross-scale context-aware features. Extensive multi-label classification experiments on NUS-WIDE, PASCAL VOC2007, and MS-COCO benchmarks demonstrate that PanCAN consistently achieves competitive results, outperforming state-of-the-art techniques in both quantitative and qualitative evaluations, thereby substantially improving multi-label classification performance.

[282] IdentityStory: Taming Your Identity-Preserving Generator for Human-Centric Story Generation

Donghao Zhou, Jingyu Lin, Guibao Shen, Quande Liu, Jialin Gao, Lihao Liu, Lan Du, Cunjian Chen, Chi-Wing Fu, Xiaowei Hu, Pheng-Ann Heng

Main category: cs.CV

TL;DR: IdentityStory is a framework for generating human-centric stories with consistent character identities across multiple sequential images, outperforming existing methods on face consistency and supporting multi-character combinations.

Details

Motivation: Current visual generative models can generate stories with consistent characters from text, but human-centric story generation faces challenges in maintaining detailed and diverse human face consistency and coordinating multiple characters across different images.

Method: The framework features two key components: 1) Iterative Identity Discovery for extracting cohesive character identities, and 2) Re-denoising Identity Injection for re-denoising images to inject identities while preserving desired context.

Result: Experiments on the ConsiStory-Human benchmark show that IdentityStory outperforms existing methods, particularly in face consistency, and supports multi-character combinations.

Conclusion: The framework demonstrates strong potential for applications such as infinite-length story generation and dynamic character composition in human-centric storytelling.

Abstract: Recent visual generative models enable story generation with consistent characters from text, but human-centric story generation faces additional challenges, such as maintaining detailed and diverse human face consistency and coordinating multiple characters across different images. This paper presents IdentityStory, a framework for human-centric story generation that ensures consistent character identity across multiple sequential images. By taming identity-preserving generators, the framework features two key components: Iterative Identity Discovery, which extracts cohesive character identities, and Re-denoising Identity Injection, which re-denoises images to inject identities while preserving desired context. Experiments on the ConsiStory-Human benchmark demonstrate that IdentityStory outperforms existing methods, particularly in face consistency, and supports multi-character combinations. The framework also shows strong potential for applications such as infinite-length story generation and dynamic character composition.

[283] Iterative Inference-time Scaling with Adaptive Frequency Steering for Image Super-Resolution

Hexin Zhang, Dong Li, Jie Huang, Bingzhou Wang, Xueyang Fu, Zhengjun Zha

Main category: cs.CV

TL;DR: IAFS is a training-free framework that uses iterative refinement and frequency-aware particle fusion to balance perceptual quality and structural fidelity in diffusion-based image super-resolution.

Details

Motivation: Existing diffusion-based SR methods struggle to guarantee both high-frequency perceptual quality and low-frequency structural fidelity simultaneously. Current inference-time scaling strategies are suboptimal - reward-driven optimization causes perceptual over-smoothing while optimal-path search loses structural consistency.

Method: IAFS combines iterative refinement and frequency-aware particle fusion. It progressively refines generated images through iterative correction of structural deviations while adaptively integrating high-frequency perceptual cues with low-frequency structural information for effective frequency fusion.

Result: Extensive experiments across multiple diffusion-based SR models show IAFS effectively resolves the perception-fidelity conflict, yielding consistently improved perceptual detail and structural accuracy, outperforming existing inference-time scaling methods.

Conclusion: IAFS provides an effective training-free solution to balance perceptual quality and structural fidelity in diffusion-based image super-resolution through iterative refinement and adaptive frequency steering.

Abstract: Diffusion models have become a leading paradigm for image super-resolution (SR), but existing methods struggle to guarantee both the high-frequency perceptual quality and the low-frequency structural fidelity of generated images. Although inference-time scaling can theoretically improve this trade-off by allocating more computation, existing strategies remain suboptimal: reward-driven particle optimization often causes perceptual over-smoothing, while optimal-path search tends to lose structural consistency. To overcome these difficulties, we propose Iterative Diffusion Inference-Time Scaling with Adaptive Frequency Steering (IAFS), a training-free framework that jointly leverages iterative refinement and frequency-aware particle fusion. IAFS addresses the challenge of balancing perceptual quality and structural fidelity by progressively refining the generated image through iterative correction of structural deviations. Simultaneously, it ensures effective frequency fusion by adaptively integrating high-frequency perceptual cues with low-frequency structural information, allowing for a more accurate and balanced reconstruction across different image details. Extensive experiments across multiple diffusion-based SR models show that IAFS effectively resolves the perception-fidelity conflict, yielding consistently improved perceptual detail and structural accuracy, and outperforming existing inference-time scaling methods.

[284] PurifyGen: A Risk-Discrimination and Semantic-Purification Model for Safe Text-to-Image Generation

Zongsheng Cao, Yangfan He, Anran Liu, Jun Xie, Feng Chen, Zepeng Wang

Main category: cs.CV

TL;DR: PurifyGen is a training-free, plug-and-play safety method for text-to-image diffusion models that purifies risky prompts through dual-space transformation without modifying model weights.

Details

Motivation: Current safety methods for text-to-image diffusion models (like text blacklisting or harmful content classification) have significant drawbacks: they can be easily circumvented, require extensive datasets, or need extra training. There's a need for a more robust, training-free approach that maintains model integrity while preventing unsafe content generation.

Method: PurifyGen uses a dual-stage strategy: 1) Token-level safety evaluation using complementary semantic distance to measure proximity between prompt tokens and toxic/clean concept embeddings, 2) For risky prompts, dual-space transformation that projects toxic-aligned embeddings into the null space of toxic concepts (removing harmful semantics) while aligning them into the range space of clean concepts (reinforcing safe semantics). The approach selectively replaces only risky token embeddings to minimize disruption.

Result: Extensive testing shows PurifyGen surpasses current methods in reducing unsafe content across five datasets and competes well with training-dependent approaches. It offers strong generalization to unseen prompts and models while maintaining the model’s original weights.

Conclusion: PurifyGen provides an effective, theoretically-grounded, plug-and-play solution for safe text-to-image generation that doesn’t require retraining, maintains model integrity, and offers better safety performance than existing methods.

Abstract: Recent advances in diffusion models have notably enhanced text-to-image (T2I) generation quality, but they also raise the risk of generating unsafe content. Traditional safety methods like text blacklisting or harmful content classification have significant drawbacks: they can be easily circumvented or require extensive datasets and extra training. To overcome these challenges, we introduce PurifyGen, a novel, training-free approach for safe T2I generation that retains the model’s original weights. PurifyGen introduces a dual-stage strategy for prompt purification. First, we evaluate the safety of each token in a prompt by computing its complementary semantic distance, which measures the semantic proximity between the prompt tokens and concept embeddings from predefined toxic and clean lists. This enables fine-grained prompt classification without explicit keyword matching or retraining. Tokens closer to toxic concepts are flagged as risky. Second, for risky prompts, we apply a dual-space transformation: we project toxic-aligned embeddings into the null space of the toxic concept matrix, effectively removing harmful semantic components, and simultaneously align them into the range space of clean concepts. This dual alignment purifies risky prompts by both subtracting unsafe semantics and reinforcing safe ones, while retaining the original intent and coherence. We further define a token-wise strategy to selectively replace only risky token embeddings, ensuring minimal disruption to safe content. PurifyGen offers a plug-and-play solution with theoretical grounding and strong generalization to unseen prompts and models. Extensive testing shows that PurifyGen surpasses current methods in reducing unsafe content across five datasets and competes well with training-dependent approaches. The code can refer to https://github.com/AI-Researcher-Team/PurifyGen.

[285] ThinkGen: Generalized Thinking for Visual Generation

Siyu Jiao, Yiheng Lin, Yujie Zhong, Qi She, Wei Zhou, Xiaohan Lan, Zilong Huang, Fei Yu, Yingchen Yu, Yunqing Zhao, Yao Zhao, Yunchao Wei

Main category: cs.CV

TL;DR: ThinkGen is a think-driven visual generation framework that leverages MLLM’s Chain-of-Thought reasoning for various generation tasks using a decoupled MLLM-DiT architecture with separable GRPO training.

Details

Motivation: Current CoT reasoning in MLLMs is effective for understanding tasks but limited for generation tasks due to scenario-specific mechanisms that hinder generalization and adaptation to diverse generative scenarios.

Method: ThinkGen uses a decoupled architecture with a pretrained MLLM and Diffusion Transformer (DiT). The MLLM generates tailored instructions based on user intent, and DiT produces images guided by these instructions. It employs SepGRPO training paradigm that alternates reinforcement learning between MLLM and DiT modules.

Result: Extensive experiments show ThinkGen achieves robust, state-of-the-art performance across multiple generation benchmarks.

Conclusion: ThinkGen successfully extends CoT reasoning to generation tasks through a flexible framework that enables joint training across diverse datasets, facilitating effective reasoning for a wide range of generative scenarios.

Abstract: Recent progress in Multimodal Large Language Models (MLLMs) demonstrates that Chain-of-Thought (CoT) reasoning enables systematic solutions to complex understanding tasks. However, its extension to generation tasks remains nascent and limited by scenario-specific mechanisms that hinder generalization and adaptation. In this work, we present ThinkGen, the first think-driven visual generation framework that explicitly leverages MLLM’s CoT reasoning in various generation scenarios. ThinkGen employs a decoupled architecture comprising a pretrained MLLM and a Diffusion Transformer (DiT), wherein the MLLM generates tailored instructions based on user intent, and DiT produces high-quality images guided by these instructions. We further propose a separable GRPO-based training paradigm (SepGRPO), alternating reinforcement learning between the MLLM and DiT modules. This flexible design enables joint training across diverse datasets, facilitating effective CoT reasoning for a wide range of generative scenarios. Extensive experiments demonstrate that ThinkGen achieves robust, state-of-the-art performance across multiple generation benchmarks. Code is available: https://github.com/jiaosiyuu/ThinkGen

[286] Image Denoising Using Global and Local Circulant Representation

Zhaoming Kong, Xiaowei Yang, Jiahuan Zhang

Main category: cs.CV

TL;DR: Haar-tSVD: A fast, one-step denoising method combining tensor SVD with Haar transform for efficient noise removal without learning local bases.

Details

Motivation: Addressing the growing demand for efficient image denoising due to proliferation of imaging devices and massive daily image data generation.

Method: Establishes theoretical PCA-Haar connection, uses unified tensor SVD projection with Haar transform to capture global/local correlations, includes adaptive noise estimation via eigenvalue analysis, and integrates DNNs for severe noise conditions.

Result: Experimental results demonstrate efficiency and effectiveness across various denoising datasets.

Conclusion: Haar-tSVD provides a fast, parallelizable plug-and-play denoiser balancing speed and performance, with code publicly available.

Abstract: The proliferation of imaging devices and countless image data generated every day impose an increasingly high demand on efficient and effective image denoising. In this paper, we establish a theoretical connection between principal component analysis (PCA) and the Haar transform under circulant representation, and present a computationally simple denoising algorithm. The proposed method, termed Haar-tSVD, exploits a unified tensor singular value decomposition (t-SVD) projection combined with Haar transform to efficiently capture global and local patch correlations. Haar-tSVD operates as a one-step, parallelizable plug-and-play denoiser that eliminates the need for learning local bases, thereby striking a balance between denoising speed and performance. Besides, an adaptive noise estimation scheme is introduced to improve robustness according to eigenvalue analysis of the circulant structure. To further enhance the performance under severe noise conditions, we integrate deep neural networks with Haar-tSVD based on the established Haar-PCA relationship. Experimental results on various denoising datasets demonstrate the efficiency and effectiveness of proposed method for noise removal. Our code is publicly available at https://github.com/ZhaomingKong/Haar-tSVD.

[287] ProGuard: Towards Proactive Multimodal Safeguard

Shaohan Yu, Lijun Li, Chenyang Si, Lu Sheng, Jing Shao

Main category: cs.CV

TL;DR: ProGuard is a vision-language proactive safety guard that identifies and describes out-of-distribution safety risks in multimodal content without requiring model adjustments, outperforming existing methods on OOD risk detection and description.

Details

Motivation: Existing defense methods are limited in addressing rapidly evolving multimodal safety risks from generative models, as they typically require model adjustments and are reactive rather than proactive in handling out-of-distribution threats.

Method: 1) Constructed modality-balanced dataset of 87K samples with binary safety labels and hierarchical risk categories; 2) Trained vision-language base model purely through reinforcement learning; 3) Introduced OOD safety category inference task with synonym-bank-based similarity reward to encourage concise descriptions of unseen unsafe categories.

Result: ProGuard achieves performance comparable to closed-source large models on binary safety classification, substantially outperforms existing open-source guard models on unsafe content categorization, and improves OOD risk detection by 52.6% and OOD risk description by 64.8%.

Conclusion: ProGuard demonstrates strong proactive moderation capabilities for multimodal safety risks, effectively addressing the limitations of traditional reactive approaches and providing a robust solution for identifying and describing out-of-distribution safety threats without requiring model adjustments.

Abstract: The rapid evolution of generative models has led to a continuous emergence of multimodal safety risks, exposing the limitations of existing defense methods. To address these challenges, we propose ProGuard, a vision-language proactive guard that identifies and describes out-of-distribution (OOD) safety risks without the need for model adjustments required by traditional reactive approaches. We first construct a modality-balanced dataset of 87K samples, each annotated with both binary safety labels and risk categories under a hierarchical multimodal safety taxonomy, effectively mitigating modality bias and ensuring consistent moderation across text, image, and text-image inputs. Based on this dataset, we train our vision-language base model purely through reinforcement learning (RL) to achieve efficient and concise reasoning. To approximate proactive safety scenarios in a controlled setting, we further introduce an OOD safety category inference task and augment the RL objective with a synonym-bank-based similarity reward that encourages the model to generate concise descriptions for unseen unsafe categories. Experimental results show that ProGuard achieves performance comparable to closed-source large models on binary safety classification, substantially outperforms existing open-source guard models on unsafe content categorization. Most notably, ProGuard delivers a strong proactive moderation ability, improving OOD risk detection by 52.6% and OOD risk description by 64.8%.

[288] LiveTalk: Real-Time Multimodal Interactive Video Diffusion via Improved On-Policy Distillation

Ethan Chern, Zhulin Hu, Bohao Tang, Jiadi Su, Steffi Chern, Zhijie Deng, Pengfei Liu

Main category: cs.CV

TL;DR: Real-time interactive video diffusion via improved distillation enables 20x faster inference with multimodal conditioning (text/image/audio) for seamless human-AI interaction.

Details

Motivation: Current diffusion models for video generation are too slow for real-time interaction due to bidirectional attention and iterative denoising. Existing distillation methods focus on text-to-video and don't handle multimodal conditioning well, making human-AI interaction unnatural and inefficient.

Method: Improved distillation recipe focusing on quality of condition inputs, initialization, and schedule for on-policy optimization to address visual artifacts in multimodal conditioning. Integrated with audio language models and long-form video inference technique (Anchor-Heavy Identity Sinks) to build LiveTalk system.

Result: Distilled model matches visual quality of full-step bidirectional baselines with 20x less inference cost/latency. LiveTalk system outperforms Sora2 and Veo3 in multi-turn video coherence and content quality, reducing response latency from 1-2 minutes to real-time generation.

Conclusion: The improved distillation approach enables real-time multimodal interactive video generation, bridging the gap for seamless human-AI interaction through the LiveTalk system.

Abstract: Real-time video generation via diffusion is essential for building general-purpose multimodal interactive AI systems. However, the simultaneous denoising of all video frames with bidirectional attention via an iterative process in diffusion models prevents real-time interaction. While existing distillation methods can make the model autoregressive and reduce sampling steps to mitigate this, they focus primarily on text-to-video generation, leaving the human-AI interaction unnatural and less efficient. This paper targets real-time interactive video diffusion conditioned on a multimodal context, including text, image, and audio, to bridge the gap. Given the observation that the leading on-policy distillation approach Self Forcing encounters challenges (visual artifacts like flickering, black frames, and quality degradation) with multimodal conditioning, we investigate an improved distillation recipe with emphasis on the quality of condition inputs as well as the initialization and schedule for the on-policy optimization. On benchmarks for multimodal-conditioned (audio, image, and text) avatar video generation including HDTF, AVSpeech, and CelebV-HQ, our distilled model matches the visual quality of the full-step, bidirectional baselines of similar or larger size with 20x less inference cost and latency. Further, we integrate our model with audio language models and long-form video inference technique Anchor-Heavy Identity Sinks to build LiveTalk, a real-time multimodal interactive avatar system. System-level evaluation on our curated multi-turn interaction benchmark shows LiveTalk outperforms state-of-the-art models (Sora2, Veo3) in multi-turn video coherence and content quality, while reducing response latency from 1 to 2 minutes to real-time generation, enabling seamless human-AI multimodal interaction.

[289] Same or Not? Enhancing Visual Perception in Vision-Language Models

Damiano Marsili, Aditya Mehta, Ryan Y. Lin, Georgia Gkioxari

Main category: cs.CV

TL;DR: TWIN introduces a large-scale dataset of 561k image-pair queries to enhance VLMs’ fine-grained perceptual abilities by training models to distinguish visually similar objects, improving performance on fine-grained recognition tasks without compromising general VQA capabilities.

Details

Motivation: Current vision-language models (VLMs) are coarse-grained, exhibit visual biases, and miss subtle visual details. Existing training corpora focus too much on general recognition tasks rather than fine-grained perception, limiting models' ability to distinguish nuanced visual differences.

Method: Created TWIN dataset with 561,000 image-pair queries where models must determine if two visually similar images show the same object. Introduced FGVQA benchmark suite of 12,000 queries repurposed from fine-grained recognition/retrieval datasets across multiple domains. Fine-tuned VLMs on TWIN to enhance perceptual precision.

Result: VLMs fine-tuned on TWIN show up to 19.3% improvement on FGVQA benchmark across unseen domains (art, animals, plants, landmarks). Performance gains scale with dataset size, and improvements don’t compromise general VQA benchmark performance.

Conclusion: TWIN dataset effectively enhances VLMs’ fine-grained perceptual abilities, demonstrating that scale is key to performance. The dataset can be easily integrated into open-source VLM training pipelines to advance perceptual precision in future models.

Abstract: Vision-language models (VLMs) excel at broad visual understanding but remain coarse-grained, exhibit visual biases, and miss subtle visual details. Existing training corpora reinforce this limitation by emphasizing general recognition (“Is it a cat or a dog?”) over fine-grained perception. To address this, we introduce a new training corpus and task designed to enhance the perceptual abilities of VLMs. TWIN is a large-scale dataset of 561,000 image-pair queries that task models to determine whether two visually similar images depict the same object, encouraging attention to nuanced visual cues. The dataset spans a diverse range of everyday objects across contexts, viewpoints, and appearances. Fine-tuning VLMs on TWIN yields notable gains in fine-grained recognition, even on unseen domains such as art, animals, plants, and landmarks. To quantify these gains, we introduce FGVQA, a benchmark suite of 12,000 queries that repurposes fine-grained recognition and retrieval datasets from multiple domains. While existing VLMs struggle on FGVQA, when fine-tuned on TWIN they improve by up to 19.3%, without compromising performance on general VQA benchmarks. Finally, our TWIN dataset scales favorably with object annotations, and our analysis shows that scale is key to performance. We envision TWIN as a drop-in addition to open-source VLM training corpora, advancing perceptual precision of future models. Project webpage: https://glab-caltech.github.io/twin/

[290] Detection Fire in Camera RGB-NIR

Nguyen Truong Khai, Luong Duc Vinh

Main category: cs.CV

TL;DR: This paper addresses fire detection challenges in infrared night vision cameras by introducing a new NIR dataset, a two-stage detection model combining YOLOv11 and EfficientNetV2-B0, and Patched-YOLO for improved RGB fire detection.

Details

Motivation: Current fire detection models struggle with accuracy in infrared night vision scenarios, particularly with false positives from artificial lights being misclassified as fire, and limitations in existing datasets.

Method: Three main approaches: 1) Enhanced NIR dataset with data augmentation, 2) Two-stage pipeline combining YOLOv11 for initial detection and EfficientNetV2-B0 for verification to reduce false positives, 3) Patched-YOLO using patch-based processing for improved small/distant object detection in RGB images.

Result: The proposed two-stage model achieves higher detection accuracy than previous methods (YOLOv7: 0.51 mAP50-95, RT-DETR: 0.65 mAP50-95, YOLOv9: 0.598 mAP50-95), particularly improving night-time fire detection and reducing false positives from artificial lights.

Conclusion: The paper successfully addresses fire detection challenges through dataset enhancement, a novel two-stage architecture that reduces false positives, and patch-based processing for improved small object detection, advancing infrared night vision fire detection capabilities.

Abstract: Improving the accuracy of fire detection using infrared night vision cameras remains a challenging task. Previous studies have reported strong performance with popular detection models. For example, YOLOv7 achieved an mAP50-95 of 0.51 using an input image size of 640 x 1280, RT-DETR reached an mAP50-95 of 0.65 with an image size of 640 x 640, and YOLOv9 obtained an mAP50-95 of 0.598 at the same resolution. Despite these results, limitations in dataset construction continue to cause issues, particularly the frequent misclassification of bright artificial lights as fire. This report presents three main contributions: an additional NIR dataset, a two-stage detection model, and Patched-YOLO. First, to address data scarcity, we explore and apply various data augmentation strategies for both the NIR dataset and the classification dataset. Second, to improve night-time fire detection accuracy while reducing false positives caused by artificial lights, we propose a two-stage pipeline combining YOLOv11 and EfficientNetV2-B0. The proposed approach achieves higher detection accuracy compared to previous methods, particularly for night-time fire detection. Third, to improve fire detection in RGB images, especially for small and distant objects, we introduce Patched-YOLO, which enhances the model’s detection capability through patch-based processing. Further details of these contributions are discussed in the following sections.

[291] Scalable Residual Feature Aggregation Framework with Hybrid Metaheuristic Optimization for Robust Early Pancreatic Neoplasm Detection in Multimodal CT Imaging

Janani Annur Thiruvengadam, Kiran Mayee Nabigaru, Anusha Kovi

Main category: cs.CV

TL;DR: Proposed SRFA framework for early pancreatic tumor detection achieves 96.23% accuracy using multimodal CT imaging with hybrid ViT-EfficientNet-B3 classification and metaheuristic feature optimization.

Details

Motivation: Early pancreatic tumor detection is challenging due to minimal contrast margins and large anatomical variations in CT scans, requiring systems that enhance subtle visual cues and generalize well across multimodal imaging data.

Method: SRFA framework with preprocessing, MAGRes-UNet segmentation, DenseNet-121 feature extraction, HHO-BA metaheuristic feature selection, hybrid ViT-EfficientNet-B3 classification, and dual SSA-GWO hyperparameter optimization.

Result: Achieved 96.23% accuracy, 95.58% F1-score, and 94.83% specificity, significantly outperforming traditional CNNs and contemporary transformer-based models.

Conclusion: SRFA framework shows strong potential as a useful instrument for early pancreatic tumor detection, demonstrating superior performance through integrated preprocessing, segmentation, feature optimization, and hybrid classification.

Abstract: The early detection of pancreatic neoplasm is a major clinical dilemma, and it is predominantly so because tumors are likely to occur with minimal contrast margins and a large spread anatomy-wide variation amongst patients on a CT scan. These complexities require to be addressed with an effective and scalable system that can assist in enhancing the salience of the subtle visual cues and provide a high level of the generalization on the multimodal imaging data. A Scalable Residual Feature Aggregation (SRFA) framework is proposed to be used to meet these conditions in this study. The framework integrates a pipeline of preprocessing followed by the segmentation using the MAGRes-UNet that is effective in making the pancreatic structures and isolating regions of interest more visible. DenseNet-121 performed with residual feature storage is used to extract features to allow deep hierarchical features to be aggregated without properties loss. To go further, hybrid HHO-BA metaheuristic feature selection strategy is used, which guarantees the best feature subset refinement. To be classified, the system is trained based on a new hybrid model that integrates the ability to pay attention on the world, which is the Vision Transformer (ViT) with the high representational efficiency of EfficientNet-B3. A dual optimization mechanism incorporating SSA and GWO is used to fine-tune hyperparameters to enhance greater robustness and less overfitting. Experimental results support the significant improvement in performance, with the suggested model reaching 96.23% accuracy, 95.58% F1-score and 94.83% specificity, the model is significantly better than the traditional CNNs and contemporary transformer-based models. Such results highlight the possibility of the SRFA framework as a useful instrument in the early detection of pancreatic tumors.

[292] Memorization in 3D Shape Generation: An Empirical Study

Shu Pu, Boya Zeng, Kaichen Zhou, Mengyu Wang, Zhuang Liu

Main category: cs.CV

TL;DR: The paper develops an evaluation framework to quantify memorization in 3D generative models, analyzes factors affecting memorization, and proposes strategies to reduce it without compromising generation quality.

Details

Motivation: To understand whether 3D generative models memorize training shapes, which could lead to data leakage and reduced diversity, and to develop ways to mitigate this issue.

Method: Designs an evaluation framework to quantify memorization, applies it to existing methods, and conducts controlled experiments with a latent vector-set diffusion model to study data and modeling factors.

Result: Memorization depends on data modality, increases with data diversity and finer-grained conditioning, peaks at moderate guidance scale, and can be mitigated by longer vector-sets and simple rotation augmentation.

Conclusion: The framework provides empirical understanding of memorization in 3D generative models and suggests effective strategies to reduce it without degrading generation quality.

Abstract: Generative models are increasingly used in 3D vision to synthesize novel shapes, yet it remains unclear whether their generation relies on memorizing training shapes. Understanding their memorization could help prevent training data leakage and improve the diversity of generated results. In this paper, we design an evaluation framework to quantify memorization in 3D generative models and study the influence of different data and modeling designs on memorization. We first apply our framework to quantify memorization in existing methods. Next, through controlled experiments with a latent vector-set (Vecset) diffusion model, we find that, on the data side, memorization depends on data modality, and increases with data diversity and finer-grained conditioning; on the modeling side, it peaks at a moderate guidance scale and can be mitigated by longer Vecsets and simple rotation augmentation. Together, our framework and analysis provide an empirical understanding of memorization in 3D generative models and suggest simple yet effective strategies to reduce it without degrading generation quality. Our code is available at https://github.com/zlab-princeton/3d_mem.

[293] Rethinking the Spatio-Temporal Alignment of End-to-End 3D Perception

Xiaoyu Li, Peidong Li, Xian Wu, Long Shi, Dedong Liu, Yitao Wu, Jiajia Fu, Dixiao Cui, Lijun Zhao, Lining Sun

Main category: cs.CV

TL;DR: HAT is a spatio-temporal alignment module for autonomous driving perception that adaptively decodes optimal alignment proposals from multiple motion hypotheses without supervision, improving 3D detection and tracking performance.

Details

Motivation: Existing methods use attention mechanisms with simplified motion models (like constant velocity) for cross-frame alignment, but variations in motion states and object features across categories make this suboptimal. There's a need for better explicit motion modeling in temporal perception.

Method: HAT uses multiple explicit motion models to generate spatial anchors and motion-aware feature proposals for historical instances. It performs multi-hypothesis decoding by incorporating semantic and motion cues from cached object queries to provide optimal alignment proposals for target frames.

Result: Achieves state-of-the-art tracking with 46.0% AMOTA on nuScenes test set with DETR3D. Improves perception accuracy (+1.3% mAP, +3.1% AMOTA) and reduces collision rate by 32% in end-to-end AD. Maintains robust performance when semantics are corrupted (nuScenes-C).

Conclusion: HAT demonstrates that adaptive multi-hypothesis alignment with explicit motion modeling significantly improves temporal perception in autonomous driving, enhancing both accuracy and safety while maintaining robustness to semantic corruption.

Abstract: Spatio-temporal alignment is crucial for temporal modeling of end-to-end (E2E) perception in autonomous driving (AD), providing valuable structural and textural prior information. Existing methods typically rely on the attention mechanism to align objects across frames, simplifying the motion model with a unified explicit physical model (constant velocity, etc.). These approaches prefer semantic features for implicit alignment, challenging the importance of explicit motion modeling in the traditional perception paradigm. However, variations in motion states and object features across categories and frames render this alignment suboptimal. To address this, we propose HAT, a spatio-temporal alignment module that allows each object to adaptively decode the optimal alignment proposal from multiple hypotheses without direct supervision. Specifically, HAT first utilizes multiple explicit motion models to generate spatial anchors and motion-aware feature proposals for historical instances. It then performs multi-hypothesis decoding by incorporating semantic and motion cues embedded in cached object queries, ultimately providing the optimal alignment proposal for the target frame. On nuScenes, HAT consistently improves 3D temporal detectors and trackers across diverse baselines. It achieves state-of-the-art tracking results with 46.0% AMOTA on the test set when paired with the DETR3D detector. In an object-centric E2E AD method, HAT enhances perception accuracy (+1.3% mAP, +3.1% AMOTA) and reduces the collision rate by 32%. When semantics are corrupted (nuScenes-C), the enhancement of motion modeling by HAT enables more robust perception and planning in the E2E AD.

[294] OmniAgent: Audio-Guided Active Perception Agent for Omnimodal Audio-Video Understanding

Keda Tao, Wenjie Du, Bohan Yu, Weiqiang Wang, Jian Liu, Huan Wang

Main category: cs.CV

TL;DR: OmniAgent is an audio-guided active perception agent that dynamically orchestrates specialized tools for fine-grained audio-visual reasoning, achieving state-of-the-art performance on audio-video understanding benchmarks.

Details

Motivation: Current omnimodal LLMs lack fine-grained cross-modal understanding and struggle with multimodal alignment. They rely on rigid workflows and dense frame-captioning rather than active inquiry.

Method: OmniAgent uses dynamic planning to autonomously orchestrate tool invocation on demand, employing a novel coarse-to-fine audio-guided perception paradigm that leverages audio cues to localize temporal events and guide reasoning.

Result: OmniAgent achieves state-of-the-art performance on three audio-video understanding benchmarks, surpassing leading open-source and proprietary models by 10-20% accuracy margins.

Conclusion: The paper demonstrates a paradigm shift from passive response generation to active multimodal inquiry, showing that dynamic, audio-guided active perception significantly improves fine-grained audio-visual reasoning capabilities.

Abstract: Omnimodal large language models have made significant strides in unifying audio and visual modalities; however, they often lack the fine-grained cross-modal understanding and have difficulty with multimodal alignment. To address these limitations, we introduce OmniAgent, a fully audio-guided active perception agent that dynamically orchestrates specialized tools to achieve more fine-grained audio-visual reasoning. Unlike previous works that rely on rigid, static workflows and dense frame-captioning, this paper demonstrates a paradigm shift from passive response generation to active multimodal inquiry. OmniAgent employs dynamic planning to autonomously orchestrate tool invocation on demand, strategically concentrating perceptual attention on task-relevant cues. Central to our approach is a novel coarse-to-fine audio-guided perception paradigm, which leverages audio cues to localize temporal events and guide subsequent reasoning. Extensive empirical evaluations on three audio-video understanding benchmarks demonstrate that OmniAgent achieves state-of-the-art performance, surpassing leading open-source and proprietary models by substantial margins of 10% - 20% accuracy.

[295] IDT: A Physically Grounded Transformer for Feed-Forward Multi-View Intrinsic Decomposition

Kang Du, Yirui Guan, Zeyu Wang

Main category: cs.CV

TL;DR: IDT is a feed-forward transformer framework for multi-view intrinsic image decomposition that produces view-consistent diffuse reflectance, diffuse shading, and specular shading without iterative sampling.

Details

Motivation: RGB images entangle material properties, illumination, and view-dependent effects, making intrinsic decomposition fundamental for visual understanding. While recent diffusion-based methods work for single-view decomposition, they struggle with multi-view settings, leading to severe view inconsistency.

Method: IDT uses transformer-based attention to jointly reason over multiple input images, producing view-consistent intrinsic factors in a single forward pass without iterative generative sampling. It adopts a physically grounded image formation model that explicitly decomposes images into diffuse reflectance, diffuse shading, and specular shading, separating Lambertian and non-Lambertian light transport.

Result: Experiments on synthetic and real-world datasets show IDT achieves cleaner diffuse reflectance, more coherent diffuse shading, better-isolated specular components, and substantially improved multi-view consistency compared to prior methods.

Conclusion: IDT provides an effective feed-forward transformer framework for multi-view intrinsic image decomposition that addresses view inconsistency issues while producing physically grounded, interpretable decompositions of material and illumination effects.

Abstract: Intrinsic image decomposition is fundamental for visual understanding, as RGB images entangle material properties, illumination, and view-dependent effects. Recent diffusion-based methods have achieved strong results for single-view intrinsic decomposition; however, extending these approaches to multi-view settings remains challenging, often leading to severe view inconsistency. We propose \textbf{Intrinsic Decomposition Transformer (IDT)}, a feed-forward framework for multi-view intrinsic image decomposition. By leveraging transformer-based attention to jointly reason over multiple input images, IDT produces view-consistent intrinsic factors in a single forward pass, without iterative generative sampling. IDT adopts a physically grounded image formation model that explicitly decomposes images into diffuse reflectance, diffuse shading, and specular shading. This structured factorization separates Lambertian and non-Lambertian light transport, enabling interpretable and controllable decomposition of material and illumination effects across views. Experiments on both synthetic and real-world datasets demonstrate that IDT achieves cleaner diffuse reflectance, more coherent diffuse shading, and better-isolated specular components, while substantially improving multi-view consistency compared to prior intrinsic decomposition methods.

[296] Diffusion Knows Transparency: Repurposing Video Diffusion for Transparent Object Depth and Normal Estimation

Shaocong Xu, Songlin Wei, Qizhe Wei, Zheng Geng, Hong Li, Licheng Shen, Qianpu Sun, Shu Han, Bin Ma, Bohan Li, Chongjie Ye, Yuhang Zheng, Nan Wang, Saining Zhang, Hao Zhao

Main category: cs.CV

TL;DR: DKT: A video diffusion model adapted via LoRA to predict temporally consistent depth and normals for transparent/reflective objects, achieving SOTA on transparency benchmarks and improving robotic grasping success.

Details

Motivation: Transparent objects break traditional depth perception assumptions (stereo, ToF, monocular depth) causing holes and unstable estimates. Video diffusion models already synthesize convincing transparent phenomena, suggesting they've internalized optical rules that can be leveraged for perception.

Method: 1) Created TransPhy3D synthetic corpus (11k sequences) with Blender/Cycles rendering RGB+depth+normals using physically-based ray tracing. 2) Adapted large video diffusion model with lightweight LoRA adapters to learn video-to-video translation for depth/normals. 3) Concatenated RGB and noisy depth latents in DiT backbone, co-trained on TransPhy3D and existing synthetic datasets for temporal consistency.

Result: DKT achieves zero-shot SOTA on transparency benchmarks (ClearPose, DREDS, TransPhy3D-Test), improves accuracy and temporal consistency over baselines, sets best video normal estimation on ClearPose. 1.3B version runs at ~0.17s/frame. Integrated into grasping stack, boosts success rates across translucent, reflective, and diffuse surfaces.

Conclusion: Demonstrates that “Diffusion knows transparency” - generative video priors can be efficiently repurposed (label-free) into robust, temporally coherent perception for challenging real-world manipulation tasks involving transparent/reflective objects.

Abstract: Transparent objects remain notoriously hard for perception systems: refraction, reflection and transmission break the assumptions behind stereo, ToF and purely discriminative monocular depth, causing holes and temporally unstable estimates. Our key observation is that modern video diffusion models already synthesize convincing transparent phenomena, suggesting they have internalized the optical rules. We build TransPhy3D, a synthetic video corpus of transparent/reflective scenes: 11k sequences rendered with Blender/Cycles. Scenes are assembled from a curated bank of category-rich static assets and shape-rich procedural assets paired with glass/plastic/metal materials. We render RGB + depth + normals with physically based ray tracing and OptiX denoising. Starting from a large video diffusion model, we learn a video-to-video translator for depth (and normals) via lightweight LoRA adapters. During training we concatenate RGB and (noisy) depth latents in the DiT backbone and co-train on TransPhy3D and existing frame-wise synthetic datasets, yielding temporally consistent predictions for arbitrary-length input videos. The resulting model, DKT, achieves zero-shot SOTA on real and synthetic video benchmarks involving transparency: ClearPose, DREDS (CatKnown/CatNovel), and TransPhy3D-Test. It improves accuracy and temporal consistency over strong image/video baselines, and a normal variant sets the best video normal estimation results on ClearPose. A compact 1.3B version runs at ~0.17 s/frame. Integrated into a grasping stack, DKT’s depth boosts success rates across translucent, reflective and diffuse surfaces, outperforming prior estimators. Together, these results support a broader claim: “Diffusion knows transparency.” Generative video priors can be repurposed, efficiently and label-free, into robust, temporally coherent perception for challenging real-world manipulation.

[297] Enhancing Vision-Language Model Reliability with Uncertainty-Guided Dropout Decoding

Yixiong Fang, Ziran Yang, Zhaorun Chen, Zhuokai Zhao, Jiawei Zhou

Main category: cs.CV

TL;DR: DROPOUT DECODING is an inference-time method that reduces hallucinations in large vision-language models by quantifying visual token uncertainty and selectively masking uncertain tokens during decoding.

Details

Motivation: Large vision-language models are prone to misinterpreting visual inputs, leading to hallucinations and unreliable outputs. There's a need for methods to improve visual perception reliability without retraining models.

Method: The method measures visual token uncertainty by projecting tokens to text space and decomposing into aleatoric/epistemic components. It uses uncertainty-guided token dropout (applying dropout to input visual tokens during inference) and aggregates predictions from an ensemble of masked decoding contexts.

Result: Evaluations on CHAIR, THRONE, and MMBench benchmarks show significant reduction in object hallucinations and improved reliability and quality of LVLM outputs across diverse visual contexts.

Conclusion: DROPOUT DECODING effectively mitigates visual misinterpretations in LVLMs through inference-time uncertainty quantification and selective token masking, offering a practical approach to enhance model reliability without retraining.

Abstract: Large vision-language models (LVLMs) excel at multimodal tasks but are prone to misinterpreting visual inputs, often resulting in hallucinations and unreliable outputs. We present DROPOUT DECODING, a novel inference-time approach that quantifies the uncertainty of visual tokens and selectively masks uncertain tokens to improve decoding. Our method measures the uncertainty of each visual token by projecting it onto the text space and decomposing it into aleatoric and epistemic components. Specifically, we focus on epistemic uncertainty, which captures perception-related errors more effectively. Inspired by dropout regularization, we introduce uncertainty-guided token dropout, which applies the dropout principle to input visual tokens instead of model parameters, and during inference rather than training. By aggregating predictions from an ensemble of masked decoding contexts, we can robustly mitigate errors arising from visual token misinterpretations. Evaluations on benchmarks including CHAIR, THRONE, and MMBench demonstrate that DROPOUT DECODING significantly reduces object hallucinations (OH) and enhances both reliability and quality of LVLM outputs across diverse visual contexts. Code is released at https://github.com/kigb/DropoutDecoding.

[298] Stream-DiffVSR: Low-Latency Streamable Video Super-Resolution via Auto-Regressive Diffusion

Hau-Shiang Shiu, Chin-Yang Lin, Zhixiang Wang, Chi-Wei Hsiao, Po-Fan Yu, Yu-Chih Chen, Yu-Lun Liu

Main category: cs.CV

TL;DR: Stream-DiffVSR is a novel diffusion-based video super-resolution framework designed for efficient online processing with causal conditioning, achieving state-of-the-art perceptual quality while dramatically reducing latency from over 4600 seconds to just 0.328 seconds per 720p frame.

Details

Motivation: Existing diffusion-based VSR methods produce high perceptual quality but are impractical for latency-sensitive applications due to their reliance on future frames and expensive multi-step denoising processes, making them unsuitable for online deployment.

Method: The framework combines three key components: 1) a four-step distilled denoiser for fast inference, 2) an Auto-regressive Temporal Guidance (ARTG) module that injects motion-aligned cues during latent denoising using only past frames, and 3) a lightweight temporal-aware decoder with Temporal Processor Module (TPM) that enhances detail and temporal coherence.

Result: Stream-DiffVSR processes 720p frames in 0.328 seconds on an RTX4090 GPU, significantly outperforming prior diffusion-based methods. Compared to online SOTA TMP, it improves perceptual quality (LPIPS +0.095) while reducing latency by over 130x, achieving the lowest latency reported for diffusion-based VSR.

Conclusion: Stream-DiffVSR represents the first diffusion-based VSR method suitable for low-latency online deployment, reducing initial delay from over 4600 seconds to 0.328 seconds while maintaining strong perceptual quality through its causally conditioned architecture and efficient components.

Abstract: Diffusion-based video super-resolution (VSR) methods achieve strong perceptual quality but remain impractical for latency-sensitive settings due to reliance on future frames and expensive multi-step denoising. We propose Stream-DiffVSR, a causally conditioned diffusion framework for efficient online VSR. Operating strictly on past frames, it combines a four-step distilled denoiser for fast inference, an Auto-regressive Temporal Guidance (ARTG) module that injects motion-aligned cues during latent denoising, and a lightweight temporal-aware decoder with a Temporal Processor Module (TPM) that enhances detail and temporal coherence. Stream-DiffVSR processes 720p frames in 0.328 seconds on an RTX4090 GPU and significantly outperforms prior diffusion-based methods. Compared with the online SOTA TMP, it boosts perceptual quality (LPIPS +0.095) while reducing latency by over 130x. Stream-DiffVSR achieves the lowest latency reported for diffusion-based VSR, reducing initial delay from over 4600 seconds to 0.328 seconds, thereby making it the first diffusion VSR method suitable for low-latency online deployment. Project page: https://jamichss.github.io/stream-diffvsr-project-page/

[299] Robust Polyp Detection and Diagnosis through Compositional Prompt-Guided Diffusion Models

Jia Yu, Yan Zhu, Peiyao Fu, Tianyi Chen, Junbo Huang, Quanlin Li, Pinghong Zhou, Zhihua Wang, Fei Wu, Shuo Wang, Xian Yang

Main category: cs.CV

TL;DR: PSDM is a progressive spectrum diffusion model that uses compositional prompts from diverse clinical annotations to generate synthetic polyp images, improving model generalization for colorectal cancer screening.

Details

Motivation: Deep learning models for colorectal cancer screening struggle with generalization across diverse clinical environments, especially with out-of-distribution data. Multi-center datasets are expensive to collect, and traditional data augmentation fails to capture medical image complexity. Current diffusion models rely only on segmentation masks, missing full clinical context.

Method: Progressive Spectrum Diffusion Model (PSDM) integrates diverse clinical annotations (segmentation masks, bounding boxes, colonoscopy reports) by transforming them into compositional prompts organized into coarse and fine components. This allows capturing both broad spatial structures and fine details to generate clinically accurate synthetic images for data augmentation.

Result: PSDM significantly improves polyp detection, classification, and segmentation. On the PolypGen dataset, it increases F1 score by 2.12% and mean average precision by 3.09%, demonstrating superior performance in out-of-distribution scenarios and enhanced generalization.

Conclusion: PSDM effectively addresses generalization challenges in colorectal cancer screening by generating clinically accurate synthetic images using comprehensive clinical context, leading to improved model performance and better handling of out-of-distribution data.

Abstract: Colorectal cancer (CRC) is a significant global health concern, and early detection through screening plays a critical role in reducing mortality. While deep learning models have shown promise in improving polyp detection, classification, and segmentation, their generalization across diverse clinical environments, particularly with out-of-distribution (OOD) data, remains a challenge. Multi-center datasets like PolypGen have been developed to address these issues, but their collection is costly and time-consuming. Traditional data augmentation techniques provide limited variability, failing to capture the complexity of medical images. Diffusion models have emerged as a promising solution for generating synthetic polyp images, but the image generation process in current models mainly relies on segmentation masks as the condition, limiting their ability to capture the full clinical context. To overcome these limitations, we propose a Progressive Spectrum Diffusion Model (PSDM) that integrates diverse clinical annotations-such as segmentation masks, bounding boxes, and colonoscopy reports-by transforming them into compositional prompts. These prompts are organized into coarse and fine components, allowing the model to capture both broad spatial structures and fine details, generating clinically accurate synthetic images. By augmenting training data with PSDM-generated samples, our model significantly improves polyp detection, classification, and segmentation. For instance, on the PolypGen dataset, PSDM increases the F1 score by 2.12% and the mean average precision by 3.09%, demonstrating superior performance in OOD scenarios and enhanced generalization.

[300] RadMamba: Efficient Human Activity Recognition through Radar-based Micro-Doppler-Oriented Mamba State-Space Model

Yizhuo Wu, Francesco Fioranelli, Chang Gao

Main category: cs.CV

TL;DR: RadMamba: A lightweight Mamba-based State Space Model for radar human activity recognition that achieves high accuracy with dramatically fewer parameters than existing methods.

Details

Motivation: Radar-based HAR is privacy-preserving and robust, but current CNN/RNN/ViT/SSM solutions are computationally intensive for on-sensor deployment in distributed radar systems with strict compute, latency, and energy constraints.

Method: RadMamba combines three key techniques: (1) channel fusion with downsampling, (2) Doppler-aligned segmentation preserving physical continuity of Doppler over time, and (3) convolutional token projections to capture Doppler-span variations while retaining temporal-Doppler structure.

Result: On CW radar dataset: matches 99.8% accuracy with only 1/400 parameters of prior SSM model. On FMCW radar dataset: competitive 92.0% accuracy with ~1/10 parameters. On continuous FMCW dataset: surpasses methods by at least 3% using only 6.7k parameters.

Conclusion: RadMamba demonstrates that parameter-efficient Mamba SSMs can achieve state-of-the-art radar HAR performance while being suitable for on-sensor deployment in resource-constrained distributed radar systems.

Abstract: Radar-based Human Activity Recognition (HAR) is an attractive alternative to wearables and cameras because it preserves privacy, and is contactless and robust to occlusions. However, dominant Convolutional Neural Network (CNN)- and Recurrent Neural Network (RNN)-based solutions are computationally intensive at deployment, and recent lightweight Vision Transformer (ViT) and State Space Model (SSM) variants still exhibit substantial complexity. In this paper, we present RadMamba, a parameter-efficient, micro-Doppler-oriented Mamba SSM tailored to radar HAR under on-sensor compute, latency, and energy constraints typical of distributed radar systems. RadMamba combines (i) channel fusion with downsampling, (ii) Doppler-aligned segmentation that preserves the physical continuity of Doppler over time, and (iii) convolutional token projections that better capture Doppler-span variations, thereby retaining temporal-Doppler structure while reducing the number of Floating-point Operations per Inference (#FLOP/Inf.). Evaluated across three datasets with different radars and types of activities, RadMamba matches the prior best 99.8% accuracy of a recent SSM-based model on the Continuous Wave (CW) radar dataset, while requiring only 1/400 of its parameters. On a dataset of non-continuous activities with Frequency Modulated Continuous Wave (FMCW) radar, RadMamba remains competitive with leading 92.0% results using about 1/10 of the parameters, and on a continuous FMCW radar dataset it surpasses methods with far more parameters by at least 3%, using only 6.7k parameters. Code: https://github.com/lab-emi/AIRHAR.

[301] Enhance Multi-Scale Spatial-Temporal Coherence for Configurable Video Anomaly Detection

Kai Cheng, Xinzhe Li, Lijuan Che

Main category: cs.CV

TL;DR: Proposes a configurable Video Anomaly Detection (VAD) system with flexible solutions to avoid retraining from scratch when detection demands change, introduces a compatible dataset, and develops a multi-scale spatial-temporal coherence module for improved accuracy.

Details

Motivation: Traditional VAD methods require complete retraining when detection demands change slightly, wasting computational resources. Anomalies are ambiguous and unbounded, with different detection needs even within the same scenario.

Method: 1) Configurable VAD framework with flexible solutions to adapt to changing detection demands without retraining from scratch. 2) Multi-scale spatial-temporal coherence module to capture continuous appearance and motion patterns, with dynamic adjustment capability.

Result: Experiments demonstrate effective modeling of spatial-temporal coherence and superior configurable ability compared to previous methods.

Conclusion: The proposed configurable VAD system addresses resource inefficiency in adapting to changing detection demands while improving accuracy through multi-scale spatial-temporal coherence modeling.

Abstract: The development of unsupervised Video Anomaly Detection (VAD) relies on technologies in the field of signal processing. Since the anomaly is quite ambiguous and unbounded, different detection demands may often be raised even in one scenario. Thus, we propose to design the configurable VAD with flexible solutions targeting to solve the issue that previous methods have to train their models from scratch and waste resources when detection demands even change slightly. Moreover, we also design a dataset with good compatibility to evaluate the VAD performance when changes happen in detection demands. Besides, videos contain important information regarding continuous changes in the object’s appearance and motion. Thus, we also propose a module to establish the multi-scale spatial-temporal coherence, which improves the accuracy and has the ability to dynamically adjust and accurately capture spatial-temporal normal patterns. Experiments show that our method not only models coherence effectively but also has better configurable ability.

[302] CogStream: Context-guided Streaming Video Question Answering

Zicheng Zhao, Kangyu Wang, Shijie Li, Rui Qian, Weiyao Lin, Huabin Liu

Main category: cs.CV

TL;DR: CogStream introduces a new streaming video reasoning task requiring models to identify relevant historical context from video streams, with a new dataset and baseline model CogReasoner that uses visual compression and dialogue retrieval.

Details

Motivation: Current Vid-LLMs struggle with streaming video reasoning due to computational burden from processing all historical context and distraction from irrelevant information. Real-world streaming scenarios require selective context usage.

Method: Introduces CogStream task and dataset with hierarchical QA pairs via semi-automatic pipeline. Baseline model CogReasoner uses visual stream compression and historical dialogue retrieval to identify relevant context.

Result: Extensive experiments demonstrate the effectiveness of CogReasoner in handling the CogStream task, showing improved performance in streaming video reasoning with selective context usage.

Conclusion: CogStream addresses key limitations in streaming video reasoning, providing a challenging benchmark and effective baseline approach for real-world video understanding applications.

Abstract: Despite advancements in Video Large Language Models (Vid-LLMs) improving multimodal understanding, challenges persist in streaming video reasoning due to its reliance on contextual information. Existing paradigms feed all available historical contextual information into Vid-LLMs, resulting in a significant computational burden for visual data processing. Furthermore, the inclusion of irrelevant context distracts models from key details. This paper introduces a challenging task called Context-guided Streaming Video Reasoning (CogStream), which simulates real-world streaming video scenarios, requiring models to identify the most relevant historical contextual information to deduce answers for questions about the current stream. To support CogStream, we present a densely annotated dataset featuring extensive and hierarchical question-answer pairs, generated by a semi-automatic pipeline. Additionally, we present CogReasoner as a baseline model. It effectively tackles this task by leveraging visual stream compression and historical dialogue retrieval. Extensive experiments prove the effectiveness of this method.

[303] A Survey on Generative Modeling with Limited Data, Few Shots, and Zero Shot

Milad Abdollahzadeh, Guimeng Liu, Touba Malekzadeh, Christopher T. H. Teo, Keshigeyan Chandrasegaran, Ngai-Man Cheung

Main category: cs.CV

TL;DR: Survey paper on generative modeling under data constraints (GM-DC), covering limited-data, few-shot, and zero-shot settings with taxonomies for tasks and methods, reviewing 230+ papers.

Details

Motivation: Real-world applications in medicine, satellite imaging, and artistic domains often face limited data availability and strict constraints, making conventional generative models (GANs, diffusion models) impractical for these scenarios.

Method: Introduces two novel taxonomies: one for GM-DC tasks (unconditional vs. conditional generation, cross-domain adaptation, subject-driven modeling) and another for methodological approaches (transfer learning, data augmentation, meta-learning, frequency-aware modeling). Reviews over 230 papers and analyzes task-approach-method interactions using Sankey diagrams.

Result: Comprehensive survey providing unified perspective on key challenges (overfitting, frequency bias, incompatible knowledge transfer), systematic analysis of the field, and identification of promising research directions.

Conclusion: Provides timely roadmap for researchers and practitioners, highlighting future directions including adaptation of foundation models, holistic evaluation frameworks, and data-centric strategies for sample selection.

Abstract: Generative modeling in machine learning aims to synthesize new data samples that are statistically similar to those observed during training. While conventional generative models such as GANs and diffusion models typically assume access to large and diverse datasets, many real-world applications (e.g. in medicine, satellite imaging, and artistic domains) operate under limited data availability and strict constraints. In this survey, we examine Generative Modeling under Data Constraint (GM-DC), which includes limited-data, few-shot, and zero-shot settings. We present a unified perspective on the key challenges in GM-DC, including overfitting, frequency bias, and incompatible knowledge transfer, and discuss how these issues impact model performance. To systematically analyze this growing field, we introduce two novel taxonomies: one categorizing GM-DC tasks (e.g. unconditional vs. conditional generation, cross-domain adaptation, and subject-driven modeling), and another organizing methodological approaches (e.g. transfer learning, data augmentation, meta-learning, and frequency-aware modeling). Our study reviews over 230 papers, offering a comprehensive view across generative model types and constraint scenarios. We further analyze task-approach-method interactions using a Sankey diagram and highlight promising directions for future work, including adaptation of foundation models, holistic evaluation frameworks, and data-centric strategies for sample selection. This survey provides a timely and practical roadmap for researchers and practitioners aiming to advance generative modeling under limited data. Project website: https://sutd-visual-computing-group.github.io/gmdc-survey/.

[304] Investigation of the Impact of Synthetic Training Data in the Industrial Application of Terminal Strip Object Detection

Nico Baumgart, Markus Lange-Hegermann, Mike Mücke

Main category: cs.CV

TL;DR: Researchers investigate sim-to-real generalization for complex industrial object detection using synthetic data from 3D CAD models, focusing on terminal strip detection with minimal implementation effort.

Details

Motivation: Industrial manufacturing faces high costs for collecting and annotating large-scale training datasets for visual inspection. While synthetic data from 3D CAD models is a common solution, its effectiveness on complex industrial tasks with densely arranged and similar objects remains unclear.

Method: Created an image synthesis pipeline combining randomization and domain knowledge to generate realistic synthetic data from 3D CAD models. Built a dataset of 30,000 synthetic images and 300 manually annotated real images of terminal strips. Evaluated sim-to-real generalization using standard object detectors with fully synthetic training.

Result: The transformer-based DINO model achieved the best performance with 98.40% mean average precision on the real test set, demonstrating that the pipeline enables high-quality detections in complex industrial environments from existing CAD data with manageable synthesis effort.

Conclusion: The proposed approach successfully addresses the data scarcity problem in industrial visual inspection by creating realistic synthetic training data from CAD models, achieving excellent sim-to-real generalization for complex object detection tasks with minimal implementation effort.

Abstract: In industrial manufacturing, deploying deep learning models for visual inspection is mostly hindered by the high and often intractable cost of collecting and annotating large-scale training datasets. While image synthesis from 3D CAD models is a common solution, the individual techniques of domain and rendering randomization to create rich synthetic training datasets have been well studied mainly in simple domains. Hence, their effectiveness on complex industrial tasks with densely arranged and similar objects remains unclear. In this paper, we investigate the sim-to-real generalization performance of standard object detectors on the complex industrial application of terminal strip object detection, carefully combining randomization and domain knowledge. We describe step-by-step the creation of our image synthesis pipeline that achieves high realism with minimal implementation effort and explain how this approach could be transferred to other industrial settings. Moreover, we created a dataset comprising 30.000 synthetic images and 300 manually annotated real images of terminal strips, which is publicly available for reference and future research. To provide a baseline as a lower bound of the expectable performance in these challenging industrial parts detection tasks, we show the sim-to-real generalization performance of standard object detectors on our dataset based on a fully synthetic training. While all considered models behave similarly, the transformer-based DINO model achieves the best score with 98.40 % mean average precision on the real test set, demonstrating that our pipeline enables high quality detections in complex industrial environments from existing CAD data and with a manageable image synthesis effort.

[305] LidarDM: Generative LiDAR Simulation in a Generated World

Vlas Zyrianov, Henry Che, Zhijian Liu, Shenlong Wang

Main category: cs.CV

TL;DR: LidarDM is a novel LiDAR generative model that produces realistic, layout-aware, physically plausible, and temporally coherent LiDAR videos with two key capabilities: scenario-guided generation for autonomous driving simulations and 4D point cloud generation.

Details

Motivation: The motivation is to create a LiDAR generative model that can produce realistic and temporally coherent LiDAR data for autonomous driving applications, addressing the need for high-quality simulation data for training and testing perception models.

Method: The method uses an integrated 4D world generation framework with latent diffusion models to generate 3D scenes, combines them with dynamic actors to form 4D worlds, and then produces realistic sensory observations within this virtual environment.

Result: The approach outperforms competing algorithms in realism, temporal coherency, and layout consistency, and can be used as a generative world model simulator for training and testing perception models.

Conclusion: LidarDM represents a significant advancement in LiDAR generative modeling with its unique capabilities for scenario-guided generation and 4D point cloud synthesis, offering valuable applications for autonomous driving simulation and perception model development.

Abstract: We present LidarDM, a novel LiDAR generative model capable of producing realistic, layout-aware, physically plausible, and temporally coherent LiDAR videos. LidarDM stands out with two unprecedented capabilities in LiDAR generative modeling: (i) LiDAR generation guided by driving scenarios, offering significant potential for autonomous driving simulations, and (ii) 4D LiDAR point cloud generation, enabling the creation of realistic and temporally coherent sequences. At the heart of our model is a novel integrated 4D world generation framework. Specifically, we employ latent diffusion models to generate the 3D scene, combine it with dynamic actors to form the underlying 4D world, and subsequently produce realistic sensory observations within this virtual environment. Our experiments indicate that our approach outperforms competing algorithms in realism, temporal coherency, and layout consistency. We additionally show that LidarDM can be used as a generative world model simulator for training and testing perception models.

[306] View Selection for 3D Captioning via Diffusion Ranking

Tiange Luo, Justin Johnson, Honglak Lee

Main category: cs.CV

TL;DR: DiffuRank addresses hallucination in 3D object captioning by ranking rendered views based on alignment with 3D objects using a text-to-3D model, improving caption quality and enabling dataset expansion.

Details

Motivation: Existing scalable 3D-text annotation methods like Cap3D generate hallucinated captions due to atypical rendered views that deviate from image captioning models' training data, compromising caption quality.

Method: DiffuRank leverages a pre-trained text-to-3D model to assess alignment between 3D objects and their 2D rendered views, ranks views by alignment quality, and feeds top-ranked views to GPT4-Vision for captioning.

Result: Corrected 200k captions in Cap3D dataset and extended to 1 million captions across Objaverse and Objaverse-XL datasets. Also outperformed CLIP in Visual Question Answering tasks when applied to text-to-image models.

Conclusion: DiffuRank effectively mitigates hallucination in 3D object captioning by selecting representative views, improving caption accuracy and detail while demonstrating adaptability to other vision-language tasks.

Abstract: Scalable annotation approaches are crucial for constructing extensive 3D-text datasets, facilitating a broader range of applications. However, existing methods sometimes lead to the generation of hallucinated captions, compromising caption quality. This paper explores the issue of hallucination in 3D object captioning, with a focus on Cap3D method, which renders 3D objects into 2D views for captioning using pre-trained models. We pinpoint a major challenge: certain rendered views of 3D objects are atypical, deviating from the training data of standard image captioning models and causing hallucinations. To tackle this, we present DiffuRank, a method that leverages a pre-trained text-to-3D model to assess the alignment between 3D objects and their 2D rendered views, where the view with high alignment closely represent the object’s characteristics. By ranking all rendered views and feeding the top-ranked ones into GPT4-Vision, we enhance the accuracy and detail of captions, enabling the correction of 200k captions in the Cap3D dataset and extending it to 1 million captions across Objaverse and Objaverse-XL datasets. Additionally, we showcase the adaptability of DiffuRank by applying it to pre-trained text-to-image models for a Visual Question Answering task, where it outperforms the CLIP model.

[307] Text-Driven Weakly Supervised OCT Lesion Segmentation with Structural Guidance

Jiaqi Yang, Nitish Mehta, Xiaoling Hu, Chao Chen, Chia-Ling Tsai

Main category: cs.CV

TL;DR: Novel weakly supervised semantic segmentation framework for OCT lesion segmentation using only image-level labels, combining structural visual processing with text-driven guidance from pretrained models to generate high-quality pseudo labels.

Details

Motivation: Pixel-level annotation for OCT image segmentation is labor-intensive and limits scalability. Weak supervision with image-level labels reduces annotation burden but carries limited information. Need to improve segmentation quality while maintaining low annotation cost.

Method: Two visual processing modules: one processes original OCT images, another uses layer segmentations augmented with anomalous signals. Textual guidance from pretrained models includes label-derived descriptions (local semantics) and domain-agnostic synthetic descriptions (spatial/relational semantics). Multi-modal fusion aligns semantic meaning with structural relevance.

Result: State-of-the-art results on three OCT datasets, demonstrating improved lesion localization and segmentation performance compared to existing methods.

Conclusion: The framework effectively integrates structural and text-driven guidance to produce high-quality pseudo labels from image-level supervision, advancing diagnostic accuracy and efficiency in medical imaging with reduced annotation burden.

Abstract: Accurate segmentation of Optical Coherence Tomography (OCT) images is crucial for diagnosing and monitoring retinal diseases. However, the labor-intensive nature of pixel-level annotation limits the scalability of supervised learning for large datasets. Weakly Supervised Semantic Segmentation (WSSS) offers a promising alternative by using weaker forms of supervision, such as image-level labels, to reduce the annotation burden. Despite its advantages, weak supervision inherently carries limited information. We propose a novel WSSS framework with only image-level labels for OCT lesion segmentation that integrates structural and text-driven guidance to produce high-quality, pixel-level pseudo labels. The framework employs two visual processing modules: one that processes the original OCT images and another that operates on layer segmentations augmented with anomalous signals, enabling the model to associate lesions with their corresponding anatomical layers. Complementing these visual cues, we leverage large-scale pretrained models to provide two forms of textual guidance: label-derived descriptions that encode local semantics, and domain-agnostic synthetic descriptions that, although expressed in natural image terms, capture spatial and relational semantics useful for generating globally consistent representations. By fusing these visual and textual features in a multi-modal framework, our method aligns semantic meaning with structural relevance, thereby improving lesion localization and segmentation performance. Experiments on three OCT datasets demonstrate state-of-the-art results, highlighting its potential to advance diagnostic accuracy and efficiency in medical imaging.

[308] ForgerySleuth: Empowering Multimodal Large Language Models for Image Manipulation Detection

Zhihao Sun, Haoran Jiang, Haoran Chen, Yixin Cao, Xipeng Qiu, Zuxuan Wu, Yu-Gang Jiang

Main category: cs.CV

TL;DR: ForgerySleuth uses multimodal LLMs for image manipulation detection by fusing comprehensive clues and generating segmentation outputs, addressing hallucination issues through a new dataset and data engine.

Details

Motivation: Multimodal LLMs have shown promise for various tasks but their potential in image manipulation detection remains unexplored. Current M-LLMs suffer from hallucinations and overthinking when directly applied to IMD tasks.

Method: Proposes ForgerySleuth which leverages M-LLMs for comprehensive clue fusion and generates segmentation outputs indicating tampered regions. Uses Chain-of-Clues prompt to construct ForgeryAnalysis dataset with analysis/reasoning text. Introduces data engine for larger-scale pre-training dataset.

Result: Extensive experiments demonstrate effectiveness of ForgeryAnalysis dataset and show ForgerySleuth significantly outperforms existing methods in generalization, robustness, and explainability.

Conclusion: ForgerySleuth successfully applies M-LLMs to image manipulation detection, overcoming hallucination issues through comprehensive clue fusion and specialized dataset construction, achieving superior performance across key metrics.

Abstract: Multimodal large language models have unlocked new possibilities for various multimodal tasks. However, their potential in image manipulation detection remains unexplored. When directly applied to the IMD task, M-LLMs often produce reasoning texts that suffer from hallucinations and overthinking. To address this, we propose ForgerySleuth, which leverages M-LLMs to perform comprehensive clue fusion and generate segmentation outputs indicating specific regions that are tampered with. Moreover, we construct the ForgeryAnalysis dataset through the Chain-of-Clues prompt, which includes analysis and reasoning text to upgrade the image manipulation detection task. A data engine is also introduced to build a larger-scale dataset for the pre-training phase. Our extensive experiments demonstrate the effectiveness of ForgeryAnalysis and show that ForgerySleuth significantly outperforms existing methods in generalization, robustness, and explainability.

[309] Age-Defying Face Recognition with Transformer-Enhanced Loss

Pritesh Prakash, Anoop Kumar Rai

Main category: cs.CV

TL;DR: Transformer-metric loss combines transformer and metric losses for age-invariant face recognition, achieving state-of-the-art results on LFW and age-variant datasets.

Details

Motivation: Aging significantly challenges face recognition due to changes in skin texture and tone over time, making long-term identification difficult. Transformers can preserve sequential spatial relationships affected by aging, offering potential for more robust age-invariant recognition.

Method: Proposes transformer-metric loss that integrates transformer-loss with standard metric-loss functions. Uses transformer encoder on contextual vectors from CNN’s final convolution layer, arranged as sequential vectors to capture aging patterns. The combined loss enables learning more age-invariant features.

Result: Achieves state-of-the-art results on LFW and age-variant datasets (CA-LFW and AgeDB). The transformer-metric loss configuration outperforms standard approaches.

Conclusion: Transformer networks can effectively serve as additive loss functions in face recognition, particularly for age-invariant applications. This research expands transformers’ role in computer vision and opens new possibilities for using transformers as loss functions.

Abstract: Aging presents a significant challenge in face recognition, as changes in skin texture and tone can alter facial features over time, making it particularly difficult to compare images of the same individual taken years apart, such as in long-term identification scenarios. Transformer networks have the strength to preserve sequential spatial relationships caused by aging effect. This paper presents a technique for loss evaluation that uses a transformer network as an additive loss in the face recognition domain. The standard metric loss function typically takes the final embedding of the main CNN backbone as its input. Here, we employ a transformer-metric loss, a combined approach that integrates both transformer-loss and metric-loss. This research intends to analyze the transformer behavior on the convolution output when the CNN outcome is arranged in a sequential vector. These sequential vectors have the potential to overcome the texture or regional structure referred to as wrinkles or sagging skin affected by aging. The transformer encoder takes input from the contextual vectors obtained from the final convolution layer of the network. The learned features can be more age-invariant, complementing the discriminative power of the standard metric loss embedding. With this technique, we use transformer loss with various base metric-loss functions to evaluate the effect of the combined loss functions. We observe that such a configuration allows the network to achieve SoTA results in LFW and age-variant datasets (CA-LFW and AgeDB). This research expands the role of transformers in the machine vision domain and opens new possibilities for exploring transformers as a loss function.

[310] WiSE-OD: Benchmarking Robustness in Infrared Object Detection

Heitor R. Medeiros, Atif Belal, Masih Aminbeidokhti, Eric Granger, Marco Pedersoli

Main category: cs.CV

TL;DR: WiSE-OD improves robustness for infrared object detection by combining RGB-pretrained and IR-fine-tuned models through weight-space ensembling, evaluated on new cross-modality OOD benchmarks.

Details

Motivation: Infrared object detection suffers from limited datasets, forcing reliance on RGB-pretrained models. Fine-tuning on IR improves accuracy but reduces robustness due to modality gaps between RGB and IR domains.

Method: Proposes WiSE-OD with two variants: WiSE-OD_ZS combines RGB zero-shot and IR fine-tuned weights, and WiSE-OD_LP blends zero-shot and linear probing. Also introduces LLVIP-C and FLIR-C benchmarks by applying corruptions to standard IR datasets.

Result: WiSE-OD improves robustness across modalities and to corruption in both synthetic and real-world distribution shifts (M3FD dataset) without additional training or inference costs.

Conclusion: Weight-space ensembling effectively leverages complementary knowledge from RGB and IR-trained models, enhancing robustness for infrared object detection in out-of-distribution scenarios.

Abstract: Object detection (OD) in infrared (IR) imagery is critical for low-light and nighttime applications. However, the scarcity of large-scale IR datasets forces models to rely on weights pre-trained on RGB images. While fine-tuning on IR improves accuracy, it often compromises robustness under distribution shifts due to the inherent modality gap between RGB and IR. To address this, we introduce LLVIP-C and FLIR-C, two cross-modality out-of-distribution (OOD) benchmarks built by applying corruptions to standard IR datasets. Additionally, to fully leverage the complementary knowledge from RGB and infrared-trained models, we propose WiSE-OD, a weight-space ensembling method with two variants: WiSE-OD${ZS}$, which combines RGB zero-shot and IR fine-tuned weights, and WiSE-OD${LP}$, which blends zero-shot and linear probing. Evaluated using four RGB-pretrained detectors and two robust baselines on our benchmark and in the real-world out-of-distribution M3FD dataset, our WiSE-OD improves robustness across modalities and to corruption in synthetic and real-world distribution shifts without any additional training or inference costs. Our code is available at: https://github.com/heitorrapela/wiseod.

[311] Multi-scale Latent Point Consistency Models for 3D Shape Generation

Bi’an Du, Wei Hu, Renjie Liao

Main category: cs.CV

TL;DR: MLPCM is a multi-scale latent point consistency model for 3D shape generation that achieves 100x speedup while improving quality and diversity over diffusion models.

Details

Motivation: To extend the sampling acceleration benefits of Consistency Models from 2D image generation to 3D point cloud shape generation, addressing the computational inefficiency of diffusion models for 3D data.

Method: Proposes a latent diffusion framework with hierarchical latent representations (point-level to super-point levels), multi-scale latent integration with 3D spatial attention, and a consistency distillation approach that compresses the prior into a one-step generator.

Result: Achieves 100x speedup in generation while surpassing state-of-the-art diffusion models in both shape quality and diversity on ShapeNet and ShapeNet-Vol benchmarks.

Conclusion: MLPCM successfully adapts consistency model principles to 3D point cloud generation, dramatically improving sampling efficiency while maintaining or enhancing performance compared to traditional diffusion approaches.

Abstract: Consistency Models (CMs) have significantly accelerated the sampling process in diffusion models, yielding impressive results in synthesizing high-resolution images. To explore and extend these advancements to point-cloud-based 3D shape generation, we propose a novel Multi-scale Latent Point Consistency Model (MLPCM). Our MLPCM follows a latent diffusion framework and introduces hierarchical levels of latent representations, ranging from point-level to super-point levels, each corresponding to a different spatial resolution. We design a multi-scale latent integration module along with 3D spatial attention to effectively denoise the point-level latent representations conditioned on those from multiple super-point levels. Additionally, we propose a latent consistency model, learned through consistency distillation, that compresses the prior into a one-step generator. This significantly improves sampling efficiency while preserving the performance of the original teacher model. Extensive experiments on standard benchmarks ShapeNet and ShapeNet-Vol demonstrate that MLPCM achieves a 100x speedup in the generation process, while surpassing state-of-the-art diffusion models in terms of both shape quality and diversity.

Qifeng Zhou, Wenliang Zhong, Thao M. Dang, Hehuan Ma, Saiyang Na, Yuzhi Guo, Junzhou Huang

Main category: cs.CV

TL;DR: HOMIE transforms general multimodal LLMs into specialized pathology retrieval experts via two-stage adaptation, addressing task/domain mismatches and introducing a new Pathology Composed Retrieval benchmark.

Details

Motivation: AI in pathology needs interpretable solutions; black-box models lack transparency while generative approaches risk hallucinations. Current retrieval models can't handle composed clinical queries, and no benchmark exists for this task.

Method: HOMIE framework with two-stage adaptation: 1) retrieval-adaptation stage for task mismatch, 2) pathology-specific tuning with progressive knowledge curriculum, stain processing, and native resolution handling for domain mismatch.

Result: HOMIE matches SOTA on traditional retrieval tasks and outperforms all baselines on the new Pathology Composed Retrieval task, trained only on public data.

Conclusion: HOMIE successfully addresses the dual mismatch problem in pathology AI, enabling interpretable composed retrieval through systematic MLLM adaptation and establishing a new benchmark for the field.

Abstract: The integration of Artificial Intelligence (AI) into pathology faces a fundamental challenge: black-box predictive models lack transparency, while generative approaches risk clinical hallucination. A case-based retrieval paradigm offers a more interpretable alternative for clinical adoption. However, current SOTA models are constrained by dual-encoder architectures that cannot process the composed modality of real-world clinical queries. We formally define the task of Pathology Composed Retrieval (PCR). However, progress in this newly defined task is blocked by two critical challenges: (1) Multimodal Large Language Models (MLLMs) offer the necessary deep-fusion architecture but suffer from a critical Task Mismatch and Domain Mismatch. (2) No benchmark exists to evaluate such compositional queries. To solve these challenges, we propose HOMIE, a systematic framework that transforms a general MLLM into a specialized retrieval expert. HOMIE resolves the dual mismatch via a two-stage process: a retrieval-adaptation stage to solve the task mismatch, and a pathology-specific tuning stage, featuring a progressive knowledge curriculum, pathology specfic stain and native resolution processing, to solve the domain mismatch. We also introduce the PCR Benchmark, a benchmark designed to evaluate composed retrieval in pathology. Experiments show that HOMIE, trained only on public data, matches SOTA performance on traditional retrieval tasks and outperforms all baselines on the newly defined PCR task.

[313] Language-Informed Hyperspectral Image Synthesis for Imbalanced-Small Sample Classification via Semi-Supervised Conditional Diffusion Model

Yimin Zhu, Lincoln Linlin Xu

Main category: cs.CV

TL;DR: Txt2HSI-LDM(VAE) is a novel text-guided diffusion model for hyperspectral image synthesis that addresses imbalanced-small sample data problems by generating realistic and diverse samples using textual descriptions.

Details

Motivation: Most existing methods for addressing imbalanced-small sample data in hyperspectral image classification extend features in latent space, but few leverage text-driven generation to create realistic and diverse samples. Recent success of text-guided diffusion models in natural image synthesis motivates their application to hyperspectral data.

Method: 1) Universal VAE maps high-dimensional hyperspectral data to low-dimensional latent space for stable features and reduced diffusion complexity. 2) Semi-supervised diffusion model uses random polygon spatial clipping and uncertainty estimation to simulate varying degrees of mixing. 3) VAE decodes generated latent features with language conditions as input to produce hyperspectral images.

Result: The model generates effective synthetic samples validated through statistical characteristics and data distribution in 2D-PCA space. Visual-linguistic cross-attention visualization shows the model captures spatial layout and geometry. Performance surpasses classical backbone models, state-of-the-art CNNs, and semi-supervised methods.

Conclusion: Txt2HSI-LDM(VAE) successfully addresses the imbalanced-small sample data problem in hyperspectral image classification by leveraging text-guided diffusion models to generate realistic and diverse hyperspectral samples, demonstrating superior performance over existing approaches.

Abstract: Data augmentation effectively addresses the imbalanced-small sample data (ISSD) problem in hyperspectral image classification (HSIC). While most methodologies extend features in the latent space, few leverage text-driven generation to create realistic and diverse samples. Recently, text-guided diffusion models have gained significant attention due to their ability to generate highly diverse and high-quality images based on text prompts in natural image synthesis. Motivated by this, this paper proposes Txt2HSI-LDM(VAE), a novel language-informed hyperspectral image synthesis method to address the ISSD in HSIC. The proposed approach uses a denoising diffusion model, which iteratively removes Gaussian noise to generate hyperspectral samples conditioned on textual descriptions. First, to address the high-dimensionality of hyperspectral data, a universal variational autoencoder (VAE) is designed to map the data into a low-dimensional latent space, which provides stable features and reduces the inference complexity of diffusion model. Second, a semi-supervised diffusion model is designed to fully take advantage of unlabeled data. Random polygon spatial clipping (RPSC) and uncertainty estimation of latent feature (LF-UE) are used to simulate the varying degrees of mixing. Third, the VAE decodes HSI from latent space generated by the diffusion model with the language conditions as input. In our experiments, we fully evaluate synthetic samples’ effectiveness from statistical characteristics and data distribution in 2D-PCA space. Additionally, visual-linguistic cross-attention is visualized on the pixel level to prove that our proposed model can capture the spatial layout and geometry of the generated data. Experiments demonstrate that the performance of the proposed Txt2HSI-LDM(VAE) surpasses the classical backbone models, state-of-the-art CNNs, and semi-supervised methods.

[314] FunduSegmenter: Leveraging the RETFound Foundation Model for Joint Optic Disc and Optic Cup Segmentation in Retinal Fundus Images

Zhenyi Zhao, Muthu Rama Krishnan Mookiah, Emanuele Trucco

Main category: cs.CV

TL;DR: FunduSegmenter adapts RETFound foundation model for joint optic disc and cup segmentation in fundus images, achieving state-of-the-art performance across multiple datasets.

Details

Motivation: To explore the potential of RETFound's general representations for OD/OC segmentation, which is crucial for automated retinal analysis tasks like biomarker discovery and accurate retinal coordinate setting.

Method: Proposed FunduSegmenter integrates RETFound with novel modules: Pre-adapter, Decoder, Post-adapter, skip connections with CBAM, and ViT block adapter. Evaluated on private GoDARTS and four public datasets through internal/external verification and domain generalization experiments.

Result: Achieved 90.51% average Dice in internal verification (outperforming nnU-Net: 82.91%, DUNet: 89.17%, TransUNet: 87.91%). External verification results were ~3% higher than best baselines, with competitive domain generalization performance.

Conclusion: FunduSegmenter demonstrates strong stability and generalization for OD/OC segmentation, outperforming state-of-the-art baselines. The proposed modules are generalizable for fine-tuning other foundation models.

Abstract: Purpose: This study aims to introduce the first adaptation of RETFound for joint optic disc (OD) and optic cup (OC) segmentation. RETFound is a well-known foundation model developed for fundus camera and optical coherence tomography images, which has shown promising performance in disease diagnosis. Methods: We propose FunduSegmenter, a model integrating a series of novel modules with RETFound, including a Pre-adapter, a Decoder, a Post-adapter, skip connections with Convolutional Block Attention Module and a Vision Transformer block adapter. The model is evaluated on a private dataset, GoDARTS, and four public datasets, IDRiD, Drishti-GS, RIM-ONE-r3, and REFUGE, through internal verification, external verification and domain generalization experiments. Results: An average Dice similarity coefficient of 90.51% was achieved in internal verification, which substantially outperformed the baselines (nnU-Net: 82.91%; DUNet: 89.17%; TransUNet: 87.91%). In all external verification experiments, the average results were about 3% higher than those of the best baseline, and were also competitive in domain generalization. Conclusions: This study explored the potential of the latent general representations learned by RETFound for OD and OC segmentation in fundus camera images. Our FunduSegmenter outperformed nearly all state-of-the-art baseline methods. The proposed modules are general and can be extended to fine-tuning other foundation models. Translational Relevance: The model shows strong stability and generalization on both in-distribution and out-of-distribution data, providing stable OD and OC segmentation. This is an essential step for many automated tasks, from setting the accurate retinal coordinate to biomarker discovery. The code and all trained weights are available at: [link to be added after the paper is accepted]

[315] MP-HSIR: A Multi-Prompt Framework for Universal Hyperspectral Image Restoration

Zhehui Wu, Yong Chen, Naoto Yokoya, Wei He

Main category: cs.CV

TL;DR: MP-HSIR is a multi-prompt framework for universal hyperspectral image restoration that integrates spectral, textual, and visual prompts to handle diverse degradation types and intensities without specific degradation assumptions.

Details

Motivation: Hyperspectral images suffer from diverse unknown degradations causing spectral and spatial distortions, but existing methods rely on specific degradation assumptions, limiting their effectiveness in complex real-world scenarios.

Method: Proposes MP-HSIR with a prompt-guided spatial-spectral transformer using spatial self-attention and prompt-guided dual-branch spectral self-attention. Introduces spectral prompts for universal low-rank spectral patterns and text-visual synergistic prompts that fuse semantic representations with visual features to encode degradation information.

Result: Extensive experiments on 9 HSI restoration tasks show MP-HSIR consistently outperforms existing all-in-one methods and surpasses state-of-the-art task-specific approaches across multiple tasks in all-in-one scenarios, generalization tests, and real-world cases.

Conclusion: MP-HSIR effectively integrates multi-modal prompts to achieve universal HSI restoration across diverse degradation types and intensities, demonstrating superior performance over both specialized and general-purpose methods.

Abstract: Hyperspectral images (HSIs) often suffer from diverse and unknown degradations during imaging, leading to severe spectral and spatial distortions. Existing HSI restoration methods typically rely on specific degradation assumptions, limiting their effectiveness in complex scenarios. In this paper, we propose \textbf{MP-HSIR}, a novel multi-prompt framework that effectively integrates spectral, textual, and visual prompts to achieve universal HSI restoration across diverse degradation types and intensities. Specifically, we develop a prompt-guided spatial-spectral transformer, which incorporates spatial self-attention and a prompt-guided dual-branch spectral self-attention. Since degradations affect spectral features differently, we introduce spectral prompts in the local spectral branch to provide universal low-rank spectral patterns as prior knowledge for enhancing spectral reconstruction. Furthermore, the text-visual synergistic prompt fuses high-level semantic representations with fine-grained visual features to encode degradation information, thereby guiding the restoration process. Extensive experiments on 9 HSI restoration tasks, including all-in-one scenarios, generalization tests, and real-world cases, demonstrate that MP-HSIR not only consistently outperforms existing all-in-one methods but also surpasses state-of-the-art task-specific approaches across multiple tasks. The code and models are available at https://github.com/ZhehuiWu/MP-HSIR.

[316] DSwinIR: Rethinking Window-based Attention for Image Restoration

Gang Wu, Junjun Jiang, Kui Jiang, Xianming Liu, Liqiang Nie

Main category: cs.CV

TL;DR: DSwinIR introduces a Deformable Sliding Window Attention mechanism that replaces rigid window partitioning with token-centric sliding windows and content-aware deformable sampling for better image restoration.

Details

Motivation: Transformer-based models using window-based self-attention suffer from insufficient feature interaction across windows and limited receptive fields due to rigid, non-overlapping window partitioning schemes.

Method: Proposes Deformable Sliding Window Transformer (DSwinIR) with two components: 1) token-centric sliding window paradigm to eliminate boundary artifacts, and 2) content-aware deformable sampling that learns data-dependent offsets to actively shape receptive fields.

Result: DSwinIR achieves state-of-the-art performance on several benchmarks, surpassing GridFormer by 0.53 dB on three-task benchmark and 0.87 dB on five-task benchmark in all-in-one image restoration.

Conclusion: The proposed Deformable Sliding Window Attention provides a more adaptive and flexible attention mechanism that moves beyond grid-based fixed window partitioning, enabling better feature interaction and receptive field adaptation for image restoration tasks.

Abstract: Image restoration has witnessed significant advancements with the development of deep learning models. Transformer-based models, particularly those using window-based self-attention, have become a dominant force. However, their performance is constrained by the rigid, non-overlapping window partitioning scheme, which leads to \textit{insufficient feature interaction across windows and limited receptive fields}. This highlights the need for more adaptive and flexible attention mechanisms. In this paper, we propose the Deformable Sliding Window Transformer for Image Restoration (DSwinIR), a new attention mechanism: the {Deformable Sliding Window (DSwin) Attention}. {This mechanism introduces a token-centric and content-aware paradigm that moves beyond the grid and fixed window partition.} It comprises two complementary components. First, it replaces the rigid partitioning with a \textit{token-centric sliding window} paradigm, {making it effective at eliminating boundary artifacts}. Second, it incorporates a \textit{content-aware deformable sampling} strategy, which allows the attention mechanism to learn data-dependent offsets and actively shape its receptive field to focus on the most informative image regions. Extensive experiments show that DSwinIR achieves strong results, including state-of-the-art performance on several evaluated benchmarks. For instance, in all-in-one image restoration, our DSwinIR surpasses the most recent backbone GridFormer by 0.53 dB on the three-task benchmark and 0.87 dB on the five-task benchmark.

[317] Multi-Focused Video Group Activities Hashing

Zhongmiao Qi, Yan Jiang, Bolin Zhang, Chong Wang, Lijun Guo, Pengjiang Qian, Jiangbo Qian

Main category: cs.CV

TL;DR: Proposes STVH and M-STVH for group activity video retrieval, capturing spatiotemporal evolution of objects and interactions, with multi-focused learning for both activity semantics and object visual features.

Details

Motivation: With explosive growth of video data in complex scenarios, there's an urgent need for quick retrieval of group activities. Existing methods often retrieve entire videos rather than activity-level granularity, and real-life scenarios require both activity features and object visual features.

Method: STVH (spatiotemporal interleaved video hashing) uses unified framework to model individual object dynamics and group interactions, capturing spatiotemporal evolution. M-STVH (multi-focused spatiotemporal video hashing) enhances this with hierarchical feature integration through multi-focused representation learning to jointly focus on activity semantics and object visual features.

Result: Both STVH and M-STVH achieve excellent results in comparative experiments on publicly available datasets.

Conclusion: The proposed techniques effectively address the problem of group activity video retrieval at activity granularity rather than entire video level, with M-STVH providing enhanced capability to handle both activity semantics and object visual features as needed in real-world scenarios.

Abstract: With the explosive growth of video data in various complex scenarios, quickly retrieving group activities has become an urgent problem. However, many tasks can only retrieve videos focusing on an entire video, not the activity granularity. To solve this problem, we propose a new STVH (spatiotemporal interleaved video hashing) technique for the first time. Through a unified framework, the STVH simultaneously models individual object dynamics and group interactions, capturing the spatiotemporal evolution on both group visual features and positional features. Moreover, in real-life video retrieval scenarios, it may sometimes require activity features, while at other times, it may require visual features of objects. We then further propose a novel M-STVH (multi-focused spatiotemporal video hashing) as an enhanced version to handle this difficult task. The advanced method incorporates hierarchical feature integration through multi-focused representation learning, allowing the model to jointly focus on activity semantics features and object visual features. We conducted comparative experiments on publicly available datasets, and both STVH and M-STVH can achieve excellent results.

[318] Adapting In-Domain Few-Shot Segmentation to New Domains without Source Domain Retraining

Qi Fan, Kaiqi Liu, Nian Liu, Hisham Cholakkal, Rao Muhammad Anwer, Wenbin Li, Yang Gao

Main category: cs.CV

TL;DR: ISA adapts pre-trained FSS models to new domains without retraining by identifying and training domain-specific model structures using few-shot support samples during inference.

Details

Motivation: CD-FSS faces challenges from diverse target domains and limited support data. Existing methods require costly redesign and retraining of models using abundant source domain data, which is inefficient.

Method: 1) Adaptively identify domain-specific model structures using novel structure Fisher score to measure parameter importance; 2) Progressively train selected structures with hierarchically constructed samples from fewer to more support shots; 3) Enables flexible adaptation of existing FSS models without source domain retraining.

Result: Superior performance across multiple CD-FSS benchmarks, demonstrating effective domain shift handling and adaptation capabilities without model redesign or base data retraining.

Conclusion: ISA provides an efficient solution for CD-FSS by adapting pre-trained FSS models to new domains using few-shot support samples during inference, eliminating costly retraining while maintaining strong performance.

Abstract: Cross-domain few-shot segmentation (CD-FSS) aims to segment objects of novel classes in new domains, which is often challenging due to the diverse characteristics of target domains and the limited availability of support data. Most CD-FSS methods redesign and retrain in-domain FSS models using abundant base data from the source domain, which are effective but costly to train. To address these issues, we propose adapting informative model structures of the well-trained FSS model for target domains by learning domain characteristics from few-shot labeled support samples during inference, thereby eliminating the need for source domain retraining. Specifically, we first adaptively identify domain-specific model structures by measuring parameter importance using a novel structure Fisher score in a data-dependent manner. Then, we progressively train the selected informative model structures with hierarchically constructed training samples, progressing from fewer to more support shots. The resulting Informative Structure Adaptation (ISA) method effectively addresses domain shifts and equips existing well-trained in-domain FSS models with flexible adaptation capabilities for new domains, eliminating the need to redesign or retrain CD-FSS models on base data. Extensive experiments validate the effectiveness of our method, demonstrating superior performance across multiple CD-FSS benchmarks. Codes are at https://github.com/fanq15/ISA.

[319] Ordinal Adaptive Correction: A Data-Centric Approach to Ordinal Image Classification with Noisy Labels

Alireza Sedighi Moghaddam, Mohammad Reza Mohammadi

Main category: cs.CV

TL;DR: ORDAC is a novel data-centric method for adaptive correction of noisy labels in ordinal image classification using Label Distribution Learning to model ambiguity and dynamically adjust label distributions.

Details

Motivation: Label noise is prevalent in ordinal image classification due to ambiguous class boundaries, which degrades model performance and reliability. Existing approaches often discard noisy samples, wasting valuable training data.

Method: Proposes ORDinal Adaptive Correction (ORDAC) using Label Distribution Learning to model ordinal label ambiguity. Dynamically adjusts mean and standard deviation of label distributions for each sample during training, correcting noisy labels rather than discarding them.

Result: Significant improvements on benchmark datasets (Adience for age estimation, Diabetic Retinopathy for disease severity). On Adience with 40% noise, ORDAC_R reduced MAE from 0.86 to 0.62 and increased recall from 0.37 to 0.49. Effective in correcting intrinsic noise in original datasets.

Conclusion: Adaptive label correction using label distributions is an effective strategy to enhance robustness and accuracy of ordinal classification models in noisy data scenarios, making optimal use of entire training datasets.

Abstract: Labeled data is a fundamental component in training supervised deep learning models for computer vision tasks. However, the labeling process, especially for ordinal image classification where class boundaries are often ambiguous, is prone to error and noise. Such label noise can significantly degrade the performance and reliability of machine learning models. This paper addresses the problem of detecting and correcting label noise in ordinal image classification tasks. To this end, a novel data-centric method called ORDinal Adaptive Correction (ORDAC) is proposed for adaptive correction of noisy labels. The proposed approach leverages the capabilities of Label Distribution Learning (LDL) to model the inherent ambiguity and uncertainty present in ordinal labels. During training, ORDAC dynamically adjusts the mean and standard deviation of the label distribution for each sample. Rather than discarding potentially noisy samples, this approach aims to correct them and make optimal use of the entire training dataset. The effectiveness of the proposed method is evaluated on benchmark datasets for age estimation (Adience) and disease severity detection (Diabetic Retinopathy) under various asymmetric Gaussian noise scenarios. Results show that ORDAC and its extended versions (ORDAC_C and ORDAC_R) lead to significant improvements in model performance. For instance, on the Adience dataset with 40% noise, ORDAC_R reduced the mean absolute error from 0.86 to 0.62 and increased the recall metric from 0.37 to 0.49. The method also demonstrated its effectiveness in correcting intrinsic noise present in the original datasets. This research indicates that adaptive label correction using label distributions is an effective strategy to enhance the robustness and accuracy of ordinal classification models in the presence of noisy data.

[320] A Preliminary Study on GPT-Image Generation Model for Image Restoration

Hao Yang, Yan Yang, Ruikun Zhang, Liyuan Pan

Main category: cs.CV

TL;DR: GPT-image models produce visually pleasant but structurally inaccurate restoration results; they can serve as strong visual priors to boost existing restoration networks.

Details

Motivation: To investigate the potential impact of OpenAI's GPT-series multimodal generation models on image restoration community, given their remarkable capabilities in producing visually compelling images.

Method: Conducted first systematic benchmark across diverse restoration scenarios, evaluated GPT-image models’ restoration outputs, and demonstrated their use as visual priors integrated into existing restoration networks for dehazing, deraining, and low-light enhancement tasks.

Result: GPT-image restoration results are perceptually pleasant but lack pixel-level structural fidelity (geometry changes, object position/count alterations, perspective modifications). When used as visual priors, they significantly boost restoration quality for existing networks.

Conclusion: GPT-image models offer strong visual priors for restoration tasks, providing practical insights and baseline framework for integrating generative priors into restoration pipelines, bridging image generation models with restoration tasks.

Abstract: Recent advances in OpenAI’s GPT-series multimodal generation models have shown remarkable capabilities in producing visually compelling images. In this work, we investigate its potential impact on the image restoration community. We provide, to the best of our knowledge, the first systematic benchmark across diverse restoration scenarios. Our evaluation shows that, while the restoration results generated by GPT-Image models are often perceptually pleasant, they tend to lack pixel-level structural fidelity compared with ground-truth references. Typical deviations include changes in image geometry, object positions or counts, and even modifications in perspective. Beyond empirical observations, we further demonstrate that outputs from GPT-Image models can act as strong visual priors, offering notable performance improvements for existing restoration networks. Using dehazing, deraining, and low-light enhancement as representative case studies, we show that integrating GPT-generated priors significantly boosts restoration quality. This study not only provides practical insights and a baseline framework for incorporating GPT-based generative priors into restoration pipelines, but also highlights new opportunities for bridging image generation models and restoration tasks. To support future research, we will release GPT-restored results.

[321] ViC-Bench: Benchmarking Visual-Interleaved Chain-of-Thought Capability in MLLMs with Free-Style Intermediate State Representations

Xuecheng Wu, Jiaxing Liu, Danlei Huang, Yifan Wang, Yunyun Shi, Kedi Chen, Junxiao Xue, Yang Liu, Chunlin Chen, Hairong Dong, Dingkang Yang

Main category: cs.CV

TL;DR: VI-CoT enables MLLMs to update understanding via step-wise visual states, but current benchmarks use fixed rather than free-style IVS. Authors introduce ViC-Bench with 4 tasks and systematic evaluation to assess true VI-CoT capabilities.

Details

Motivation: Current benchmarks for Visual-Interleaved Chain-of-Thought (VI-CoT) provide models with fixed intermediate visual states (IVS) rather than free-style IVS, which may distort original thinking trajectories and fail to evaluate intrinsic reasoning capabilities. Existing benchmarks also neglect systematic exploration of how IVS impacts reasoning performance.

Method: Introduce ViC-Bench with four representative tasks (maze navigation, jigsaw puzzle, embodied long-horizon planning, complex counting) each with dedicated free-style IVS generation pipeline supporting adaptive function calls. Propose progressive three-stage evaluation strategy with new metrics and Incremental Prompting Information Injection strategy to explore prompting factors.

Result: Extensively evaluated 18 advanced MLLMs, revealing key insights into their VI-CoT capability. The benchmark has been made publicly available on Huggingface.

Conclusion: ViC-Bench addresses limitations of existing benchmarks by providing free-style IVS and systematic evaluation methods, enabling better assessment of MLLMs’ true Visual-Interleaved Chain-of-Thought reasoning capabilities across diverse tasks.

Abstract: Visual-Interleaved Chain-of-Thought (VI-CoT) enables Multi-modal Large Language Models (MLLMs) to continually update their understanding and decision space based on step-wise intermediate visual states (IVS), much like a human would, which has demonstrated impressive success in various tasks, thereby leading to emerged advancements in related downstream benchmarks. Despite promising progress, current benchmarks provide models with relatively fixed IVS, rather than free-style IVS, whch might forcibly distort the original thinking trajectories, failing to evaluate their intrinsic reasoning capabilities. More importantly, existing benchmarks neglect to systematically explore the impact factors that IVS would impart to the untamed reasoning performance. To tackle above gaps, we introduce a specialized benchmark termed ViC-Bench, consisting of four representive tasks, i.e., maze navigation, jigsaw puzzle, embodied long-horizon planning, as well as complex counting, where each task has dedicated free-style IVS generation pipeline supporting adaptive function calls. To systematically examine VI-CoT capability, we propose a thorough evaluation suite incorporating a progressive three-stage strategy with targeted new metrics. Besides, we establish Incremental Prompting Information Injection strategy to ablatively explore the prompting factors for VI-CoT. We extensively conduct evaluations for 18 advanced MLLMs, revealing key insights into their VI-CoT capability. The introduced ViC-Bench has been made publicly available at Huggingface.

[322] Visual Explanation via Similar Feature Activation for Metric Learning

Yi Liao, Ugochukwu Ejike Akpudo, Jue Zhang, Yongsheng Gao, Jun Zhou, Wenyi Zeng, Weichuan Zhang

Main category: cs.CV

TL;DR: Proposes SFAM, a visual explanation method for metric learning models that lack fully connected classifiers, using channel-wise importance scores from similarity measurements between image embeddings.

Details

Motivation: Existing CAM methods (Grad-CAM, Relevance-CAM) require fully connected classifiers and cannot be applied to metric learning models, creating a gap in interpretability for these architectures.

Method: SFAM introduces channel-wise contribution importance score (CIS) derived from similarity measurements between image embeddings, then linearly combines these importance weights with CNN feature maps to create explanation maps.

Result: Quantitative and qualitative experiments demonstrate that SFAM provides highly promising interpretable visual explanations for CNN models using Euclidean distance or cosine similarity metrics.

Conclusion: SFAM successfully addresses the limitation of existing CAM methods by enabling visual explanations for metric learning models, enhancing trustworthiness and providing guidance for algorithm development in image recognition.

Abstract: Visual explanation maps enhance the trustworthiness of decisions made by deep learning models and offer valuable guidance for developing new algorithms in image recognition tasks. Class activation maps (CAM) and their variants (e.g., Grad-CAM and Relevance-CAM) have been extensively employed to explore the interpretability of softmax-based convolutional neural networks, which require a fully connected layer as the classifier for decision-making. However, these methods cannot be directly applied to metric learning models, as such models lack a fully connected layer functioning as a classifier. To address this limitation, we propose a novel visual explanation method termed Similar Feature Activation Map (SFAM). This method introduces the channel-wise contribution importance score (CIS) to measure feature importance, derived from the similarity measurement between two image embeddings. The explanation map is constructed by linearly combining the proposed importance weights with the feature map from a CNN model. Quantitative and qualitative experiments show that SFAM provides highly promising interpretable visual explanations for CNN models using Euclidean distance or cosine similarity as the similarity metric.

[323] Seeing Isn’t Believing: Context-Aware Adversarial Patch Synthesis via Conditional GAN

Roie Kazoom, Alon Goldberg, Hodaya Cohen, Ofer Hadar

Main category: cs.CV

TL;DR: A novel framework for fully controllable adversarial patch generation that allows attackers to choose both input image and target class, achieving state-of-the-art attack success rates while maintaining visual realism.

Details

Motivation: Existing adversarial patch attacks have limitations: they rely on unrealistic white-box assumptions, use untargeted objectives, or produce visually conspicuous patches that limit real-world applicability. There's a need for a method that combines realism, targeted control, and practical stealthiness.

Method: Combines a generative U-Net design with Grad-CAM-guided patch placement for semantic-aware localization. This enables precise patch placement that maximizes attack effectiveness while preserving visual realism.

Result: Achieves state-of-the-art performance with attack success rates (ASR) and target-class success (TCS) consistently exceeding 99% across various models including convolutional networks (DenseNet-121, ResNet-50) and vision transformers (ViT-B/16, Swin-B/16). Outperforms prior white-box attacks, untargeted baselines, and non-realistic approaches.

Conclusion: The framework establishes a new benchmark for adversarial robustness research by simultaneously ensuring realism, targeted control, and black-box applicability - addressing the three most challenging dimensions of patch-based attacks and bridging the gap between theoretical attack strength and practical stealthiness.

Abstract: Adversarial patch attacks pose a severe threat to deep neural networks, yet most existing approaches rely on unrealistic white-box assumptions, untargeted objectives, or produce visually conspicuous patches that limit real-world applicability. In this work, we introduce a novel framework for fully controllable adversarial patch generation, where the attacker can freely choose both the input image x and the target class y target, thereby dictating the exact misclassification outcome. Our method combines a generative U-Net design with Grad-CAM-guided patch placement, enabling semantic-aware localization that maximizes attack effectiveness while preserving visual realism. Extensive experiments across convolutional networks (DenseNet-121, ResNet-50) and vision transformers (ViT-B/16, Swin-B/16, among others) demonstrate that our approach achieves state-of-the-art performance across all settings, with attack success rates (ASR) and target-class success (TCS) consistently exceeding 99%. Importantly, we show that our method not only outperforms prior white-box attacks and untargeted baselines, but also surpasses existing non-realistic approaches that produce detectable artifacts. By simultaneously ensuring realism, targeted control, and black-box applicability-the three most challenging dimensions of patch-based attacks-our framework establishes a new benchmark for adversarial robustness research, bridging the gap between theoretical attack strength and practical stealthiness.

[324] MokA: Multimodal Low-Rank Adaptation for MLLMs

Yake Wei, Yu Miao, Dongzhan Zhou, Di Hu

Main category: cs.CV

TL;DR: MokA is a multimodal-aware efficient fine-tuning method that addresses limitations of current LLM-based approaches by considering both unimodal adaptation and cross-modal interaction for MLLMs.

Details

Motivation: Current efficient multimodal fine-tuning methods are directly borrowed from LLMs, neglecting intrinsic multimodal differences and failing to fully utilize all modalities, which limits their effectiveness.

Method: Proposes Multimodal low-rank Adaptation (MokA) with modality-specific parameters to compress unimodal information while explicitly enhancing cross-modal interaction, ensuring both unimodal and cross-modal adaptation.

Result: Extensive experiments across three multimodal scenarios (audio-visual-text, visual-text, speech-text) and multiple LLM backbones show consistent improvements, demonstrating efficacy and versatility.

Conclusion: MokA provides a more targeted solution for efficient adaptation of MLLMs, paving the way for further exploration in multimodal fine-tuning.

Abstract: In this paper, we reveal that most current efficient multimodal fine-tuning methods are hindered by a key limitation: they are directly borrowed from LLMs, often neglecting the intrinsic differences of multimodal scenarios and even affecting the full utilization of all modalities. Inspired by our empirical observation, we argue that unimodal adaptation and cross-modal adaptation are two essential parts for the effective fine-tuning of MLLMs. From this perspective, we propose Multimodal low-rank Adaptation (MokA), a multimodal-aware efficient fine-tuning strategy that takes multimodal characteristics into consideration. It compresses unimodal information by modality-specific parameters while explicitly enhancing cross-modal interaction, ensuring both unimodal and cross-modal adaptation. Extensive experiments cover three representative multimodal scenarios (audio-visual-text, visual-text, and speech-text), and multiple LLM backbones (LLaMA2/3, Qwen2, Qwen2.5-VL, etc). Consistent improvements indicate the efficacy and versatility of the proposed method. Ablation studies and efficiency evaluation are also conducted to fully asses our method. Overall, we think MokA provides a more targeted solution for efficient adaptation of MLLMs, paving the way for further exploration. The project page is at https://gewu-lab.github.io/MokA.

[325] It’s Not the Target, It’s the Background: Rethinking Infrared Small Target Detection via Deep Patch-Free Low-Rank Representations

Guoyi Zhang, Guangsheng Xu, Siyang Chen, Han Wang, Xiaohu Zhang

Main category: cs.CV

TL;DR: LRRNet: A novel end-to-end infrared small target detection framework that learns low-rank background representations directly in the image domain using a compression-reconstruction-subtraction paradigm, achieving state-of-the-art performance with real-time speed.

Details

Motivation: Infrared small target detection faces challenges due to low signal-to-clutter ratios, diverse target morphologies, and lack of distinctive visual cues. Existing deep learning approaches struggle with the intrinsic variability and weak priors of small targets, leading to unstable performance.

Method: LRRNet leverages the low-rank property of infrared image backgrounds through a compression-reconstruction-subtraction (CRS) paradigm. It directly models structure-aware low-rank background representations in the image domain without patch-based processing or explicit matrix decomposition, using deep neural networks in an end-to-end manner.

Result: Extensive experiments show LRRNet outperforms 38 state-of-the-art methods in detection accuracy, robustness, and computational efficiency. It achieves real-time performance with 82.34 FPS average speed and demonstrates resilience to sensor noise on the challenging NoisySIRST dataset.

Conclusion: LRRNet is the first work to directly learn low-rank background structures using deep neural networks in an end-to-end manner, providing an effective solution for infrared small target detection with superior performance and practical efficiency.

Abstract: \textcolor{blue}{This is the pre-acceptance version, to read the final version please go to \href{https://ieeexplore.ieee.org/document/11156113}{IEEE Transactions on Geoscience and Remote Sensing on IEEE Xplore}.} Infrared small target detection (IRSTD) remains a long-standing challenge in complex backgrounds due to low signal-to-clutter ratios (SCR), diverse target morphologies, and the absence of distinctive visual cues. While recent deep learning approaches aim to learn discriminative representations, the intrinsic variability and weak priors of small targets often lead to unstable performance. In this paper, we propose a novel end-to-end IRSTD framework, termed LRRNet, which leverages the low-rank property of infrared image backgrounds. Inspired by the physical compressibility of cluttered scenes, our approach adopts a compression–reconstruction–subtraction (CRS) paradigm to directly model structure-aware low-rank background representations in the image domain, without relying on patch-based processing or explicit matrix decomposition. To the best of our knowledge, this is the first work to directly learn low-rank background structures using deep neural networks in an end-to-end manner. Extensive experiments on multiple public datasets demonstrate that LRRNet outperforms 38 state-of-the-art methods in terms of detection accuracy, robustness, and computational efficiency. Remarkably, it achieves real-time performance with an average speed of 82.34 FPS. Evaluations on the challenging NoisySIRST dataset further confirm the model’s resilience to sensor noise. The source code will be made publicly available upon acceptance.

[326] Fine-Tuned Vision Transformers Capture Complex Wheat Spike Morphology for Volume Estimation from RGB Images

Olivia Zumsteg, Nico Graf, Aaron Haeusler, Norbert Kirchgessner, Nicola Storni, Lukas Roth, Andreas Hund

Main category: cs.CV

TL;DR: This paper presents deep learning approaches for estimating 3D wheat spike volume from 2D RGB images, using 3D scans as ground truth. Fine-tuned Vision Transformers (DINOv2/v3) achieve best results, outperforming CNNs and traditional geometric methods.

Details

Motivation: Wheat spike volume is highly correlated with spike dry weight (fruiting efficiency), making it valuable for phenotyping. However, estimating 3D volume from 2D images is challenging due to depth information loss, projection distortions, and occlusions in field conditions.

Method: Multiple approaches compared: 1) Fine-tuned Vision Transformers (DINOv2/DINOv3) with MLPs, 2) Fine-tuned CNNs (ResNet18/ResNet50), 3) Wheat-specific backbones, 4) Traditional baselines (2D area-based projection and geometric reconstruction using axis-aligned cross-sections). Used structured-light 3D scans as ground truth for training and evaluation.

Result: DINOv3 achieved lowest MAPE of 4.67% and highest correlation of 0.97 on six-view indoor images. On field-based single side-view images, fine-tuned DINOv3 achieved MAPE of 8.39% and correlation of 0.90. Vision Transformers outperformed CNNs, wheat-specific backbones, and traditional geometric methods. Object shape significantly impacts accuracy, with irregular geometries like wheat spikes posing greater challenges for geometric methods than deep learning.

Conclusion: The paper provides a novel pipeline for fast, accurate, non-destructive wheat spike volume phenotyping using deep learning. Fine-tuned Vision Transformers offer superior performance, demonstrating that improved high-level representations enable simple MLPs to outperform more complex architectures like LSTMs after fine-tuning.

Abstract: Estimating three-dimensional morphological traits such as volume from two-dimensional RGB images presents inherent challenges due to the loss of depth information, projection distortions, and occlusions under field conditions. In this work, we explore multiple approaches for non-destructive volume estimation of wheat spikes using RGB images and structured-light 3D scans as ground truth references. Wheat spike volume is promising for phenotyping as it shows high correlation with spike dry weight, a key component of fruiting efficiency. Accounting for the complex geometry of the spikes, we compare different neural network approaches for volume estimation from 2D images and benchmark them against two conventional baselines: a 2D area-based projection and a geometric reconstruction using axis-aligned cross-sections. Fine-tuned Vision Transformers (DINOv2 and DINOv3) with MLPs achieve the lowest MAPE of 5.08% and 4.67% and the highest correlation of 0.96 and 0.97 on six-view indoor images, outperforming fine-tuned CNNs (ResNet18 and ResNet50), wheat-specific backbones, and both baselines. When using frozen DINO backbones, deep-supervised LSTMs outperform MLPs, whereas after fine-tuning, improved high-level representations allow simple MLPs to outperform LSTMs. We demonstrate that object shape significantly impacts volume estimation accuracy, with irregular geometries such as wheat spikes posing greater challenges for geometric methods than for deep learning approaches. Fine-tuning DINOv3 on field-based single side-view images yields a MAPE of 8.39% and a correlation of 0.90, providing a novel pipeline and a fast, accurate, and non-destructive approach for wheat spike volume phenotyping.

[327] Enhancing Cross-Patient Generalization in AI-Based Parkinson s Disease Detection

Mhd Adnan Albani, Riad Sonbol

Main category: cs.CV

TL;DR: Two-stage approach for Parkinson’s disease detection from hand-drawn images using chunking strategy and ensemble method, achieving high accuracy with minimal performance drop on unseen patients.

Details

Motivation: Existing Parkinson's disease detection methods from hand-drawn images suffer from insufficient datasets and poor robustness when dealing with unseen patient data.

Method: Two-stage approach: first classifies drawing types (circle, meander, spiral), then extracts features. Uses chunking strategy (2x2 image division) with separate feature extraction per chunk, followed by ensemble method to merge decisions.

Result: Achieved 97.08% accuracy for seen patients and 94.91% for unseen patients on NewHandPD dataset, maintaining only 2.17 percentage point gap compared to 4.76-point drop in prior work.

Conclusion: The proposed approach effectively addresses dataset limitations and improves robustness for unseen patients, outperforming state-of-the-art methods in Parkinson’s disease detection from hand-drawn images.

Abstract: Parkinson’s disease (PD) is a neurodegenerative disease affecting about 1% of people over the age of 60, causing motor impairments that impede hand coordination activities such as writing and drawing. Many approaches have tried to support early detection of Parkinson’s disease based on hand-drawn images; however, we identified two major limitations in the related works: (1) the lack of sufficient datasets, (2) the robustness when dealing with unseen patient data. In this paper, we propose a new approach to detect Parkinson’s disease that consists of two stages: The first stage classifies based on their drawing type(circle, meander, spiral), and the second stage extracts the required features from the images and detects Parkinson’s disease. We overcame the previous two limitations by applying a chunking strategy where we divide each image into 2x2 chunks. Each chunk is processed separately when extracting features and recognizing Parkinson’s disease indicators. To make the final classification, an ensemble method is used to merge the decisions made from each chunk. Our evaluation shows that our proposed approach outperforms the top performing state-of-the-art approaches, in particular on unseen patients. On the NewHandPD dataset our approach, it achieved 97.08% accuracy for seen patients and 94.91% for unseen patients, our proposed approach maintained a gap of only 2.17 percentage points, compared to the 4.76-point drop observed in prior work.

[328] MatDecompSDF: High-Fidelity 3D Shape and PBR Material Decomposition from Multi-View Images

Chengyu Wang, Isabella Bennett, Henry Scott, Liang Zhang, Mei Chen, Hao Li, Rui Zhao

Main category: cs.CV

TL;DR: MatDecompSDF is a framework that recovers 3D shapes and decomposes material properties from multi-view images using neural SDF geometry, PBR material fields, and environmental lighting models with differentiable rendering and physical priors.

Details

Motivation: The core challenge of inverse rendering is the ill-posed disentanglement of geometry, materials, and illumination from 2D observations. Existing methods struggle with robust decomposition of physically-based material properties while maintaining high-fidelity geometry.

Method: Joint optimization of three neural components: neural SDF for geometry, spatially-varying neural field for PBR material parameters (albedo, roughness, metallic), and MLP-based environmental lighting model. Uses physically-based differentiable rendering layer with material smoothness loss and Eikonal loss for regularization.

Result: Surpasses state-of-the-art methods on synthetic and real-world datasets (DTU) in geometric accuracy, material fidelity, and novel view synthesis. Produces editable, relightable assets compatible with standard graphics pipelines.

Conclusion: MatDecompSDF effectively solves the inverse rendering problem by combining neural representations with physical priors, enabling practical digital content creation through high-quality 3D reconstruction and material decomposition.

Abstract: We present MatDecompSDF, a novel framework for recovering high-fidelity 3D shapes and decomposing their physically-based material properties from multi-view images. The core challenge of inverse rendering lies in the ill-posed disentanglement of geometry, materials, and illumination from 2D observations. Our method addresses this by jointly optimizing three neural components: a neural Signed Distance Function (SDF) to represent complex geometry, a spatially-varying neural field for predicting PBR material parameters (albedo, roughness, metallic), and an MLP-based model for capturing unknown environmental lighting. The key to our approach is a physically-based differentiable rendering layer that connects these 3D properties to the input images, allowing for end-to-end optimization. We introduce a set of carefully designed physical priors and geometric regularizations, including a material smoothness loss and an Eikonal loss, to effectively constrain the problem and achieve robust decomposition. Extensive experiments on both synthetic and real-world datasets (e.g., DTU) demonstrate that MatDecompSDF surpasses state-of-the-art methods in geometric accuracy, material fidelity, and novel view synthesis. Crucially, our method produces editable and relightable assets that can be seamlessly integrated into standard graphics pipelines, validating its practical utility for digital content creation.

[329] Video Event Reasoning and Prediction by Fusing World Knowledge from LLMs with Vision Foundation Models

L’ea Dubois, Klaus Schmidt, Chengyu Wang, Ji-Hoon Park, Lin Wang, Santiago Munoz

Main category: cs.CV

TL;DR: A novel video understanding framework that combines Vision Foundation Models with Large Language Models to enable advanced cognitive reasoning like causality and future prediction, achieving state-of-the-art performance with strong zero-shot generalization.

Details

Motivation: Current video understanding models only recognize "what" is happening but lack high-level cognitive abilities like causal reasoning and future prediction due to insufficient commonsense world knowledge. There's a need to bridge this cognitive gap between visual perception and reasoning.

Method: Proposes a framework that fuses a Vision Foundation Model (VFM) for visual perception with a Large Language Model (LLM) as a knowledge-driven reasoning core. Uses a sophisticated fusion module inspired by Q-Former architecture to distill spatiotemporal and object-centric visual features into language-aligned representations. Employs two-stage training: large-scale alignment pre-training on video-text data followed by targeted instruction fine-tuning on curated reasoning datasets.

Result: Achieves state-of-the-art performance on multiple challenging benchmarks. Demonstrates remarkable zero-shot generalization to unseen reasoning tasks. Ablation studies validate the critical contribution of each architectural component.

Conclusion: This work pushes machine perception from simple recognition towards genuine cognitive understanding, paving the way for more intelligent AI systems in robotics, human-computer interaction, and other applications requiring advanced reasoning capabilities.

Abstract: Current video understanding models excel at recognizing “what” is happening but fall short in high-level cognitive tasks like causal reasoning and future prediction, a limitation rooted in their lack of commonsense world knowledge. To bridge this cognitive gap, we propose a novel framework that synergistically fuses a powerful Vision Foundation Model (VFM) for deep visual perception with a Large Language Model (LLM) serving as a knowledge-driven reasoning core. Our key technical innovation is a sophisticated fusion module, inspired by the Q-Former architecture, which distills complex spatiotemporal and object-centric visual features into a concise, language-aligned representation. This enables the LLM to effectively ground its inferential processes in direct visual evidence. The model is trained via a two-stage strategy, beginning with large-scale alignment pre-training on video-text data, followed by targeted instruction fine-tuning on a curated dataset designed to elicit advanced reasoning and prediction skills. Extensive experiments demonstrate that our model achieves state-of-the-art performance on multiple challenging benchmarks. Notably, it exhibits remarkable zero-shot generalization to unseen reasoning tasks, and our in-depth ablation studies validate the critical contribution of each architectural component. This work pushes the boundary of machine perception from simple recognition towards genuine cognitive understanding, paving the way for more intelligent and capable AI systems in robotics, human-computer interaction, and beyond.

[330] RiemanLine: Riemannian Manifold Representation of 3D Lines for Factor Graph Optimization

Yan Li, Ze Yang, Keisuke Tateno, Federico Tombari, Liang Zhao, Gim Hee Lee

Main category: cs.CV

TL;DR: RiemanLine: A unified minimal Riemannian representation for 3D lines that handles both individual lines and parallel-line groups, reducing parameter space and improving camera localization accuracy.

Details

Motivation: Existing 3D line representations in robotics and computer vision handle only independent lines, ignoring structural regularities like parallel lines that are common in man-made environments. This limits efficiency and accuracy in camera localization and mapping.

Method: Decouples line landmarks into global (shared vanishing direction on unit sphere S²) and local (scaled normal vectors on orthogonal subspaces) components. For n parallel lines, reduces parameters from 4n to 2n+2. Integrated into factor graph framework for unified manifold-based bundle adjustment.

Result: Extensive experiments on ICL-NUIM, TartanAir, and synthetic benchmarks show significantly more accurate pose estimation and line reconstruction, while reducing parameter dimensionality and improving convergence stability.

Conclusion: RiemanLine provides a unified minimal representation that naturally embeds structural regularities like parallelism without explicit constraints, enabling more efficient and accurate 3D line-based camera localization and mapping in structured environments.

Abstract: Minimal parametrization of 3D lines plays a critical role in camera localization and structural mapping. Existing representations in robotics and computer vision predominantly handle independent lines, overlooking structural regularities such as sets of parallel lines that are pervasive in man-made environments. This paper introduces \textbf{RiemanLine}, a unified minimal representation for 3D lines formulated on Riemannian manifolds that jointly accommodates both individual lines and parallel-line groups. Our key idea is to decouple each line landmark into global and local components: a shared vanishing direction optimized on the unit sphere $\mathcal{S}^2$, and scaled normal vectors constrained on orthogonal subspaces, enabling compact encoding of structural regularities. For $n$ parallel lines, the proposed representation reduces the parameter space from $4n$ (orthonormal form) to $2n+2$, naturally embedding parallelism without explicit constraints. We further integrate this parameterization into a factor graph framework, allowing global direction alignment and local reprojection optimization within a unified manifold-based bundle adjustment. Extensive experiments on ICL-NUIM, TartanAir, and synthetic benchmarks demonstrate that our method achieves significantly more accurate pose estimation and line reconstruction, while reducing parameter dimensionality and improving convergence stability.

[331] When Deepfake Detection Meets Graph Neural Network:a Unified and Lightweight Learning Framework

Haoyu Liu, Chaoyu Gong, Mengke He, Jiate Li, Kai Han, Siqiang Luo

Main category: cs.CV

TL;DR: SSTGNN is a lightweight graph neural network framework for detecting AI-generated/manipulated videos using joint spatial-spectral-temporal analysis with 42x fewer parameters than SOTA models.

Details

Motivation: Existing video manipulation detection methods fail to generalize across diverse manipulation types due to reliance on isolated spatial, temporal, or spectral information, and typically require large models, making real-world deployment challenging.

Method: SSTGNN represents videos as structured graphs and uses a Spatial-Spectral-Temporal Graph Neural Network framework with learnable spectral filters and spatial-temporal differential modeling for joint reasoning over spatial inconsistencies, temporal artifacts, and spectral distortions.

Result: SSTGNN achieves superior performance in both in-domain and cross-domain settings on diverse benchmark datasets while being highly efficient with up to 42x fewer parameters than state-of-the-art models.

Conclusion: SSTGNN provides an effective, lightweight, and resource-friendly solution for real-world deployment of AI-generated video detection by capturing subtle manipulation traces through unified spatial-spectral-temporal analysis.

Abstract: The proliferation of generative video models has made detecting AI-generated and manipulated videos an urgent challenge. Existing detection approaches often fail to generalize across diverse manipulation types due to their reliance on isolated spatial, temporal, or spectral information, and typically require large models to perform well. This paper introduces SSTGNN, a lightweight Spatial-Spectral-Temporal Graph Neural Network framework that represents videos as structured graphs, enabling joint reasoning over spatial inconsistencies, temporal artifacts, and spectral distortions. SSTGNN incorporates learnable spectral filters and spatial-temporal differential modeling into a unified graph-based architecture, capturing subtle manipulation traces more effectively. Extensive experiments on diverse benchmark datasets demonstrate that SSTGNN not only achieves superior performance in both in-domain and cross-domain settings, but also offers strong efficiency and resource allocation. Remarkably, SSTGNN accomplishes these results with up to 42$\times$ fewer parameters than state-of-the-art models, making it highly lightweight and resource-friendly for real-world deployment.

[332] Learning Spatial Decay for Vision Transformers

Yuxin Mao, Zhen Qin, Jinxing Zhou, Bin Fan, Jing Zhang, Yiran Zhong, Yuchao Dai

Main category: cs.CV

TL;DR: SDT introduces data-dependent spatial decay to vision transformers using a Context-Aware Gating mechanism that dynamically modulates attention based on both content relevance and spatial proximity, outperforming fixed spatial decay methods.

Details

Motivation: Vision Transformers lack explicit spatial inductive biases, and existing approaches use fixed, data-independent spatial decay that applies uniform attention weighting regardless of image content, limiting adaptability to diverse visual scenarios.

Method: Spatial Decay Transformer (SDT) with Context-Aware Gating (CAG) mechanism that generates dynamic, data-dependent decay for patch interactions. Uses a unified spatial-content fusion framework integrating manhattan distance-based spatial priors with learned content representations.

Result: Extensive experiments on ImageNet-1K classification and generation tasks demonstrate consistent improvements over strong baselines.

Conclusion: Establishes data-dependent spatial decay as a new paradigm for enhancing spatial attention in vision transformers, successfully adapting content-aware gating mechanisms from language models to 2D vision.

Abstract: Vision Transformers (ViTs) have revolutionized computer vision, yet their self-attention mechanism lacks explicit spatial inductive biases, leading to suboptimal performance on spatially-structured tasks. Existing approaches introduce data-independent spatial decay based on fixed distance metrics, applying uniform attention weighting regardless of image content and limiting adaptability to diverse visual scenarios. Inspired by recent advances in large language models where content-aware gating mechanisms (e.g., GLA, HGRN2, FOX) significantly outperform static alternatives, we present the first successful adaptation of data-dependent spatial decay to 2D vision transformers. We introduce \textbf{Spatial Decay Transformer (SDT)}, featuring a novel Context-Aware Gating (CAG) mechanism that generates dynamic, data-dependent decay for patch interactions. Our approach learns to modulate spatial attention based on both content relevance and spatial proximity. We address the fundamental challenge of 1D-to-2D adaptation through a unified spatial-content fusion framework that integrates manhattan distance-based spatial priors with learned content representations. Extensive experiments on ImageNet-1K classification and generation tasks demonstrate consistent improvements over strong baselines. Our work establishes data-dependent spatial decay as a new paradigm for enhancing spatial attention in vision transformers.

[333] Beyond Cosine Similarity Magnitude-Aware CLIP for No-Reference Image Quality Assessment

Zhicheng Liao, Dongxu Wu, Zhenshan Shi, Sijie Mai, Hanwei Zhu, Lingyu Zhu, Yuncheng Jiang, Baoliang Chen

Main category: cs.CV

TL;DR: Novel adaptive fusion framework for NR-IQA that combines CLIP’s cosine similarity with magnitude-aware quality cues, outperforming existing methods without task-specific training.

Details

Motivation: Current CLIP-based NR-IQA methods only use cosine similarity between image embeddings and textual prompts, ignoring the important cue of CLIP image feature magnitudes which empirically correlate with perceptual quality.

Method: Extract absolute CLIP image features, apply Box-Cox transformation for statistical normalization, use resulting scalar as auxiliary quality cue, and design confidence-guided fusion scheme to adaptively combine with cosine-based prompt matching.

Result: Extensive experiments on multiple benchmark IQA datasets show the method consistently outperforms standard CLIP-based IQA and state-of-the-art baselines.

Conclusion: The magnitude of CLIP image features provides valuable quality cues, and adaptive fusion of magnitude-aware and cosine-based signals significantly improves NR-IQA performance without requiring task-specific training.

Abstract: Recent efforts have repurposed the Contrastive Language-Image Pre-training (CLIP) model for No-Reference Image Quality Assessment (NR-IQA) by measuring the cosine similarity between the image embedding and textual prompts such as “a good photo” or “a bad photo.” However, this semantic similarity overlooks a critical yet underexplored cue: the magnitude of the CLIP image features, which we empirically find to exhibit a strong correlation with perceptual quality. In this work, we introduce a novel adaptive fusion framework that complements cosine similarity with a magnitude-aware quality cue. Specifically, we first extract the absolute CLIP image features and apply a Box-Cox transformation to statistically normalize the feature distribution and mitigate semantic sensitivity. The resulting scalar summary serves as a semantically-normalized auxiliary cue that complements cosine-based prompt matching. To integrate both cues effectively, we further design a confidence-guided fusion scheme that adaptively weighs each term according to its relative strength. Extensive experiments on multiple benchmark IQA datasets demonstrate that our method consistently outperforms standard CLIP-based IQA and state-of-the-art baselines, without any task-specific training.

[334] STAGNet: A Spatio-Temporal Graph and LSTM Framework for Accident Anticipation

Vipooshan Vipulananthan, Kumudu Mohottala, Kavindu Chinthana, Nimsara Paramulla, Charith D Chitraranjan

Main category: cs.CV

TL;DR: STAGNet model improves accident prediction from dash-cam videos using enhanced spatio-temporal features and recurrent networks, outperforming previous methods across multiple datasets.

Details

Motivation: Accident prediction is crucial for road safety in ADAS and autonomous vehicles. While existing systems use multiple sensors (LiDAR, radar, GPS), dash-cam videos offer a more cost-effective and easily deployable solution, though more challenging.

Method: Proposed STAGNet model incorporates improved spatio-temporal features and aggregates them through a recurrent network to enhance state-of-the-art graph neural networks for accident prediction from dash-cam videos.

Result: Experiments on three public datasets (DAD, DoTA, DADA) show STAGNet achieves higher average precision and mean time-to-accident scores than previous methods, both in cross-validation and cross-dataset testing scenarios.

Conclusion: The proposed approach demonstrates superior performance for accident prediction from dash-cam videos, offering a cost-effective alternative to multi-sensor systems while maintaining high accuracy.

Abstract: Accident prediction and timely preventive actions improve road safety by reducing the risk of injury to road users and minimizing property damage. Hence, they are critical components of advanced driver assistance systems (ADAS) and autonomous vehicles. While many existing systems depend on multiple sensors such as LiDAR, radar, and GPS, relying solely on dash-cam videos presents a more challenging, yet more cost-effective and easily deployable solution. In this work, we incorporate improved spatio-temporal features and aggregate them through a recurrent network to enhance state-of-the-art graph neural networks for predicting accidents from dash-cam videos. Experiments using three publicly available datasets (DAD, DoTA and DADA) show that our proposed STAGNet model achieves higher average precision and mean time-to-accident scores than previous methods, both when cross-validated on a given dataset and when trained and tested on different datasets.

Hao Yin, Xin Man, Feiyu Chen, Jie Shao, Heng Tao Shen

Main category: cs.CV

TL;DR: FMFA is a cross-modal full-mode fine-grained alignment framework for text-to-image person retrieval that enhances global matching through explicit fine-grained alignment and implicit relational reasoning without extra supervision.

Details

Motivation: Existing TIPR methods lack verification of local feature alignment and focus too much on hard negative samples while neglecting incorrectly matched positive pairs, limiting cross-modal matching robustness.

Method: Proposes FMFA with two modules: 1) Adaptive Similarity Distribution Matching (A-SDM) to rectify unmatched positive pairs by adaptively pulling them closer in embedding space, and 2) Explicit Fine-grained Alignment (EFA) that strengthens explicit cross-modal interactions through similarity matrix sparsification and hard coding for local alignment.

Result: Achieves state-of-the-art results on three public datasets among all global matching methods for text-to-image person retrieval.

Conclusion: FMFA effectively addresses limitations of prior methods by combining explicit fine-grained alignment with implicit relational reasoning, improving cross-modal matching without additional supervision signals.

Abstract: Text-to-Image Person Retrieval (TIPR) is a cross-modal matching task designed to identify the person images that best correspond to a given textual description. The key difficulty in TIPR is to realize robust correspondence between the textual and visual modalities within a unified latent representation space. To address this challenge, prior approaches incorporate attention mechanisms for implicit cross-modal local alignment. However, they lack the ability to verify whether all local features are correctly aligned. Moreover, existing methods tend to emphasize the utilization of hard negative samples during model optimization to strengthen discrimination between positive and negative pairs, often neglecting incorrectly matched positive pairs. To mitigate these problems, we propose FMFA, a cross-modal Full-Mode Fine-grained Alignment framework, which enhances global matching through explicit fine-grained alignment and existing implicit relational reasoning – hence the term ``full-mode’’ – without introducing extra supervisory signals. In particular, we propose an Adaptive Similarity Distribution Matching (A-SDM) module to rectify unmatched positive sample pairs. A-SDM adaptively pulls the unmatched positive pairs closer in the joint embedding space, thereby achieving more precise global alignment. Additionally, we introduce an Explicit Fine-grained Alignment (EFA) module, which makes up for the lack of verification capability of implicit relational reasoning. EFA strengthens explicit cross-modal fine-grained interactions by sparsifying the similarity matrix and employs a hard coding method for local alignment. We evaluate our method on three public datasets, where it attains state-of-the-art results among all global matching methods. The code for our method is publicly accessible at https://github.com/yinhao1102/FMFA.

[336] A Novel Metric for Detecting Memorization in Generative Models for Brain MRI Synthesis

Antonio Scardace, Lemuel Puglisi, Francesco Guarnera, Sebastiano Battiato, Daniele Ravì

Main category: cs.CV

TL;DR: DeepSSIM: A self-supervised metric for detecting memorization in medical image generative models, improving F1 scores by 52% over existing methods.

Details

Motivation: Deep generative models in medical imaging can memorize sensitive training data, risking patient privacy. Current methods struggle to detect this memorization at scale.

Method: DeepSSIM learns to project images into an embedding space where cosine similarity matches ground-truth SSIM scores. Uses structure-preserving augmentations to capture anatomical features without requiring precise spatial alignment.

Result: Tested on synthetic brain MRI data from LDM trained on 2,195 MRI scans (IXI and CoRR datasets). Achieved superior performance with average F1 score improvement of +52.03% over best existing method.

Conclusion: DeepSSIM provides an effective, scalable solution for detecting memorization in medical image generative models, addressing critical privacy concerns in healthcare AI.

Abstract: Deep generative models have emerged as a transformative tool in medical imaging, offering substantial potential for synthetic data generation. However, recent empirical studies highlight a critical vulnerability: these models can memorize sensitive training data, posing significant risks of unauthorized patient information disclosure. Detecting memorization in generative models remains particularly challenging, necessitating scalable methods capable of identifying training data leakage across large sets of generated samples. In this work, we propose DeepSSIM, a novel self-supervised metric for quantifying memorization in generative models. DeepSSIM is trained to: i) project images into a learned embedding space and ii) force the cosine similarity between embeddings to match the ground-truth SSIM (Structural Similarity Index) scores computed in the image space. To capture domain-specific anatomical features, training incorporates structure-preserving augmentations, allowing DeepSSIM to estimate similarity reliably without requiring precise spatial alignment. We evaluate DeepSSIM in a case study involving synthetic brain MRI data generated by a Latent Diffusion Model (LDM) trained under memorization-prone conditions, using 2,195 MRI scans from two publicly available datasets (IXI and CoRR). Compared to state-of-the-art memorization metrics, DeepSSIM achieves superior performance, improving F1 scores by an average of +52.03% over the best existing method. Code and data of our approach are publicly available at the following link: https://github.com/brAIn-science/DeepSSIM.

[337] $\mathbf{R}^3$: Reconstruction, Raw, and Rain: Deraining Directly in the Bayer Domain

Nate Rothschild, Moshe Kimhi, Avi Mendelson, Chaim Baskin

Main category: cs.CV

TL;DR: Using raw Bayer data instead of processed sRGB images yields better rain removal with less computation, advocating for ISP-last pipelines in low-level vision tasks.

Details

Motivation: Current image reconstruction networks use post-ISP sRGB images which lose color information, clip dynamic range, and blur details due to irreversible processing. The paper aims to show these losses are avoidable by working directly with raw Bayer data.

Method: 1) Compare post-ISP vs Bayer reconstruction pipelines, 2) Create Raw-Rain benchmark with real rainy scenes in both 12-bit Bayer and bit-depth-matched sRGB, 3) Introduce Information Conservation Score (ICS) as a color-invariant metric aligned with human perception.

Result: Raw-domain model improves sRGB results by up to +0.99 dB PSNR and +1.2% ICS while running faster with half the GFLOPs. The raw approach outperforms traditional sRGB-based methods.

Conclusion: Advocates for ISP-last paradigm in low-level vision tasks and opens door to end-to-end learnable camera pipelines by demonstrating superior reconstruction from raw Bayer data compared to processed sRGB images.

Abstract: Image reconstruction from corrupted images is crucial across many domains. Most reconstruction networks are trained on post-ISP sRGB images, even though the image-signal-processing pipeline irreversibly mixes colors, clips dynamic range, and blurs fine detail. This paper uses the rain degradation problem as a use case to show that these losses are avoidable, and demonstrates that learning directly on raw Bayer mosaics yields superior reconstructions. To substantiate the claim, we (i) evaluate post-ISP and Bayer reconstruction pipelines, (ii) curate Raw-Rain, the first public benchmark of real rainy scenes captured in both 12-bit Bayer and bit-depth-matched sRGB, and (iii) introduce Information Conservation Score (ICS), a color-invariant metric that aligns more closely with human opinion than PSNR or SSIM. On the test split, our raw-domain model improves sRGB results by up to +0.99 dB PSNR and +1.2% ICS, while running faster with half of the GFLOPs. The results advocate an ISP-last paradigm for low-level vision and open the door to end-to-end learnable camera pipelines.

[338] Fully Automated Deep Learning Based Glenoid Bone Loss Measurement and Severity Stratification on 3D CT in Shoulder Instability

Zhonghao Liu, Hanxue Gu, Qihang Li, Michael Fox, Jay M. Levin, Maciej A. Mazurowski, Brian C. Lau

Main category: cs.CV

TL;DR: A fully automated deep learning pipeline for measuring glenoid bone loss on 3D CT scans using segmentation, landmark detection, and geometric fitting methods.

Details

Motivation: To develop a reliable, automated tool for measuring glenoid bone loss from CT scans to assist clinicians with preoperative planning for shoulder instability, addressing the need for consistent and accurate measurements.

Method: Three-stage pipeline: 1) U-Net segmentation of glenoid and humerus, 2) neural network prediction of glenoid rim points, 3) PCA, projection, and circle fitting for bone loss percentage calculation.

Result: Strong agreement with consensus readings (ICC 0.84 vs 0.78), outperforming surgeon-to-surgeon consistency, with good classification sensitivity (71.4% low-severity, 85.7% high-severity) and no misclassification between severity groups.

Conclusion: The automated deep learning pipeline is clinically reliable for glenoid bone loss measurement and can assist with preoperative planning; model and dataset are publicly released.

Abstract: To develop and validate a fully automated, deep-learning pipeline for measuring glenoid bone loss on 3D CT scans using linear-based, en-face view, and best-circle method. Shoulder CT scans of 81 patients were retrospectively collected between January 2013 and March 2023. Our algorithm consists of three main stages: (1) Segmentation, where we developed a U-Net to automatically segment the glenoid and humerus; (2) anatomical landmark detection, where a second network predicts glenoid rim points; and (3) geometric fitting, where we applied a principal component analysis (PCA), projection, and circle fitting to compute the percentage of bone loss. The performance of the pipeline was evaluated using DSC for segmentation and MAE and ICC for bone-loss measurement; intermediate outputs (rim point sets and en-face view) were also assessed. Automated measurements showed strong agreement with consensus readings, exceeding surgeon-to-surgeon consistency (ICC 0.84 vs 0.78 for all patients; ICC 0.71 vs 0.63 for low bone loss; ICC 0.83 vs 0.21 for high bone loss; P < 0.001). For the classification task of assigning each patient to different bone loss severity subgroups, the pipeline’s sensitivity was 71.4% for the low-severity group and 85.7% for the high-severity group, with no instances of misclassifying low as high or vice versa. A fully automated, deep learning-based pipeline for glenoid bone-loss measurement on CT scans can be a clinically reliable tool to assist clinicians with preoperative planning for shoulder instability. We are releasing our model and dataset at https://github.com/Edenliu1/Auto-Glenoid-Measurement-DL-Pipeline .

[339] Object-Centric Representation Learning for Enhanced 3D Scene Graph Prediction

KunHo Heo, GiHyun Kim, SuYeon Kim, MyeongAh Cho

Main category: cs.CV

TL;DR: The paper proposes a novel approach for 3D semantic scene graph prediction that focuses on improving object feature quality through a discriminative encoder and contrastive pretraining, leading to significant performance gains.

Details

Motivation: Previous 3D semantic scene graph prediction methods fail to optimize object and relationship feature representations, showing excessive reliance on Graph Neural Networks despite insufficient discriminative capability. The authors identify that object feature quality is critical for overall scene graph accuracy.

Method: The authors design a highly discriminative object feature encoder and employ a contrastive pretraining strategy that decouples object representation learning from scene graph prediction. They also effectively combine geometric and semantic features for relationship prediction, addressing the underutilization of relationship information in existing approaches.

Result: When plugging the pretrained encoder into existing frameworks, substantial performance improvements are observed across all evaluation metrics. Comprehensive experiments on the 3DSSG dataset demonstrate that the approach significantly outperforms previous state-of-the-art methods.

Conclusion: The paper shows that improving object feature quality through discriminative encoding and contrastive pretraining is crucial for 3D semantic scene graph prediction, and that effectively combining geometric and semantic features leads to superior relationship prediction performance.

Abstract: 3D Semantic Scene Graph Prediction aims to detect objects and their semantic relationships in 3D scenes, and has emerged as a crucial technology for robotics and AR/VR applications. While previous research has addressed dataset limitations and explored various approaches including Open-Vocabulary settings, they frequently fail to optimize the representational capacity of object and relationship features, showing excessive reliance on Graph Neural Networks despite insufficient discriminative capability. In this work, we demonstrate through extensive analysis that the quality of object features plays a critical role in determining overall scene graph accuracy. To address this challenge, we design a highly discriminative object feature encoder and employ a contrastive pretraining strategy that decouples object representation learning from the scene graph prediction. This design not only enhances object classification accuracy but also yields direct improvements in relationship prediction. Notably, when plugging in our pretrained encoder into existing frameworks, we observe substantial performance improvements across all evaluation metrics. Additionally, whereas existing approaches have not fully exploited the integration of relationship information, we effectively combine both geometric and semantic features to achieve superior relationship prediction. Comprehensive experiments on the 3DSSG dataset demonstrate that our approach significantly outperforms previous state-of-the-art methods. Our code is publicly available at https://github.com/VisualScienceLab-KHU/OCRL-3DSSG-Codes.

[340] IUT-Plug: A Plug-in tool for Interleaved Image-Text Generation

Zeteng Lin, Xingxing Li, Wen You, Xiaoyang Li, Zehan Lu, Yujun Cai, Jing Tang

Main category: cs.CV

TL;DR: IUT-Plug enhances vision-language models with explicit structured reasoning using an Image Understanding Tree to reduce context drift in logic, object identity, and style during multimodal generation.

Details

Motivation: Existing VLMs like GPT-4 and DALL.E struggle to preserve logic, object identity, and style in multimodal image-text generation, limiting their generalization in complex image-text scenarios.

Method: Two-stage framework: (1) dynamic IUT-Plug extraction module parses visual scenes into hierarchical symbolic structures, (2) coordinated narrative-flow and image synthesis mechanism ensures cross-modal consistency.

Result: IUT-Plug improves accuracy on established benchmarks and effectively alleviates three critical forms of context drift across diverse multimodal QA scenarios, validated on a novel benchmark of 3,000 human-generated QA pairs.

Conclusion: The proposed IUT-Plug module grounded in Image Understanding Tree enhances VLMs through explicit structured reasoning, mitigating context drift and improving performance in complex multimodal generation tasks.

Abstract: Existing vision language models (VLMs), including GPT-4 and DALL.E, often struggle to preserve logic, object identity, and style in multimodal image-text generation. This limitation significantly hinders the generalization capability of VLMs in complex image-text input-output scenarios. To address this issue, we propose IUT-Plug, a module grounded in an Image Understanding Tree (IUT), which enhances existing interleaved VLMs through explicit structured reasoning, thereby mitigating context drift in logic, entity identity, and style. The proposed framework operates in two stages. (1) A dynamic IUT-Plug extraction module parses visual scenes into hierarchical symbolic structures. (2) A coordinated narrative-flow and image synthesis mechanism ensures cross-modal consistency. To evaluate our approach, we construct a novel benchmark based on 3,000 real human-generated question-answer pairs over fine-tuned large models, introducing a dynamic evaluation protocol for quantifying context drift in interleaved VLMs. Experimental results demonstrate that IUT-Plug not only improves accuracy on established benchmarks but also effectively alleviates the three critical forms of context drift across diverse multimodal question answering (QA) scenarios.

[341] Timepoint-Specific Benchmarking of Deep Learning Models for Glioblastoma Follow-Up MRI

Wenhao Guo, Golrokh Mirzaei

Main category: cs.CV

TL;DR: Deep learning models for glioblastoma progression vs pseudoprogression show comparable accuracy (~70-74%) across early follow-ups, with Mamba+CNN hybrid offering best accuracy-efficiency trade-off, though overall discrimination remains modest due to dataset challenges.

Details

Motivation: Differentiating true tumor progression from treatment-related pseudoprogression in glioblastoma is clinically challenging, especially at early follow-up. There's a need for stage-specific benchmarking of deep learning models to understand how performance varies across different post-radiation therapy time points.

Method: Analyzed 11 representative DL families (CNNs, LSTMs, hybrids, transformers, selective state-space models) using the Burdenko GBM Progression cohort (n=180). Trained under unified QC-driven pipeline with patient-level cross-validation, analyzing different post-RT scans independently to test architecture performance dependence on time-point.

Result: Accuracies were comparable across stages (~0.70-0.74), but discrimination improved at second follow-up with F1 and AUC increases. Mamba+CNN hybrid offered best accuracy-efficiency trade-off, transformers delivered competitive AUCs at higher computational cost, lightweight CNNs were efficient but less reliable. Performance sensitive to batch size.

Conclusion: Overall discrimination remains modest, reflecting intrinsic difficulty of TP vs PsP and dataset imbalance. Results establish stage-aware benchmark and motivate future work incorporating longitudinal modeling, multi-sequence MRI, and larger multi-center cohorts.

Abstract: Differentiating true tumor progression (TP) from treatment-related pseudoprogression (PsP) in glioblastoma remains challenging, especially at early follow-up. We present the first stage-specific, cross-sectional benchmarking of deep learning models for follow-up MRI using the Burdenko GBM Progression cohort (n = 180). We analyze different post-RT scans independently to test whether architecture performance depends on time-point. Eleven representative DL families (CNNs, LSTMs, hybrids, transformers, and selective state-space models) were trained under a unified, QC-driven pipeline with patient-level cross-validation. Across both stages, accuracies were comparable (~0.70-0.74), but discrimination improved at the second follow-up, with F1 and AUC increasing for several models, indicating richer separability later in the care pathway. A Mamba+CNN hybrid consistently offered the best accuracy-efficiency trade-off, while transformer variants delivered competitive AUCs at substantially higher computational cost and lightweight CNNs were efficient but less reliable. Performance also showed sensitivity to batch size, underscoring the need for standardized training protocols. Notably, absolute discrimination remained modest overall, reflecting the intrinsic difficulty of TP vs. PsP and the dataset’s size imbalance. These results establish a stage-aware benchmark and motivate future work incorporating longitudinal modeling, multi-sequence MRI, and larger multi-center cohorts.

[342] A solution to generalized learning from small training sets found in infant repeated visual experiences of individual objects

Frangil Ramirez, Elizabeth Clerkin, David J. Crandall, Linda B. Smith

Main category: cs.CV

TL;DR: Infants’ daily visual experiences show skewed distributions of object instances with lumpy similarity structures that support rapid generalization to novel objects.

Details

Motivation: To understand how one-year-old infants achieve adult-like generalization of object categories despite limited understanding of their early visual experiences.

Method: Analyzed head camera images from 14 infants during 87 mealtimes, quantifying instance distributions and similarity structures for 8 object categories using graph theoretic measures and computational experiments.

Result: Infants experience highly skewed distributions with many images of few objects and fewer images of other instances, creating lumpy similarity structures with interconnected clusters that support rapid generalization.

Conclusion: The specific structure of infants’ visual experiences - skewed distributions and lumpy similarity patterns - enables rapid category generalization, with implications for both human development and machine learning.

Abstract: One-year-old infants show immediate adult-like generalization of common object categories to novel instances. The field has limited understanding of how this early prowess is achieved. Here we provide evidence on infants’ daily-life visual experiences for 8 early-learned object categories. Using a corpus of infant head camera images recorded at mealtimes (87 mealtimes captured by 14 infants), we quantify the individual instances experienced by infants and the similarity structure of all images containing an instance of each category. The distribution of instances is highly skewed, containing, for each infant and category, many images of the same few objects along with fewer images of other instances. Graph theoretic measures of the similarity structure for individual categories reveal a lumpy mix of high similarity and high variability, organized into multiple but interconnected clusters of high-similarity images. In computational experiments, we show that creating training sets that include an oversampling of varied images from a single instance yields a lumpy similarity structure. We also show that these artificially-created training sets support generalization to novel instances after very few training experiences. We discuss implications for the development of visual object recognition in both humans and machines.

[343] DriveGen3D: Boosting Feed-Forward Driving Scene Generation with Efficient Video Diffusion

Weijie Wang, Jiagang Zhu, Zeyu Zhang, Xiaofeng Wang, Zheng Zhu, Guosheng Zhao, Chaojun Ni, Haoxiao Wang, Guan Huang, Xinze Chen, Yukun Zhou, Wenkang Qin, Duochao Shi, Haoyun Li, Yicheng Xiao, Donny Y. Chen, Jiwen Lu

Main category: cs.CV

TL;DR: DriveGen3D is a framework for generating high-quality, controllable dynamic 3D driving scenes that combines efficient long-term video generation with large-scale 3D reconstruction using multimodal conditional control.

Details

Motivation: Current approaches have limitations: prohibitive computational demands for extended temporal generation, focus only on prolonged video synthesis without 3D representation, or restriction to static single-scene reconstruction. There's a methodological gap that needs bridging.

Method: Two-component unified pipeline: 1) FastDrive-DiT - efficient video diffusion transformer for high-resolution, temporally coherent video synthesis under text and Bird’s-Eye-View layout guidance; 2) FastRecon3D - feed-forward module that rapidly builds 3D Gaussian representations across time for spatial-temporal consistency.

Result: Enables generation of long driving videos (up to 800×424 at 12 FPS) and corresponding 3D scenes, achieving state-of-the-art results while maintaining efficiency.

Conclusion: DriveGen3D successfully bridges the methodological gap by integrating accelerated long-term video generation with large-scale dynamic scene reconstruction through multimodal conditional control, offering a novel solution for high-quality, controllable 3D driving scene generation.

Abstract: We present DriveGen3D, a novel framework for generating high-quality and highly controllable dynamic 3D driving scenes that addresses critical limitations in existing methodologies. Current approaches to driving scene synthesis either suffer from prohibitive computational demands for extended temporal generation, focus exclusively on prolonged video synthesis without 3D representation, or restrict themselves to static single-scene reconstruction. Our work bridges this methodological gap by integrating accelerated long-term video generation with large-scale dynamic scene reconstruction through multimodal conditional control. DriveGen3D introduces a unified pipeline consisting of two specialized components: FastDrive-DiT, an efficient video diffusion transformer for high-resolution, temporally coherent video synthesis under text and Bird’s-Eye-View (BEV) layout guidance; and FastRecon3D, a feed-forward module that rapidly builds 3D Gaussian representations across time, ensuring spatial-temporal consistency. DriveGen3D enable the generation of long driving videos (up to $800\times424$ at $12$ FPS) and corresponding 3D scenes, achieving state-of-the-art results while maintaining efficiency.

[344] RaindropGS: A Benchmark for 3D Gaussian Splatting under Raindrop Conditions

Zhiqiang Teng, Tingting Chen, Beibei Lin, Zifeng Yuan, Xuanyi Li, Xuanyu Zhang, Shunli Zhang

Main category: cs.CV

TL;DR: RaindropGS is a new benchmark for evaluating 3D Gaussian Splatting under real-world raindrop conditions, addressing limitations of existing synthetic-only evaluations by including unconstrained images, pose estimation challenges, and domain gaps.

Details

Motivation: 3DGS performance degrades severely under raindrop conditions due to occlusions and distortions. Existing benchmarks use synthetic raindrops with known poses, ignoring real-world challenges like inaccurate pose estimation and the synthetic-real domain gap.

Method: Created RaindropGS benchmark with three parts: data preparation (collecting real-world raindrop dataset with three aligned image sets), data processing, and raindrop-aware 3DGS evaluation including pose estimation, rain removal, and 3D Gaussian training comparisons.

Result: Revealed critical insights: camera focus position significantly affects 3DGS reconstruction, and inaccurate pose/point cloud initialization interferes with reconstruction. Established performance limitations of existing methods on unconstrained raindrop images.

Conclusion: RaindropGS provides comprehensive evaluation framework for full 3DGS pipeline under real raindrop conditions, offering clear directions for developing more robust 3DGS methods by addressing pose estimation, initialization, and focus-related challenges.

Abstract: 3D Gaussian Splatting (3DGS) under raindrop conditions suffers from severe occlusions and optical distortions caused by raindrop contamination on the camera lens, substantially degrading reconstruction quality. Existing benchmarks typically evaluate 3DGS using synthetic raindrop images with known camera poses (constrained images), assuming ideal conditions. However, in real-world scenarios, raindrops often interfere with accurate camera pose estimation and point cloud initialization. Moreover, a significant domain gap between synthetic and real raindrops further impairs generalization. To tackle these issues, we introduce RaindropGS, a comprehensive benchmark designed to evaluate the full 3DGS pipeline-from unconstrained, raindrop-corrupted images to clear 3DGS reconstructions. Specifically, the whole benchmark pipeline consists of three parts: data preparation, data processing, and raindrop-aware 3DGS evaluation, including types of raindrop interference, camera pose estimation and point cloud initialization, single image rain removal comparison, and 3D Gaussian training comparison. First, we collect a real-world raindrop reconstruction dataset, in which each scene contains three aligned image sets: raindrop-focused, background-focused, and rain-free ground truth, enabling a comprehensive evaluation of reconstruction quality under different focus conditions. Through comprehensive experiments and analyses, we reveal critical insights into the performance limitations of existing 3DGS methods on unconstrained raindrop images and the varying impact of different pipeline components: the impact of camera focus position on 3DGS reconstruction performance, and the interference caused by inaccurate pose and point cloud initialization on reconstruction. These insights establish clear directions for developing more robust 3DGS methods under raindrop conditions.

[345] Towards Generalisable Foundation Models for Brain MRI

Moona Mazher, Geoff J. M. Parker, Daniel C. Alexander

Main category: cs.CV

TL;DR: BrainFound is a 3D self-supervised foundation model for brain MRI that extends DINO-v2 to handle volumetric data, supporting multimodal inputs and outperforming existing methods in label-scarce settings.

Details

Motivation: Foundation models are transforming medical imaging, but existing approaches often focus on 2D natural images or single-slice paradigms. There's a need for models that can handle full 3D brain anatomy from MRI data, work with multimodal inputs, and perform well in label-scarce clinical scenarios.

Method: Extends DINO-v2 (vision transformer) to model full 3D brain anatomy by incorporating volumetric information from sequential MRI slices. Supports both single- and multimodal inputs (e.g., T1, T2, FLAIR) and enables various downstream tasks like disease detection and image segmentation.

Result: Consistently outperforms existing self-supervised pretraining strategies and supervised baselines, particularly in label-scarce and multi-contrast settings. Enhances diagnostic accuracy and reduces dependency on extensive expert annotations.

Conclusion: BrainFound provides a scalable and practical solution for 3D neuroimaging pipelines with significant potential for clinical deployment and research innovation, offering flexibility across varied imaging protocols and clinical scenarios.

Abstract: Foundation models in artificial intelligence (AI) are transforming medical imaging by enabling general-purpose feature learning from large-scale, unlabeled datasets. In this work, we introduce BrainFound, a self-supervised foundation model for brain MRI, built by extending DINO-v2, a vision transformer originally designed for 2D natural images. BrainFound adapts DINO-v2 to model full 3D brain anatomy by incorporating volumetric information from sequential MRI slices, moving beyond conventional single-slice paradigms. It supports both single- and multimodal inputs, enabling a broad range of downstream tasks, including disease detection and image segmentation, while generalising across varied imaging protocols and clinical scenarios. We show that BrainFound consistently outperforms existing self-supervised pretraining strategies and supervised baselines, particularly in label-scarce and multi-contrast settings. By integrating information from diverse 3D MRI modalities (e.g., T1, T2, FLAIR), it enhances diagnostic accuracy and reduces dependency on extensive expert annotations. This flexibility makes BrainFound a scalable and practical solution for 3D neuroimaging pipelines, with significant potential for clinical deployment and research innovation.

Xin Jin, Siyuan Li, Siyong Jian, Kai Yu, Huan Wang

Main category: cs.CV

TL;DR: MergeMix is a unified paradigm that bridges SFT and RL for vision-language alignment in MLLMs using Token Merge based Mixup augmentation and preference-driven optimization.

Details

Motivation: Current methods for aligning multi-modal large language models (MLLMs) have limitations: SFT requires human annotations and lacks task generalization, while RL suffers from computational overhead and instability. There's a need for a balanced approach that offers scalability, efficiency, and alignment generalization.

Method: MergeMix uses Token Merge based Mixup augmentation to generate contextual aligned mixed images with corresponding labels based on merged attention maps with cluster regions. It builds preference pairs between raw images and MergeMix-generated ones, then optimizes the soft preference margin using mixed SimPO loss.

Result: Extensive experiments show MergeMix achieves dominant classification accuracy as an augmentation method, improves generalization abilities and alignment of MLLMs, and provides a new learning paradigm for preference alignment with training efficiency and stability.

Conclusion: MergeMix successfully bridges SFT and RL, offering a balanced solution for vision-language alignment in MLLMs that addresses the limitations of existing methods while providing scalability, efficiency, and improved alignment generalization.

Abstract: Vision-language alignment in multi-modal large language models (MLLMs) relies on supervised fine-tuning (SFT) or reinforcement learning (RL). To align multi-modal large language models (MLLMs) in the post-training stage, supervised fine-tuning (SFT) is a stable choice but requires human annotations and lacks task generalizations, while Reinforcement Learning (RL) searches for better answers from reward signals but suffers from computational overhead and instability. To achieve balance among scalability, efficiency, and alignment generalizations, we propose MergeMix, a unified paradigm that bridges SFT and RL with an efficient Token Merge based Mixup augmentation. As for the Mixup policy, we generate contextual aligned mixed images with the corresponding labels according to the merged attention maps with cluster regions. Then, we enhance the preference-driven paradigm for MLLMs by building preference pairs with raw images and MergeMix-generated ones and optimizing the soft preference margin with the mixed SimPO loss. Extensive experiments demonstrate that MergeMix not only achieves dominant classification accuracy as an augmentation method but also improves generalization abilities and alignment of MLLMs, providing a new learning paradigm for preference alignment with training efficiency and stability.

[347] Class Incremental Medical Image Segmentation via Prototype-Guided Calibration and Dual-Aligned Distillation

Shengqian Zhu, Chengrong Yu, Qiang Wang, Ying Song, Guangjun Li, Jiafei Wu, Xiaogang Xu, Zhang Yi, Junjie Hu

Main category: cs.CV

TL;DR: PGCD and DAPD methods for class incremental medical image segmentation that use prototype-guided calibration and dual-aligned distillation to better preserve old knowledge while learning new classes.

Details

Motivation: Existing CIMIS methods have two main issues: 1) one-size-fits-all strategies treat all spatial regions and feature channels equally, hindering accurate old knowledge preservation; 2) methods focus only on aligning local prototypes with global ones for old classes while ignoring their local representations in new data, causing knowledge degradation.

Method: Two complementary techniques: 1) Prototype-Guided Calibration Distillation (PGCD) uses prototype-to-feature similarity to calibrate class-specific distillation intensity in different spatial regions, reinforcing reliable old knowledge and suppressing misleading information. 2) Dual-Aligned Prototype Distillation (DAPD) aligns local prototypes of old classes from the current model with both global prototypes and local prototypes to enhance segmentation performance on old categories.

Result: Comprehensive evaluations on two widely used multi-organ segmentation benchmarks demonstrate that the method outperforms state-of-the-art methods, highlighting its robustness and generalization capabilities.

Conclusion: The proposed PGCD and DAPD framework effectively addresses limitations of existing CIMIS methods by providing targeted distillation calibration and comprehensive prototype alignment, leading to superior performance in preserving old knowledge while learning new classes in medical image segmentation.

Abstract: Class incremental medical image segmentation (CIMIS) aims to preserve knowledge of previously learned classes while learning new ones without relying on old-class labels. However, existing methods 1) either adopt one-size-fits-all strategies that treat all spatial regions and feature channels equally, which may hinder the preservation of accurate old knowledge, 2) or focus solely on aligning local prototypes with global ones for old classes while overlooking their local representations in new data, leading to knowledge degradation. To mitigate the above issues, we propose Prototype-Guided Calibration Distillation (PGCD) and Dual-Aligned Prototype Distillation (DAPD) for CIMIS in this paper. Specifically, PGCD exploits prototype-to-feature similarity to calibrate class-specific distillation intensity in different spatial regions, effectively reinforcing reliable old knowledge and suppressing misleading information from old classes. Complementarily, DAPD aligns the local prototypes of old classes extracted from the current model with both global prototypes and local prototypes, further enhancing segmentation performance on old categories. Comprehensive evaluations on two widely used multi-organ segmentation benchmarks demonstrate that our method outperforms state-of-the-art methods, highlighting its robustness and generalization capabilities.

[348] D$^{2}$-VPR: A Parameter-efficient Visual-foundation-model-based Visual Place Recognition Method via Knowledge Distillation and Deformable Aggregation

Zheyuan Zhang, Jiwei Zhang, Boyu Zhou, Linzhimeng Duan, Hong Chen

Main category: cs.CV

TL;DR: D²-VPR: A distillation- and deformable-based framework for Visual Place Recognition that reduces model parameters by ~64.2% while maintaining competitive performance with state-of-the-art methods.

Details

Motivation: While DINOv2 foundation models improve VPR performance through strong feature generalization, they come with high model complexity and computational overhead that hinder deployment on resource-constrained devices.

Method: Two-stage training with knowledge distillation and fine-tuning, plus a Distillation Recovery Module (DRM) to align teacher-student feature spaces. Also includes a Top-Down-attention-based Deformable Aggregator (TDDA) that dynamically adjusts Regions of Interest using global semantic features.

Result: Achieves competitive performance compared to state-of-the-art approaches while reducing parameter count by approximately 64.2% compared to CricaVPR.

Conclusion: D²-VPR successfully balances performance and efficiency, retaining strong feature extraction capabilities while significantly reducing computational requirements for practical deployment.

Abstract: Visual Place Recognition (VPR) aims to determine the geographic location of a query image by retrieving its most visually similar counterpart from a geo-tagged reference database. Recently, the emergence of the powerful visual foundation model, DINOv2, trained in a self-supervised manner on massive datasets, has significantly improved VPR performance. This improvement stems from DINOv2’s exceptional feature generalization capabilities but is often accompanied by increased model complexity and computational overhead that impede deployment on resource-constrained devices. To address this challenge, we propose $D^{2}$-VPR, a $D$istillation- and $D$eformable-based framework that retains the strong feature extraction capabilities of visual foundation models while significantly reducing model parameters and achieving a more favorable performance-efficiency trade-off. Specifically, first, we employ a two-stage training strategy that integrates knowledge distillation and fine-tuning. Additionally, we introduce a Distillation Recovery Module (DRM) to better align the feature spaces between the teacher and student models, thereby minimizing knowledge transfer losses to the greatest extent possible. Second, we design a Top-Down-attention-based Deformable Aggregator (TDDA) that leverages global semantic features to dynamically and adaptively adjust the Regions of Interest (ROI) used for aggregation, thereby improving adaptability to irregular structures. Extensive experiments demonstrate that our method achieves competitive performance compared to state-of-the-art approaches. Meanwhile, it reduces the parameter count by approximately 64.2% (compared to CricaVPR).Code is available at https://github.com/tony19980810/D2VPR.

[349] MCAQ-YOLO: Morphological Complexity-Aware Quantization for Efficient Object Detection with Curriculum Learning

Yoonjae Seo, Ermal Elbasani, Jaehong Lee

Main category: cs.CV

TL;DR: MCAQ-YOLO introduces tile-wise spatial mixed-precision quantization for object detectors using morphological complexity metrics to allocate bits efficiently, achieving high performance with minimal overhead.

Details

Motivation: Most neural network quantization methods use uniform bit precision across spatial regions, ignoring the heterogeneous complexity in visual data. This leads to inefficient bit allocation where simple regions waste bits and complex regions lack sufficient precision.

Method: Proposes morphological complexity (measured by fractal dimension, texture entropy, gradient variance, edge density, and contour complexity) as signal-centric predictor of quantization sensitivity. Uses calibration-time analysis for spatial bit allocation with only 0.3ms overhead. Introduces curriculum-based training that progressively increases quantization difficulty to stabilize optimization.

Result: On construction safety equipment dataset: 85.6% mAP@0.5 with average 4.2 bits and 7.6x compression, outperforming uniform 4-bit quantization by 3.5 percentage points. Cross-dataset evaluation shows consistent improvements: COCO 2017 (+2.9%) and Pascal VOC 2012 (+2.3%). Achieves 151 FPS throughput.

Conclusion: MCAQ-YOLO demonstrates that spatial mixed-precision quantization guided by morphological complexity metrics enables efficient bit allocation for real-time object detection, with performance gains correlating with within-image complexity variation.

Abstract: Most neural network quantization methods apply uniform bit precision across spatial regions, disregarding the heterogeneous complexity inherent in visual data. This paper introduces MCAQ-YOLO, a practical framework for tile-wise spatial mixed-precision quantization in real-time object detectors. Morphological complexity–quantified through five complementary metrics (fractal dimension, texture entropy, gradient variance, edge density, and contour complexity)–is proposed as a signal-centric predictor of spatial quantization sensitivity. A calibration-time analysis design enables spatial bit allocation with only 0.3ms inference overhead, achieving 151 FPS throughput. Additionally, a curriculum-based training scheme that progressively increases quantization difficulty is introduced to stabilize optimization and accelerate convergence. On a construction safety equipment dataset exhibiting high morphological variability, MCAQ-YOLO achieves 85.6% mAP@0.5 with an average bit-width of 4.2 bits and a 7.6x compression ratio, outperforming uniform 4-bit quantization by 3.5 percentage points. Cross-dataset evaluation on COCO 2017 (+2.9%) and Pascal VOC 2012 (+2.3%) demonstrates consistent improvements, with performance gains correlating with within-image complexity variation.

Zhenguo Zhang, Haohan Zheng, Yishen Wang, Le Xu, Tianchen Deng, Xuefeng Chen, Qu Chen, Bo Zhang, Wuxiong Huang

Main category: cs.CV

TL;DR: OmniDrive-R1 is an end-to-end VLM framework for autonomous driving that addresses object hallucination through interleaved multi-modal Chain-of-Thought reasoning and reinforcement-driven visual grounding, eliminating the need for dense localization labels.

Details

Motivation: Vision-Language Models (VLMs) in autonomous driving suffer from reliability failures like object hallucination due to ungrounded text-based reasoning. Existing multi-modal CoT approaches have flaws: decoupled perception/reasoning stages preventing end-to-end optimization, and reliance on expensive dense localization labels.

Method: OmniDrive-R1 uses an interleaved Multi-modal Chain-of-Thought (iMCoT) mechanism that unifies perception and reasoning. It features reinforcement-driven visual grounding that autonomously directs attention to critical regions. The training uses a two-stage reinforcement learning pipeline with Clip-GRPO algorithm, which introduces an annotation-free, process-based grounding reward enforcing cross-modal consistency.

Result: On DriveLMM-o1 benchmark, OmniDrive-R1 improves overall reasoning score from 51.77% to 80.35% and final answer accuracy from 37.81% to 73.62% compared to baseline Qwen2.5VL-7B.

Conclusion: The proposed end-to-end VLM framework with iMCoT and reinforcement-driven visual grounding effectively addresses object hallucination in autonomous driving applications, achieving significant performance improvements without requiring expensive dense labels.

Abstract: The deployment of Vision-Language Models (VLMs) in safety-critical domains like autonomous driving (AD) is critically hindered by reliability failures, most notably object hallucination. This failure stems from their reliance on ungrounded, text-based Chain-of-Thought (CoT) reasoning. While existing multi-modal CoT approaches attempt mitigation, they suffer from two fundamental flaws: (1) decoupled perception and reasoning stages that prevent end-to-end joint optimization, and (2) reliance on expensive, dense localization labels. Thus we introduce OmniDrive-R1, an end-to-end VLM framework designed for autonomous driving, which unifies perception and reasoning through an interleaved Multi-modal Chain-of-Thought (iMCoT) mechanism. Our core innovation is an Reinforcement-driven visual grounding capability, enabling the model to autonomously direct its attention and “zoom in” on critical regions for fine-grained analysis. This capability is enabled by our pure two-stage reinforcement learning training pipeline and Clip-GRPO algorithm. Crucially, Clip-GRPO introduces an annotation-free, process-based grounding reward. This reward not only eliminates the need for dense labels but also circumvents the instability of external tool calls by enforcing real-time cross-modal consistency between the visual focus and the textual reasoning. Extensive experiments on DriveLMM-o1 demonstrate our model’s significant improvements. Compared to the baseline Qwen2.5VL-7B, OmniDrive-R1 improves the overall reasoning score from 51.77% to 80.35%, and the final answer accuracy from 37.81% to 73.62%.

[351] RefineVAD: Semantic-Guided Feature Recalibration for Weakly Supervised Video Anomaly Detection

Junhee Lee, ChaeBeen Bang, MyoungChul Kim, MyeongAh Cho

Main category: cs.CV

TL;DR: RefineVAD is a weakly-supervised video anomaly detection framework that mimics human dual-process reasoning by jointly modeling temporal motion patterns and semantic category structures, addressing the oversimplification of treating all anomalies as a single category.

Details

Motivation: Existing weakly-supervised video anomaly detection methods oversimplify the anomaly space by treating all abnormal events as a single category, overlooking the diverse semantic and temporal characteristics of real-world anomalies. The authors are inspired by how humans perceive anomalies through joint interpretation of temporal motion patterns and semantic structures.

Method: RefineVAD integrates two core modules: 1) Motion-aware Temporal Attention and Recalibration (MoTAR) estimates motion salience and dynamically adjusts temporal focus using shift-based attention and global Transformer-based modeling; 2) Category-Oriented Refinement (CORE) injects soft anomaly category priors by aligning segment-level features with learnable category prototypes through cross-attention.

Result: Extensive experiments on WVAD benchmark validate the effectiveness of RefineVAD and highlight the importance of integrating semantic context to guide feature refinement toward anomaly-relevant patterns.

Conclusion: The proposed framework successfully addresses the limitation of oversimplified anomaly modeling by explicitly modeling both “how” motion evolves and “what” semantic category it resembles, demonstrating the value of integrating temporal dynamics with semantic structure for more accurate anomaly detection.

Abstract: Weakly-Supervised Video Anomaly Detection aims to identify anomalous events using only video-level labels, balancing annotation efficiency with practical applicability. However, existing methods often oversimplify the anomaly space by treating all abnormal events as a single category, overlooking the diverse semantic and temporal characteristics intrinsic to real-world anomalies. Inspired by how humans perceive anomalies, by jointly interpreting temporal motion patterns and semantic structures underlying different anomaly types, we propose RefineVAD, a novel framework that mimics this dual-process reasoning. Our framework integrates two core modules. The first, Motion-aware Temporal Attention and Recalibration (MoTAR), estimates motion salience and dynamically adjusts temporal focus via shift-based attention and global Transformer-based modeling. The second, Category-Oriented Refinement (CORE), injects soft anomaly category priors into the representation space by aligning segment-level features with learnable category prototypes through cross-attention. By jointly leveraging temporal dynamics and semantic structure, explicitly models both “how” motion evolves and “what” semantic category it resembles. Extensive experiments on WVAD benchmark validate the effectiveness of RefineVAD and highlight the importance of integrating semantic context to guide feature refinement toward anomaly-relevant patterns.

[352] BootOOD: Self-Supervised Out-of-Distribution Detection via Synthetic Sample Exposure under Neural Collapse

Yuanchao Wang, Tian Qin, Eduardo Valle, Bruno Abrahao

Main category: cs.CV

TL;DR: BootOOD is a self-supervised OOD detection framework that bootstraps from ID data, uses Neural Collapse properties, and introduces radius-based classification on feature norms to better handle semantically similar OOD samples.

Details

Motivation: Existing OOD detectors struggle when OOD samples are semantically similar to in-distribution classes, creating safety risks in real-world deployments. There's a need for methods that can handle these challenging cases without requiring external OOD data.

Method: BootOOD synthesizes pseudo-OOD features through simple transformations of ID representations, leverages Neural Collapse properties, and introduces a lightweight auxiliary head that performs radius-based classification on feature norms. This decouples OOD detection from the primary classifier and only requires OOD samples to have smaller feature norms than ID features.

Result: BootOOD outperforms prior post-hoc methods, surpasses training-based methods without outlier exposure, and is competitive with state-of-the-art outlier-exposure approaches while maintaining or improving ID accuracy on CIFAR-10, CIFAR-100, and ImageNet-200.

Conclusion: BootOOD provides an effective self-supervised approach for OOD detection that handles semantically challenging cases by leveraging Neural Collapse properties and radius-based feature norm classification, achieving strong performance without requiring external OOD data.

Abstract: Out-of-distribution (OOD) detection is critical for deploying image classifiers in safety-sensitive environments, yet existing detectors often struggle when OOD samples are semantically similar to the in-distribution (ID) classes. We present BootOOD, a fully self-supervised OOD detection framework that bootstraps exclusively from ID data and is explicitly designed to handle semantically challenging OOD samples. BootOOD synthesizes pseudo-OOD features through simple transformations of ID representations and leverages Neural Collapse (NC), where ID features cluster tightly around class means with consistent feature norms. Unlike prior approaches that aim to constrain OOD features into subspaces orthogonal to the collapsed ID means, BootOOD introduces a lightweight auxiliary head that performs radius-based classification on feature norms. This design decouples OOD detection from the primary classifier and imposes a relaxed requirement: OOD samples are learned to have smaller feature norms than ID features, which is easier to satisfy when ID and OOD are semantically close. Experiments on CIFAR-10, CIFAR-100, and ImageNet-200 show that BootOOD outperforms prior post-hoc methods, surpasses training-based methods without outlier exposure, and is competitive with state-of-the-art outlier-exposure approaches while maintaining or improving ID accuracy.

[353] InfSplign: Inference-Time Spatial Alignment of Text-to-Image Diffusion Models

Sarah Rastegar, Violeta Chatalbasheva, Sieger Falkena, Anuj Singh, Yanbo Wang, Tejas Gokhale, Hamid Palangi, Hadi Jamali-Rad

Main category: cs.CV

TL;DR: InfSplign is a training-free inference-time method that improves spatial alignment in text-to-image diffusion models by adjusting noise through a compound loss using cross-attention maps.

Details

Motivation: T2I diffusion models generate high-quality images but often fail to capture spatial relations specified in text prompts due to lack of fine-grained spatial supervision in training data and inability of text embeddings to encode spatial semantics.

Method: InfSplign adjusts noise through a compound loss in every denoising step, leveraging different levels of cross-attention maps extracted from the backbone decoder to enforce accurate object placement and balanced object presence during sampling.

Result: InfSplign establishes new SOTA on VISOR and T2I-CompBench, achieving substantial performance gains over existing inference-time baselines and even outperforming fine-tuning-based methods.

Conclusion: The method is lightweight, plug-and-play, compatible with any diffusion backbone, and provides training-free inference-time improvement for spatial alignment in T2I diffusion models.

Abstract: Text-to-image (T2I) diffusion models generate high-quality images but often fail to capture the spatial relations specified in text prompts. This limitation can be traced to two factors: lack of fine-grained spatial supervision in training data and inability of text embeddings to encode spatial semantics. We introduce InfSplign, a training-free inference-time method that improves spatial alignment by adjusting the noise through a compound loss in every denoising step. Proposed loss leverages different levels of cross-attention maps extracted from the backbone decoder to enforce accurate object placement and a balanced object presence during sampling. The method is lightweight, plug-and-play, and compatible with any diffusion backbone. Our comprehensive evaluations on VISOR and T2I-CompBench show that InfSplign establishes a new state-of-the-art (to the best of our knowledge), achieving substantial performance gains over the strongest existing inference-time baselines and even outperforming the fine-tuning-based methods. Codebase is available at GitHub.

[354] MambaIO: Global-Coordinate Inertial Odometry for Pedestrians via Multi-Scale Frequency-Decoupled Modeling

Shanshan Zhang, Liqin Wu, Wenying Cao, Siyue Wang, Tianshui Wen, Qi Zhang, Xuemin Hong, Ao Peng, Lingxiang Zheng, Yu Yang

Main category: cs.CV

TL;DR: MambaIO: A novel inertial odometry method using Mamba architecture with Laplacian pyramid decomposition for pedestrian localization, achieving state-of-the-art performance by processing IMU measurements in body frame rather than global frame.

Details

Motivation: Traditional inertial odometry transforms IMU measurements to global frame for smoother motion, but recent drone studies show body frame improves accuracy. This paper re-evaluates global frame suitability for pedestrian IO and proposes a better approach.

Method: MambaIO decomposes IMU measurements into high/low-frequency components using Laplacian pyramid. Low-frequency processed by Mamba architecture for contextual motion cues, high-frequency by convolutional structure for fine-grained details. Works in body frame.

Result: Experiments on multiple public datasets show MambaIO substantially reduces localization error and achieves state-of-the-art performance. First application of Mamba architecture to inertial odometry task.

Conclusion: Body frame is more effective than global frame for pedestrian inertial odometry. MambaIO’s frequency decomposition with Mamba architecture successfully captures both contextual and detailed motion patterns, setting new SOTA performance.

Abstract: Inertial Odometry (IO) enables real-time localization using only acceleration and angular velocity measurements from an Inertial Measurement Unit (IMU), making it a promising solution for localization in consumer-grade applications. Traditionally, researchers have routinely transformed IMU measurements into the global frame to obtain smoother motion representations. However, recent studies in drone scenarios have demonstrated that the body frame can significantly improve localization accuracy, prompting a re-evaluation of the suitability of the global frame for pedestrian IO. To address this issue, this paper systematically evaluates the effectiveness of the global frame in pedestrian IO through theoretical analysis, qualitative inspection, and quantitative experiments. Building upon these findings, we further propose MambaIO, which decomposes IMU measurements into high-frequency and low-frequency components using a Laplacian pyramid. The low-frequency component is processed by a Mamba architecture to extract implicit contextual motion cues, while the high-frequency component is handled by a convolutional structure to capture fine-grained local motion details. Experiments on multiple public datasets show that MambaIO substantially reduces localization error and achieves state-of-the-art (SOTA) performance. To the best of our knowledge, this is the first application of the Mamba architecture to the IO task.

[355] SPIDER: Spatial Image CorresponDence Estimator for Robust Calibration

Zhimin Shao, Abhay Yadav, Rama Chellappa, Cheng Peng

Main category: cs.CV

TL;DR: SPIDER is a universal feature matching framework that combines 2D and 3D correspondence estimation to handle challenging cross-domain image matching with large viewpoint changes.

Details

Motivation: Current feature matching methods struggle with unconstrained scenarios across domains (aerial, indoor, outdoor) due to appearance, scale, and viewpoint variations. While recent 3D foundation models offer spatial coherence, they focus on dominant planar regions and miss fine-grained geometric details, especially under large viewpoint changes.

Method: SPIDER integrates a shared feature extraction backbone with two specialized network heads: one for 2D-based correspondences and another for 3D-based correspondences, working from coarse to fine. The approach builds on insights from linear probe experiments evaluating various vision foundation models for image matching.

Result: SPIDER significantly outperforms state-of-the-art methods, demonstrating strong performance as a universal image-matching method. The paper also introduces a new image-matching evaluation benchmark focused on unconstrained scenarios with large baselines.

Conclusion: The proposed SPIDER framework effectively addresses limitations of both conventional 2D feature matching and recent 3D foundation models by combining their strengths, resulting in a universal solution for challenging cross-domain image correspondence problems with large viewpoint variations.

Abstract: Reliable image correspondences form the foundation of vision-based spatial perception, enabling recovery of 3D structure and camera poses. However, unconstrained feature matching across domains such as aerial, indoor, and outdoor scenes remains challenging due to large variations in appearance, scale and viewpoint. Feature matching has been conventionally formulated as a 2D-to-2D problem; however, recent 3D foundation models provides spatial feature matching properties based on two-view geometry. While powerful, we observe that these spatially coherent matches often concentrate on dominant planar regions, e.g., walls or ground surfaces, while being less sensitive to fine-grained geometric details, particularly under large viewpoint changes. To better understand these trade-offs, we first perform linear probe experiments to evaluate the performance of various vision foundation models for image matching. Building on these insights, we introduce SPIDER, a universal feature matching framework that integrates a shared feature extraction backbone with two specialized network heads for estimating both 2D-based and 3D-based correspondences from coarse to fine. Finally, we introduce an image-matching evaluation benchmark that focuses on unconstrained scenarios with large baselines. SPIDER significantly outperforms SoTA methods, demonstrating its strong ability as a universal image-matching method.

[356] Learning Visual Affordance from Audio

Lidong Lu, Guo Chen, Zhu Wei, Yicheng Liu, Tong Lu

Main category: cs.CV

TL;DR: AV-AG is a new task for segmenting object interaction regions from action sounds, using audio as real-time, semantically rich cues instead of text or videos. The paper introduces the first AV-AG dataset and AVAGFormer model with cross-modal fusion for mask prediction.

Details

Motivation: Existing approaches for affordance grounding rely on textual instructions or demonstration videos, which suffer from ambiguity or occlusion. Audio provides real-time, semantically rich, and visually independent cues that enable more intuitive understanding of interaction regions.

Method: Proposes AVAGFormer with a semantic-conditioned cross-modal mixer and dual-head decoder that effectively fuses audio and visual signals for mask prediction. Constructs the first AV-AG dataset with action sounds, object images, and pixel-level affordance annotations, including an unseen subset for zero-shot evaluation.

Result: AVAGFormer achieves state-of-the-art performance on AV-AG, surpassing baselines from related tasks. Comprehensive analyses highlight distinctions between AV-AG and AVS, benefits of end-to-end modeling, and contribution of each component.

Conclusion: Audio provides valuable cues for affordance grounding, and the proposed AVAGFormer effectively fuses audio-visual signals for interaction region segmentation. The released dataset and code support further research in this direction.

Abstract: We introduce Audio-Visual Affordance Grounding (AV-AG), a new task that segments object interaction regions from action sounds. Unlike existing approaches that rely on textual instructions or demonstration videos, which often limited by ambiguity or occlusion, audio provides real-time, semantically rich, and visually independent cues for affordance grounding, enabling more intuitive understanding of interaction regions. To support this task, we construct the first AV-AG dataset, comprising a large collection of action sounds, object images, and pixel-level affordance annotations. The dataset also includes an unseen subset to evaluate zero-shot generalization. Furthermore, we propose AVAGFormer, a model equipped with a semantic-conditioned cross-modal mixer and a dual-head decoder that effectively fuses audio and visual signals for mask prediction. Experiments show that AVAGFormer achieves state-of-the-art performance on AV-AG, surpassing baselines from related tasks. Comprehensive analyses highlight the distinctions between AV-AG and AVS, the benefits of end-to-end modeling, and the contribution of each component. Code and dataset have been released on https://jscslld.github.io/AVAGFormer/.

[357] ReCamDriving: LiDAR-Free Camera-Controlled Novel Trajectory Video Generation

Yaokun Li, Shuaixian Wang, Mantang Guo, Jiehui Huang, Taojun Ding, Mu Hu, Kaixuan Wang, Shaojie Shen, Guang Tan

Main category: cs.CV

TL;DR: ReCamDriving is a vision-based framework for generating novel driving trajectory videos using camera control, leveraging 3D Gaussian Splatting renderings for geometric guidance and a two-stage training approach to avoid overfitting.

Details

Motivation: Existing methods have limitations: repair-based approaches fail to restore complex artifacts, while LiDAR-based methods rely on sparse and incomplete cues. There's a need for precise camera-controllable video generation with dense geometric guidance.

Method: Uses 3DGS renderings for explicit geometric guidance with a two-stage training paradigm: first stage uses camera poses for coarse control, second stage incorporates 3DGS renderings for fine-grained guidance. Also introduces a 3DGS-based cross-trajectory data curation strategy to eliminate train-test gaps in camera transformations.

Result: Developed the ParaDrive dataset with over 110K parallel-trajectory video pairs. Achieved state-of-the-art camera controllability and structural consistency in extensive experiments.

Conclusion: ReCamDriving successfully addresses limitations of previous methods by leveraging dense 3DGS renderings and a novel training approach, enabling precise camera-controllable video generation for novel driving trajectories.

Abstract: We propose ReCamDriving, a purely vision-based, camera-controlled novel-trajectory video generation framework. While repair-based methods fail to restore complex artifacts and LiDAR-based approaches rely on sparse and incomplete cues, ReCamDriving leverages dense and scene-complete 3DGS renderings for explicit geometric guidance, achieving precise camera-controllable generation. To mitigate overfitting to restoration behaviors when conditioned on 3DGS renderings, ReCamDriving adopts a two-stage training paradigm: the first stage uses camera poses for coarse control, while the second stage incorporates 3DGS renderings for fine-grained viewpoint and geometric guidance. Furthermore, we present a 3DGS-based cross-trajectory data curation strategy to eliminate the train-test gap in camera transformation patterns, enabling scalable multi-trajectory supervision from monocular videos. Based on this strategy, we construct the ParaDrive dataset, containing over 110K parallel-trajectory video pairs. Extensive experiments demonstrate that ReCamDriving achieves state-of-the-art camera controllability and structural consistency.

[358] Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation

Yunhong Lu, Yanhong Zeng, Haobo Li, Hao Ouyang, Qiuyu Wang, Ka Leong Cheng, Jiapeng Zhu, Hengyuan Cao, Zhipeng Zhang, Xing Zhu, Yujun Shen, Min Zhang

Main category: cs.CV

TL;DR: Reward Forcing: A novel framework for efficient streaming video generation that prevents initial frame copying and enhances motion dynamics through EMA-Sink tokens and Rewarded Distribution Matching Distillation.

Details

Motivation: Existing methods for streaming video generation use initial frames as sink tokens in sliding window attention, but this causes frames to become overly dependent on static tokens, resulting in copied initial frames and diminished motion dynamics.

Method: Two key designs: 1) EMA-Sink - maintains fixed-size tokens initialized from initial frames and continuously updated via exponential moving average as tokens exit the sliding window, capturing both long-term context and recent dynamics. 2) Rewarded Distribution Matching Distillation (Re-DMD) - biases model output distribution toward high-reward regions by prioritizing samples with greater dynamics rated by a vision-language model.

Result: Achieves state-of-the-art performance on standard benchmarks while enabling high-quality streaming video generation at 23.1 FPS on a single H100 GPU.

Conclusion: Reward Forcing effectively addresses the limitations of existing streaming video generation methods by preventing initial frame copying while maintaining long-horizon consistency and significantly enhancing motion quality through novel token management and distillation techniques.

Abstract: Efficient streaming video generation is critical for simulating interactive and dynamic worlds. Existing methods distill few-step video diffusion models with sliding window attention, using initial frames as sink tokens to maintain attention performance and reduce error accumulation. However, video frames become overly dependent on these static tokens, resulting in copied initial frames and diminished motion dynamics. To address this, we introduce Reward Forcing, a novel framework with two key designs. First, we propose EMA-Sink, which maintains fixed-size tokens initialized from initial frames and continuously updated by fusing evicted tokens via exponential moving average as they exit the sliding window. Without additional computation cost, EMA-Sink tokens capture both long-term context and recent dynamics, preventing initial frame copying while maintaining long-horizon consistency. Second, to better distill motion dynamics from teacher models, we propose a novel Rewarded Distribution Matching Distillation (Re-DMD). Vanilla distribution matching treats every training sample equally, limiting the model’s ability to prioritize dynamic content. Instead, Re-DMD biases the model’s output distribution toward high-reward regions by prioritizing samples with greater dynamics rated by a vision-language model. Re-DMD significantly enhances motion quality while preserving data fidelity. We include both quantitative and qualitative experiments to show that Reward Forcing achieves state-of-the-art performance on standard benchmarks while enabling high-quality streaming video generation at 23.1 FPS on a single H100 GPU.

[359] ZeBROD: Zero-Retraining Based Recognition and Object Detection Framework

Priyanto Hidayatullah, Nurjannah Syakrani, Yudi Widhiyasana, Muhammad Rizqi Sholahuddin, Refdinal Tubagus, Zahri Al Adzani Hidayat, Hanri Fajar Ramadhan, Dafa Alfarizki Pratama, Farhan Muhammad Yasin

Main category: cs.CV

TL;DR: ZeBROD is a zero-retraining object detection framework that combines YOLO11n for localization with DeIT and Proxy Anchor Loss for feature extraction, using cosine similarity with a vector database to avoid catastrophic forgetting when adding new products.

Details

Motivation: Address catastrophic forgetting in object detection where models need full retraining when new products are added, leading to high costs and time consumption, especially problematic in retail checkout with frequent product introductions.

Method: ZeBROD integrates YOLO11n for object localization, DeIT and Proxy Anchor Loss for feature extraction and metric learning, and uses cosine similarity between target product embeddings and a Qdrant vector database for classification without retraining.

Result: Achieves encouraging accuracy for both new and existing products in retail store case study with 140 products, achieves ~3x training time efficiency compared to classical approaches, and 580ms average inference time per image on edge device.

Conclusion: ZeBROD provides a practical solution to catastrophic forgetting in object detection, enabling efficient addition of new products without retraining, with validated feasibility for real-world retail applications on edge devices.

Abstract: Object detection constitutes the primary task within the domain of computer vision. It is utilized in numerous domains. Nonetheless, object detection continues to encounter the issue of catastrophic forgetting. The model must be retrained whenever new products are introduced, utilizing not only the new products dataset but also the entirety of the previous dataset. The outcome is obvious: increasing model training expenses and significant time consumption. In numerous sectors, particularly retail checkout, the frequent introduction of new products presents a great challenge. This study introduces Zero-Retraining Based Recognition and Object Detection (ZeBROD), a methodology designed to address the issue of catastrophic forgetting by integrating YOLO11n for object localization with DeIT and Proxy Anchor Loss for feature extraction and metric learning. For classification, we utilize cosine similarity between the embedding features of the target product and those in the Qdrant vector database. In a case study conducted in a retail store with 140 products, the experimental results demonstrate that our proposed framework achieves encouraging accuracy, whether for detecting new or existing products. Furthermore, without retraining, the training duration difference is significant. We achieve almost 3 times the training time efficiency compared to classical object detection approaches. This efficiency escalates as additional new products are added to the product database. The average inference time is 580 ms per image containing multiple products, on an edge device, validating the proposed framework’s feasibility for practical use.

[360] CoSPlan: Corrective Sequential Planning via Scene Graph Incremental Updates

Shresth Grover, Priyank Pathak, Akash Kumar, Vibhav Vineet, Yogesh S Rawat

Main category: cs.CV

TL;DR: CoSPlan benchmark evaluates VLMs on error-prone visual sequential planning tasks, revealing their limitations in error detection and correction. The proposed SGI method improves performance by 5.2% through intermediate reasoning steps.

Details

Motivation: Large-scale Vision-Language Models (VLMs) have strong reasoning capabilities but are unexplored in visual sequential planning, especially in handling non-optimal/erroneous steps. There's a need to evaluate VLMs' ability to detect and correct errors in multi-step action sequences toward goals.

Method: 1) Proposed CoSPlan benchmark with 4 domains (maze navigation, block rearrangement, image reconstruction, object reorganization) to evaluate Error Detection and Step Completion abilities. 2) Developed Scene Graph Incremental updates (SGI), a training-free method that introduces intermediate reasoning steps between initial and goal states to help VLMs reason about sequences.

Result: State-of-the-art VLMs (Intern-VLM, Qwen2) struggle on CoSPlan even with Chain-of-Thought and Scene Graphs. SGI method yields average 5.2% performance gain, enhancing reliability in corrective sequential planning and generalizing to traditional planning tasks like Plan-Bench and VQA.

Conclusion: VLMs have limitations in error-prone visual sequential planning despite strong reasoning capabilities. The proposed SGI method effectively improves performance by enabling better sequential reasoning through intermediate steps, demonstrating potential for more reliable planning systems.

Abstract: Large-scale Vision-Language Models (VLMs) exhibit impressive complex reasoning capabilities but remain largely unexplored in visual sequential planning, i.e., executing multi-step actions towards a goal. Additionally, practical sequential planning often involves non-optimal (erroneous) steps, challenging VLMs to detect and correct such steps. We propose Corrective Sequential Planning Benchmark (CoSPlan) to evaluate VLMs in error-prone, vision-based sequential planning tasks across 4 domains: maze navigation, block rearrangement, image reconstruction,and object reorganization. CoSPlan assesses two key abilities: Error Detection (identifying non-optimal action) and Step Completion (correcting and completing action sequences to reach the goal). Despite using state-of-the-art reasoning techniques such as Chain-of-Thought and Scene Graphs, VLMs (e.g. Intern-VLM and Qwen2) struggle on CoSPlan, failing to leverage contextual cues to reach goals. Addressing this, we propose a novel training-free method, Scene Graph Incremental updates (SGI), which introduces intermediate reasoning steps between the initial and goal states. SGI helps VLMs reason about sequences, yielding an average performance gain of 5.2%. In addition to enhancing reliability in corrective sequential planning, SGI generalizes to traditional planning tasks such as Plan-Bench and VQA. Project Page : https://shroglck.github.io/cos_plan/

[361] Breaking the Vicious Cycle: Coherent 3D Gaussian Splatting from Sparse and Motion-Blurred Views

Zhankuo Xu, Chaoran Feng, Yingtao Li, Jianbin Zhao, Jiashu Yang, Wangbo Yu, Li Yuan, Yonghong Tian

Main category: cs.CV

TL;DR: CoherentGS is a novel framework that enables high-fidelity 3D reconstruction from sparse and blurry images by combining deblurring and diffusion priors, breaking the vicious cycle between sparse views and motion blur.

Details

Motivation: 3D Gaussian Splatting (3DGS) requires dense, high-quality input images, but real-world applications often have sparse and motion-blurred data. This creates a vicious cycle where sparse views can't resolve motion blur, and motion blur erases details needed for view alignment, leading to catastrophic reconstruction failures.

Method: CoherentGS uses a dual-prior strategy combining: 1) a specialized deblurring network for restoring sharp details and photometric guidance, and 2) a diffusion model providing geometric priors to fill unobserved regions. Additional techniques include consistency-guided camera exploration and depth regularization loss for geometric plausibility.

Result: The method was evaluated on synthetic and real-world scenes with as few as 3, 6, and 9 input views. CoherentGS significantly outperforms existing methods, setting a new state-of-the-art for 3D reconstruction from sparse and blurry images.

Conclusion: CoherentGS successfully breaks the vicious cycle between sparse views and motion blur in 3D reconstruction, enabling high-fidelity results from challenging real-world input data through its innovative dual-prior approach.

Abstract: 3D Gaussian Splatting (3DGS) has emerged as a state-of-the-art method for novel view synthesis. However, its performance heavily relies on dense, high-quality input imagery, an assumption that is often violated in real-world applications, where data is typically sparse and motion-blurred. These two issues create a vicious cycle: sparse views ignore the multi-view constraints necessary to resolve motion blur, while motion blur erases high-frequency details crucial for aligning the limited views. Thus, reconstruction often fails catastrophically, with fragmented views and a low-frequency bias. To break this cycle, we introduce CoherentGS, a novel framework for high-fidelity 3D reconstruction from sparse and blurry images. Our key insight is to address these compound degradations using a dual-prior strategy. Specifically, we combine two pre-trained generative models: a specialized deblurring network for restoring sharp details and providing photometric guidance, and a diffusion model that offers geometric priors to fill in unobserved regions of the scene. This dual-prior strategy is supported by several key techniques, including a consistency-guided camera exploration module that adaptively guides the generative process, and a depth regularization loss that ensures geometric plausibility. We evaluate CoherentGS through both quantitative and qualitative experiments on synthetic and real-world scenes, using as few as 3, 6, and 9 input views. Our results demonstrate that CoherentGS significantly outperforms existing methods, setting a new state-of-the-art for this challenging task. The code and video demos are available at https://potatobigroom.github.io/CoherentGS/.

Wen-Jue He, Xiaofeng Zhu, Zheng Zhang

Main category: cs.CV

TL;DR: A novel Cross-modal Prompting (ComP) method for incomplete multi-modal emotion recognition that enhances modality-specific features and improves recognition accuracy by addressing performance gaps and modality under-optimization problems in missing data scenarios.

Details

Motivation: Incomplete multi-modal emotion recognition faces challenges with performance gaps and modality under-optimization problems that are exacerbated by missing data, hindering effective multi-modal learning despite the potential of multi-source data to provide abundant information.

Method: Proposes Cross-modal Prompting (ComP) with three key components: 1) progressive prompt generation module with dynamic gradient modulator for concise modality semantic cues, 2) cross-modal knowledge propagation to amplify consistent information using prompts, and 3) a coordinator for dynamic re-weighting of modality outputs as a complement to balance strategy.

Result: Extensive experiments on 4 datasets with 7 state-of-the-art methods under different missing rates validate the effectiveness of the proposed method, demonstrating improved recognition accuracy.

Conclusion: The Cross-modal Prompting method successfully addresses the challenges of incomplete multi-modal emotion recognition by enhancing modality-specific features and improving overall recognition accuracy through coherent information emphasis and performance boosting of each modality.

Abstract: Incomplete multi-modal emotion recognition (IMER) aims at understanding human intentions and sentiments by comprehensively exploring the partially observed multi-source data. Although the multi-modal data is expected to provide more abundant information, the performance gap and modality under-optimization problem hinder effective multi-modal learning in practice, and are exacerbated in the confrontation of the missing data. To address this issue, we devise a novel Cross-modal Prompting (ComP) method, which emphasizes coherent information by enhancing modality-specific features and improves the overall recognition accuracy by boosting each modality’s performance. Specifically, a progressive prompt generation module with a dynamic gradient modulator is proposed to produce concise and consistent modality semantic cues. Meanwhile, cross-modal knowledge propagation selectively amplifies the consistent information in modality features with the delivered prompts to enhance the discrimination of the modality-specific output. Additionally, a coordinator is designed to dynamically re-weight the modality outputs as a complement to the balance strategy to improve the model’s efficacy. Extensive experiments on 4 datasets with 7 SOTA methods under different missing rates validate the effectiveness of our proposed method.

[363] USTM: Unified Spatial and Temporal Modeling for Continuous Sign Language Recognition

Ahmed Abul Hasanaath, Hamzah Luqman

Main category: cs.CV

TL;DR: USTM is a unified spatio-temporal transformer framework for continuous sign language recognition that combines Swin Transformer with temporal adapters to capture fine-grained spatial features and long-range temporal dependencies from RGB videos alone.

Details

Motivation: Existing CSLR methods using CNN backbones with temporal convolutions/RNNs fail to capture fine-grained hand/facial cues and long-range temporal dependencies, requiring multi-stream or multi-modal inputs for good performance.

Method: Proposes USTM framework with Swin Transformer spatial backbone enhanced with lightweight temporal adapter with positional embeddings (TAPE) to model both spatial and temporal features in a unified encoder.

Result: Achieves state-of-the-art performance on PHOENIX14, PHOENIX14T, and CSL-Daily datasets against RGB-based and multi-modal approaches, competitive with multi-stream methods using only RGB input.

Conclusion: USTM demonstrates effective unified spatio-temporal modeling for CSLR, capturing fine-grained features and temporal dependencies without needing multi-stream or auxiliary modalities, offering strong performance from RGB videos alone.

Abstract: Continuous sign language recognition (CSLR) requires precise spatio-temporal modeling to accurately recognize sequences of gestures in videos. Existing frameworks often rely on CNN-based spatial backbones combined with temporal convolution or recurrent modules. These techniques fail in capturing fine-grained hand and facial cues and modeling long-range temporal dependencies. To address these limitations, we propose the Unified Spatio-Temporal Modeling (USTM) framework, a spatio-temporal encoder that effectively models complex patterns using a combination of a Swin Transformer backbone enhanced with lightweight temporal adapter with positional embeddings (TAPE). Our framework captures fine-grained spatial features alongside short and long-term temporal context, enabling robust sign language recognition from RGB videos without relying on multi-stream inputs or auxiliary modalities. Extensive experiments on benchmarked datasets including PHOENIX14, PHOENIX14T, and CSL-Daily demonstrate that USTM achieves state-of-the-art performance against RGB-based as well as multi-modal CSLR approaches, while maintaining competitive performance against multi-stream approaches. These results highlight the strength and efficacy of the USTM framework for CSLR. The code is available at https://github.com/gufranSabri/USTM

[364] GRAN-TED: Generating Robust, Aligned, and Nuanced Text Embedding for Diffusion Models

Bozhou Li, Sihan Yang, Yushuo Guan, Ruichuan An, Xinlong Chen, Yang Shi, Pengfei Wan, Wentao Zhang, Yuanxing zhang

Main category: cs.CV

TL;DR: GRAN-TED introduces a paradigm for better text encoders in diffusion models, featuring TED-6K benchmark for efficient evaluation and a two-stage training method that improves text-to-image/video generation.

Details

Motivation: Current text encoders for diffusion models face two major challenges: lack of efficient evaluation frameworks that predict downstream generation performance, and difficulty adapting pretrained language models for visual synthesis.

Method: Two main contributions: 1) TED-6K benchmark - a text-only evaluation framework with lightweight unified adapter for efficient encoder assessment (750× faster than training diffusion models from scratch); 2) Two-stage training paradigm - initial fine-tuning on Multimodal Large Language Model for better visual representation, followed by layer-wise weighting to extract nuanced text features.

Result: TED-6K performance strongly correlates with downstream generation effectiveness. GRAN-TED encoder achieves state-of-the-art on TED-6K and demonstrates performance gains in text-to-image and text-to-video generation tasks.

Conclusion: GRAN-TED provides an efficient evaluation framework (TED-6K) and superior text encoder training method that addresses key limitations in text encoder development for diffusion models, enabling better semantic fidelity in generated content.

Abstract: The text encoder is a critical component of text-to-image and text-to-video diffusion models, fundamentally determining the semantic fidelity of the generated content. However, its development has been hindered by two major challenges: the lack of an efficient evaluation framework that reliably predicts downstream generation performance, and the difficulty of effectively adapting pretrained language models for visual synthesis. To address these issues, we introduce GRAN-TED, a paradigm to Generate Robust, Aligned, and Nuanced Text Embeddings for Diffusion models. Our contribution is twofold. First, we propose TED-6K, a novel text-only benchmark that enables efficient and robust assessment of an encoder’s representational quality without requiring costly end-to-end model training. We demonstrate that performance on TED-6K, standardized via a lightweight, unified adapter, strongly correlates with an encoder’s effectiveness in downstream generation tasks. Notably, under our experimental setup, compared with training a diffusion model from scratch, evaluating with TED-6K is about \textbf{750$\times$ faster}. Second, guided by this validated framework, we develop a superior text encoder using a novel two-stage training paradigm. This process involves an initial fine-tuning stage on a Multimodal Large Language Model for better visual representation, followed by a layer-wise weighting method to extract more nuanced and potent text features. Our experiments show that the resulting GRAN-TED encoder not only achieves state-of-the-art performance on TED-6K but also leads to demonstrable performance gains in text-to-image and text-to-video generation. Our TED-6K dataset and evaluation code are available at the following link: https://anonymous.4open.science/r/GRAN-TED-4FCC/.

[365] Geometric Disentanglement of Text Embeddings for Subject-Consistent Text-to-Image Generation using A Single Prompt

Shangxun Li, Youngjung Uh

Main category: cs.CV

TL;DR: A training-free approach that refines text embeddings from a geometric perspective to improve subject consistency in text-to-image diffusion models without fine-tuning.

Details

Motivation: Text-to-image diffusion models struggle with preserving subject consistency across multiple outputs for visual storytelling. Existing approaches require computationally expensive fine-tuning or per-subject optimization, while training-free methods like 1Prompt1Story suffer from semantic leakage and text misalignment.

Method: A simple training-free approach that refines text embeddings from a geometric perspective to suppress unwanted semantics and address semantic entanglement, improving both subject consistency and text alignment.

Result: Extensive experiments show the approach significantly improves both subject consistency and text alignment over existing baselines.

Conclusion: The proposed geometric refinement of text embeddings provides an effective training-free solution for improving subject consistency in text-to-image diffusion models for visual storytelling applications.

Abstract: Text-to-image diffusion models excel at generating high-quality images from natural language descriptions but often fail to preserve subject consistency across multiple outputs, limiting their use in visual storytelling. Existing approaches rely on model fine-tuning or image conditioning, which are computationally expensive and require per-subject optimization. 1Prompt1Story, a training-free approach, concatenates all scene descriptions into a single prompt and rescales token embeddings, but it suffers from semantic leakage, where embeddings across frames become entangled, causing text misalignment. In this paper, we propose a simple yet effective training-free approach that addresses semantic entanglement from a geometric perspective by refining text embeddings to suppress unwanted semantics. Extensive experiments prove that our approach significantly improves both subject consistency and text alignment over existing baselines.

[366] MACE-Dance: Motion-Appearance Cascaded Experts for Music-Driven Dance Video Generation

Kaixing Yang, Jiashu Zhu, Xulong Tang, Ziqiao Peng, Xiangyue Zhang, Puwei Wang, Jiahong Wu, Xiangxiang Chu, Hongyan Liu, Jun He

Main category: cs.CV

TL;DR: MACE-Dance is a music-driven dance video generation framework using cascaded Mixture-of-Experts that achieves SOTA performance by separating motion generation and appearance synthesis.

Details

Motivation: Existing methods for music-driven 3D dance generation, pose-driven image animation, and audio-driven talking-head synthesis cannot be directly adapted to music-driven dance video generation, and current approaches struggle to achieve both high-quality visual appearance and realistic human motion simultaneously.

Method: A cascaded Mixture-of-Experts framework with two specialized components: (1) Motion Expert uses a diffusion model with BiMamba-Transformer hybrid architecture and Guidance-Free Training for music-to-3D motion generation, and (2) Appearance Expert uses decoupled kinematic-aesthetic fine-tuning for motion- and reference-conditioned video synthesis.

Result: Achieves state-of-the-art performance in both 3D dance generation and pose-driven image animation, with comprehensive evaluation on a newly curated large-scale dataset using a motion-appearance evaluation protocol.

Conclusion: MACE-Dance successfully addresses the challenge of joint high-quality visual appearance and realistic motion in music-driven dance video generation through its cascaded expert architecture, establishing a new benchmark for the field.

Abstract: With the rise of online dance-video platforms and rapid advances in AI-generated content (AIGC), music-driven dance generation has emerged as a compelling research direction. Despite substantial progress in related domains such as music-driven 3D dance generation, pose-driven image animation, and audio-driven talking-head synthesis, existing methods cannot be directly adapted to this task. Moreover, the limited studies in this area still struggle to jointly achieve high-quality visual appearance and realistic human motion. Accordingly, we present MACE-Dance, a music-driven dance video generation framework with cascaded Mixture-of-Experts (MoE). The Motion Expert performs music-to-3D motion generation while enforcing kinematic plausibility and artistic expressiveness, whereas the Appearance Expert carries out motion- and reference-conditioned video synthesis, preserving visual identity with spatiotemporal coherence. Specifically, the Motion Expert adopts a diffusion model with a BiMamba-Transformer hybrid architecture and a Guidance-Free Training (GFT) strategy, achieving state-of-the-art (SOTA) performance in 3D dance generation. The Appearance Expert employs a decoupled kinematic-aesthetic fine-tuning strategy, achieving state-of-the-art (SOTA) performance in pose-driven image animation. To better benchmark this task, we curate a large-scale and diverse dataset and design a motion-appearance evaluation protocol. Based on this protocol, MACE-Dance also achieves state-of-the-art performance. Project page: https://macedance.github.io/

Meng Chu, Senqiao Yang, Haoxuan Che, Suiyun Zhang, Xichen Zhang, Shaozuo Yu, Haokun Gui, Zhefan Rao, Dandan Tu, Rui Liu, Jiaya Jia

Main category: cs.CV

TL;DR: VisionDirector is a training-free vision-language supervisor that improves generative models’ ability to handle long, multi-goal prompts by extracting structured goals, dynamically planning edit strategies, and using semantic verification with rollback.

Details

Motivation: Current generative models struggle with long, multi-goal prompts that professional designers use, missing localized edits and failing to satisfy complex requirements.

Method: VisionDirector extracts structured goals from long instructions, dynamically decides between one-shot generation vs staged edits, runs micro-grid sampling with semantic verification and rollback after every edit, and logs goal-level rewards. Also uses Group Relative Policy Optimization to fine-tune the planner.

Result: Achieves new SOTA on GenEval (+7% overall) and ImgEdit (+0.07 absolute), with shorter edit trajectories (3.1 vs 4.2 steps) and consistent qualitative improvements on typography, multi-object scenes, and pose editing.

Conclusion: VisionDirector effectively addresses the brittleness of current generative pipelines for long, multi-goal prompts through structured goal extraction, dynamic planning, and verification mechanisms.

Abstract: Generative models can now produce photorealistic imagery, yet they still struggle with the long, multi-goal prompts that professional designers issue. To expose this gap and better evaluate models’ performance in real-world settings, we introduce Long Goal Bench (LGBench), a 2,000-task suite (1,000 T2I and 1,000 I2I) whose average instruction contains 18 to 22 tightly coupled goals spanning global layout, local object placement, typography, and logo fidelity. We find that even state-of-the-art models satisfy fewer than 72 percent of the goals and routinely miss localized edits, confirming the brittleness of current pipelines. To address this, we present VisionDirector, a training-free vision-language supervisor that (i) extracts structured goals from long instructions, (ii) dynamically decides between one-shot generation and staged edits, (iii) runs micro-grid sampling with semantic verification and rollback after every edit, and (iv) logs goal-level rewards. We further fine-tune the planner with Group Relative Policy Optimization, yielding shorter edit trajectories (3.1 versus 4.2 steps) and stronger alignment. VisionDirector achieves new state of the art on GenEval (plus 7 percent overall) and ImgEdit (plus 0.07 absolute) while producing consistent qualitative improvements on typography, multi-object scenes, and pose editing.

[368] From Indoor to Open World: Revealing the Spatial Reasoning Gap in MLLMs

Mingrui Wu, Zhaozhi Wang, Fangjinhua Wang, Jiaolong Yang, Marc Pollefeys, Tong Zhang

Main category: cs.CV

TL;DR: MLLMs lack spatial intelligence, so researchers created a new benchmark with outdoor pedestrian-view videos using stereo cameras, LiDAR, and IMU/GPS to generate precise 3D spatial reasoning questions, revealing MLLMs rely on linguistic priors rather than visual reasoning.

Details

Motivation: Current MLLMs have impressive semantic capabilities but underdeveloped spatial intelligence, which is crucial for robust and grounded AI systems. Existing benchmarks are inadequate because they either focus on simplified qualitative reasoning or use domain-specific indoor data, lacking outdoor datasets with verifiable metric ground truth.

Method: Created a large-scale benchmark using pedestrian-perspective videos captured with synchronized stereo cameras, LiDAR, and IMU/GPS sensors. This provides metrically precise 3D information, enabling automatic generation of spatial reasoning questions across a hierarchical spectrum from qualitative relational reasoning to quantitative metric and kinematic understanding.

Result: Performance gains observed in structured indoor benchmarks vanish in open-world settings. Analysis using synthetic abnormal scenes and blinding tests confirms that current MLLMs depend heavily on linguistic priors instead of grounded visual reasoning.

Conclusion: The new benchmark provides a principled platform for diagnosing MLLMs’ spatial intelligence limitations and advancing physically grounded spatial reasoning capabilities.

Abstract: While Multimodal Large Language Models (MLLMs) have achieved impressive performance on semantic tasks, their spatial intelligence–crucial for robust and grounded AI systems–remains underdeveloped. Existing benchmarks fall short of diagnosing this limitation: they either focus on overly simplified qualitative reasoning or rely on domain-specific indoor data, constrained by the lack of outdoor datasets with verifiable metric ground truth. To bridge this gap, we introduce a large-scale benchmark built from pedestrian-perspective videos captured with synchronized stereo cameras, LiDAR, and IMU/GPS sensors. This dataset provides metrically precise 3D information, enabling the automatic generation of spatial reasoning questions that span a hierarchical spectrum–from qualitative relational reasoning to quantitative metric and kinematic understanding. Evaluations reveal that the performance gains observed in structured indoor benchmarks vanish in open-world settings. Further analysis using synthetic abnormal scenes and blinding tests confirms that current MLLMs depend heavily on linguistic priors instead of grounded visual reasoning. Our benchmark thus provides a principled platform for diagnosing these limitations and advancing physically grounded spatial intelligence.

[369] Learning to Refocus with Video Diffusion Models

SaiKiran Tedla, Zhoutong Zhang, Xuaner Zhang, Shumian Xin

Main category: cs.CV

TL;DR: A novel method for realistic post-capture refocusing using video diffusion models that generates focal stacks from single defocused images, enabling interactive focus adjustment.

Details

Motivation: Autofocus systems often fail to capture intended subjects, and users frequently want to adjust focus after capture, but current methods lack realistic post-capture refocusing capabilities.

Method: Uses video diffusion models to generate perceptually accurate focal stacks (represented as video sequences) from single defocused images, enabling interactive refocusing and downstream applications.

Result: Consistently outperforms existing approaches in both perceptual quality and robustness across challenging scenarios, with a large-scale focal stack dataset released for real-world smartphone conditions.

Conclusion: The method paves the way for more advanced focus-editing capabilities in everyday photography, with code and data publicly available.

Abstract: Focus is a cornerstone of photography, yet autofocus systems often fail to capture the intended subject, and users frequently wish to adjust focus after capture. We introduce a novel method for realistic post-capture refocusing using video diffusion models. From a single defocused image, our approach generates a perceptually accurate focal stack, represented as a video sequence, enabling interactive refocusing and unlocking a range of downstream applications. We release a large-scale focal stack dataset acquired under diverse real-world smartphone conditions to support this work and future research. Our method consistently outperforms existing approaches in both perceptual quality and robustness across challenging scenarios, paving the way for more advanced focus-editing capabilities in everyday photography. Code and data are available at www.learn2refocus.github.io

[370] VALLR-Pin: Uncertainty-Factorized Visual Speech Recognition for Mandarin with Pinyin Guidance

Chang Sun, Dongliang Xie, Wanpeng Xie, Bo Qin, Hong Yang

Main category: cs.CV

TL;DR: VALLR-Pin is a two-stage Mandarin VSR framework that uses Pinyin as intermediate representation and LLM refinement to handle viseme ambiguity and homophones.

Details

Motivation: Mandarin VSR is challenging due to severe viseme ambiguity and pervasive homophones, requiring better approaches to handle these linguistic complexities.

Method: Two-stage framework: 1) Shared visual encoder with dual decoders predicting characters and Pinyin jointly; 2) LLM-based refinement using predicted Pinyin and character hypotheses to resolve homophone ambiguities, fine-tuned on synthetic instruction data.

Result: Consistent improvement in transcription accuracy on public Mandarin VSR benchmarks under multi-speaker conditions, demonstrating effectiveness of phonetic guidance with lightweight LLM refinement.

Conclusion: Combining phonetic guidance (Pinyin) with LLM-based refinement effectively addresses Mandarin VSR challenges, providing a robust solution for viseme ambiguity and homophone resolution.

Abstract: Visual speech recognition (VSR) aims to transcribe spoken content from silent lip-motion videos and is particularly challenging in Mandarin due to severe viseme ambiguity and pervasive homophones. We propose VALLR-Pin, a two-stage Mandarin VSR framework that extends the VALLR architecture by explicitly incorporating Pinyin as an intermediate representation. In the first stage, a shared visual encoder feeds dual decoders that jointly predict Mandarin characters and their corresponding Pinyin sequences, encouraging more robust visual-linguistic representations. In the second stage, an LLM-based refinement module takes the predicted Pinyin sequence together with an N-best list of character hypotheses to resolve homophone-induced ambiguities. To further adapt the LLM to visual recognition errors, we fine-tune it on synthetic instruction data constructed from model-generated Pinyin-text pairs, enabling error-aware correction. Experiments on public Mandarin VSR benchmarks demonstrate that VALLR-Pin consistently improves transcription accuracy under multi-speaker conditions, highlighting the effectiveness of combining phonetic guidance with lightweight LLM refinement.

[371] MVInverse: Feed-forward Multi-view Inverse Rendering in Seconds

Xiangzuo Wu, Chengwei Ren, Jun Zhou, Xiu Li, Yuan Liu

Main category: cs.CV

TL;DR: A feed-forward multi-view inverse rendering framework that predicts scene properties (albedo, materials, shading, normals) from RGB image sequences with cross-view attention, plus consistency-based finetuning for real-world generalization.

Details

Motivation: Existing single-view methods ignore cross-view consistency, while multi-view optimization methods are computationally expensive and slow. There's also a generalization gap between synthetic training data and real-world scenes.

Method: Feed-forward model with alternating attention across views to capture intra-view lighting interactions and inter-view material consistency. Uses consistency-based finetuning on unlabeled real-world videos to improve generalization.

Result: Achieves state-of-the-art performance in multi-view consistency, material/normal estimation quality, and generalization to real-world imagery on benchmark datasets.

Conclusion: Proposed framework enables efficient, consistent multi-view inverse rendering with strong real-world generalization through novel attention mechanisms and consistency-based finetuning.

Abstract: Multi-view inverse rendering aims to recover geometry, materials, and illumination consistently across multiple viewpoints. When applied to multi-view images, existing single-view approaches often ignore cross-view relationships, leading to inconsistent results. In contrast, multi-view optimization methods rely on slow differentiable rendering and per-scene refinement, making them computationally expensive and hard to scale. To address these limitations, we introduce a feed-forward multi-view inverse rendering framework that directly predicts spatially varying albedo, metallic, roughness, diffuse shading, and surface normals from sequences of RGB images. By alternating attention across views, our model captures both intra-view long-range lighting interactions and inter-view material consistency, enabling coherent scene-level reasoning within a single forward pass. Due to the scarcity of real-world training data, models trained on existing synthetic datasets often struggle to generalize to real-world scenes. To overcome this limitation, we propose a consistency-based finetuning strategy that leverages unlabeled real-world videos to enhance both multi-view coherence and robustness under in-the-wild conditions. Extensive experiments on benchmark datasets demonstrate that our method achieves state-of-the-art performance in terms of multi-view consistency, material and normal estimation quality, and generalization to real-world imagery. Project page: https://maddog241.github.io/mvinverse-page/

[372] Efficient and Robust Video Defense Framework against 3D-field Personalized Talking Face

Rui-qing Sun, Xingshan Yao, Tian Lan, Jia-Ling Shi, Chen-Hao Cui, Hui-Yang Zhao, Zhijing Wu, Chen Yang, Xian-Ling Mao

Main category: cs.CV

TL;DR: A novel video defense framework that protects portrait videos against 3D-field talking face generation attacks by perturbing 3D information acquisition while maintaining video quality, achieving 47x speedup over baselines.

Details

Motivation: State-of-the-art 3D-field talking face generation methods can synthesize realistic talking face videos from reference portraits, raising serious privacy concerns about malicious misuse. Existing image-based defenses are inefficient, computationally expensive, degrade video quality, and fail to disrupt 3D information needed for video protection.

Method: Proposes an efficient video defense framework that protects portrait videos by perturbing the 3D information acquisition process. Key innovations include: (1) similarity-guided parameter sharing mechanism for computational efficiency, and (2) multi-scale dual-domain attention module to jointly optimize spatial-frequency perturbations.

Result: Extensive experiments show strong defense capability with 47x acceleration over the fastest baseline while maintaining high fidelity. The framework remains robust against scaling operations and state-of-the-art purification attacks, with design choices validated through ablation studies.

Conclusion: The proposed framework effectively addresses the privacy threat of 3D-field talking face generation by providing an efficient, high-fidelity defense solution that protects portrait videos while maintaining computational efficiency and robustness against attacks.

Abstract: State-of-the-art 3D-field video-referenced Talking Face Generation (TFG) methods synthesize high-fidelity personalized talking-face videos in real time by modeling 3D geometry and appearance from reference portrait video. This capability raises significant privacy concerns regarding malicious misuse of personal portraits. However, no efficient defense framework exists to protect such videos against 3D-field TFG methods. While image-based defenses could apply per-frame 2D perturbations, they incur prohibitive computational costs, severe video quality degradation, failing to disrupt 3D information for video protection. To address this, we propose a novel and efficient video defense framework against 3D-field TFG methods, which protects portrait video by perturbing the 3D information acquisition process while maintain high-fidelity video quality. Specifically, our method introduces: (1) a similarity-guided parameter sharing mechanism for computational efficiency, and (2) a multi-scale dual-domain attention module to jointly optimize spatial-frequency perturbations. Extensive experiments demonstrate that our proposed framework exhibits strong defense capability and achieves a 47x acceleration over the fastest baseline while maintaining high fidelity. Moreover, it remains robust against scaling operations and state-of-the-art purification attacks, and the effectiveness of our design choices is further validated through ablation studies. Our project is available at https://github.com/Richen7418/VDF.

[373] UniPR-3D: Towards Universal Visual Place Recognition with Visual Geometry Grounded Transformer

Tianchen Deng, Xun Chen, Ziming Li, Hongming Shen, Danwei Wang, Javier Civera, Hesheng Wang

Main category: cs.CV

TL;DR: UniPR-3D introduces a novel multi-view Visual Place Recognition architecture that combines 2D and 3D features from VGGT backbone with dedicated aggregation modules, achieving state-of-the-art performance.

Details

Motivation: Traditional VPR uses single-image retrieval, but multi-view approaches offer advantages yet remain underexplored and struggle with generalization across diverse environments.

Method: Builds on VGGT backbone for multi-view 3D representations, designs feature aggregators for both 2D and 3D tokens, incorporates single- and multi-frame aggregation schemes, and uses variable-length sequence retrieval strategy.

Result: UniPR-3D sets new state-of-the-art, outperforming both single- and multi-view baselines, demonstrating effectiveness of geometry-grounded tokens for VPR.

Conclusion: The proposed architecture effectively integrates multi-view information for VPR, highlighting the value of combining 2D texture cues with 3D viewpoint reasoning, with code and models to be publicly released.

Abstract: Visual Place Recognition (VPR) has been traditionally formulated as a single-image retrieval task. Using multiple views offers clear advantages, yet this setting remains relatively underexplored and existing methods often struggle to generalize across diverse environments. In this work we introduce UniPR-3D, the first VPR architecture that effectively integrates information from multiple views. UniPR-3D builds on a VGGT backbone capable of encoding multi-view 3D representations, which we adapt by designing feature aggregators and fine-tune for the place recognition task. To construct our descriptor, we jointly leverage the 3D tokens and intermediate 2D tokens produced by VGGT. Based on their distinct characteristics, we design dedicated aggregation modules for 2D and 3D features, allowing our descriptor to capture fine-grained texture cues while also reasoning across viewpoints. To further enhance generalization, we incorporate both single- and multi-frame aggregation schemes, along with a variable-length sequence retrieval strategy. Our experiments show that UniPR-3D sets a new state of the art, outperforming both single- and multi-view baselines and highlighting the effectiveness of geometry-grounded tokens for VPR. Our code and models will be made publicly available on Github https://github.com/dtc111111/UniPR-3D.

[374] Omni-Weather: Unified Multimodal Foundation Model for Weather Generation and Understanding

Zhiwang Zhou, Yuandong Pu, Xuming He, Yidi Liu, Yixin Chen, Junchao Gong, Xiang Zhuang, Wanghan Xu, Qinglong Cao, Shixiang Tang, Yihao Liu, Wenlong Zhang, Lei Bai

Main category: cs.CV

TL;DR: Omni-Weather is the first multimodal foundation model that unifies weather generation and understanding in a single architecture, achieving SOTA performance in both tasks through shared processing and causal reasoning.

Details

Motivation: Existing weather modeling methods treat accurate prediction and mechanistic interpretation in isolation, separating generation from understanding. There's a need for a unified approach that addresses both goals simultaneously.

Method: Omni-Weather integrates a radar encoder for weather generation tasks with unified processing using a shared self-attention mechanism. The authors also construct a Chain-of-Thought dataset for causal reasoning in weather generation to enable interpretable outputs.

Result: Extensive experiments show Omni-Weather achieves state-of-the-art performance in both weather generation and understanding. The model demonstrates that generative and understanding tasks in weather domain can mutually enhance each other.

Conclusion: Omni-Weather demonstrates the feasibility and value of unifying weather generation and understanding within a single multimodal foundation model, showing these tasks can be mutually beneficial rather than treated separately.

Abstract: Weather modeling requires both accurate prediction and mechanistic interpretation, yet existing methods treat these goals in isolation, separating generation from understanding. To address this gap, we present Omni-Weather, the first multimodal foundation model that unifies weather generation and understanding within a single architecture. Omni-Weather integrates a radar encoder for weather generation tasks, followed by unified processing using a shared self-attention mechanism. Moreover, we construct a Chain-of-Thought dataset for causal reasoning in weather generation, enabling interpretable outputs and improved perceptual quality. Extensive experiments show Omni-Weather achieves state-of-the-art performance in both weather generation and understanding. Our findings further indicate that generative and understanding tasks in the weather domain can mutually enhance each other. Omni-Weather also demonstrates the feasibility and value of unifying weather generation and understanding.

[375] RAPTOR: Real-Time High-Resolution UAV Video Prediction with Efficient Video Attention

Zhan Chen, Zile Guo, Enze Zhu, Peirong Zhang, Xiaoxuan Liu, Lei Wang, Yidan Zhang

Main category: cs.CV

TL;DR: RAPTOR is a real-time video prediction architecture that breaks the resolution-quality-speed trilemma using Efficient Video Attention (EVA) for O(S+T) complexity, enabling 30+ FPS at 512² resolution on edge hardware.

Details

Motivation: Video prediction faces a fundamental trilemma: high-resolution and perceptual quality typically sacrifice real-time speed, which is critical for latency-sensitive applications like autonomous UAVs in dense urban environments where safety depends on foreseeing events from high-resolution imagery.

Method: RAPTOR uses a single-pass design with Efficient Video Attention (EVA), a novel translator module that factorizes spatiotemporal modeling by alternating operations along spatial (S) and temporal (T) axes, reducing complexity from O((ST)²) to O(S+T). It also employs a 3-stage training curriculum that progressively refines predictions from coarse structure to sharp details.

Result: RAPTOR achieves over 30 FPS on Jetson AGX Orin for 512² video, setting new SOTA on UAVid, KTH, and custom high-resolution datasets in PSNR, SSIM, and LPIPS. It boosts mission success rate in real-world UAV navigation by 18%.

Conclusion: RAPTOR breaks the long-standing trade-off in video prediction, enabling real-time high-resolution performance on edge hardware, paving the way for safer and more anticipatory embodied agents like autonomous UAVs.

Abstract: Video prediction is plagued by a fundamental trilemma: achieving high-resolution and perceptual quality typically comes at the cost of real-time speed, hindering its use in latency-critical applications. This challenge is most acute for autonomous UAVs in dense urban environments, where foreseeing events from high-resolution imagery is non-negotiable for safety. Existing methods, reliant on iterative generation (diffusion, autoregressive models) or quadratic-complexity attention, fail to meet these stringent demands on edge hardware. To break this long-standing trade-off, we introduce RAPTOR, a video prediction architecture that achieves real-time, high-resolution performance. RAPTOR’s single-pass design avoids the error accumulation and latency of iterative approaches. Its core innovation is Efficient Video Attention (EVA), a novel translator module that factorizes spatiotemporal modeling. Instead of processing flattened spacetime tokens with $O((ST)^2)$ or $O(ST)$ complexity, EVA alternates operations along the spatial (S) and temporal (T) axes. This factorization reduces the time complexity to $O(S + T)$ and memory complexity to $O(max(S, T))$, enabling global context modeling at $512^2$ resolution and beyond, operating directly on dense feature maps with a patch-free design. Complementing this architecture is a 3-stage training curriculum that progressively refines predictions from coarse structure to sharp, temporally coherent details. Experiments show RAPTOR is the first predictor to exceed 30 FPS on a Jetson AGX Orin for $512^2$ video, setting a new state-of-the-art on UAVid, KTH, and a custom high-resolution dataset in PSNR, SSIM, and LPIPS. Critically, RAPTOR boosts the mission success rate in a real-world UAV navigation task by 18%, paving the way for safer and more anticipatory embodied agents.

[376] Knot Forcing: Taming Autoregressive Video Diffusion Models for Real-time Infinite Interactive Portrait Animation

Steven Xiao, Xindi Zhang, Dechao Meng, Qi Wang, Peng Zhang, Bang Zhang

Main category: cs.CV

TL;DR: Knot Forcing is a streaming framework for real-time portrait animation that achieves high visual fidelity, temporal coherence, and ultra-low latency through chunk-wise generation with KV caching, temporal knot modules for smooth transitions, and a “running ahead” mechanism for long-term consistency.

Details

Motivation: Real-time portrait animation needs high visual quality, temporal coherence, low latency, and responsive control for interactive applications like virtual assistants and live avatars. Current approaches have limitations: diffusion models are non-causal and not suitable for streaming, while causal autoregressive methods suffer from error accumulation, motion discontinuities at chunk boundaries, and degraded long-term consistency.

Method: Three key designs: (1) Chunk-wise generation with global identity preservation via cached KV states of reference images and local temporal modeling using sliding window attention; (2) Temporal knot module that overlaps adjacent chunks and propagates spatio-temporal cues via image-to-video conditioning to smooth inter-chunk motion transitions; (3) “Running ahead” mechanism that dynamically updates the reference frame’s temporal coordinate during inference to keep its semantic context ahead of current rollout frame for long-term coherence.

Result: Knot Forcing enables high-fidelity, temporally consistent, and interactive portrait animation over infinite sequences, achieving real-time performance with strong visual stability on consumer-grade GPUs.

Conclusion: The proposed streaming framework successfully addresses the challenges of real-time portrait animation by combining chunk-wise generation with effective temporal coherence mechanisms, making it suitable for interactive applications requiring both quality and low latency.

Abstract: Real-time portrait animation is essential for interactive applications such as virtual assistants and live avatars, requiring high visual fidelity, temporal coherence, ultra-low latency, and responsive control from dynamic inputs like reference images and driving signals. While diffusion-based models achieve strong quality, their non-causal nature hinders streaming deployment. Causal autoregressive video generation approaches enable efficient frame-by-frame generation but suffer from error accumulation, motion discontinuities at chunk boundaries, and degraded long-term consistency. In this work, we present a novel streaming framework named Knot Forcing for real-time portrait animation that addresses these challenges through three key designs: (1) a chunk-wise generation strategy with global identity preservation via cached KV states of the reference image and local temporal modeling using sliding window attention; (2) a temporal knot module that overlaps adjacent chunks and propagates spatio-temporal cues via image-to-video conditioning to smooth inter-chunk motion transitions; and (3) A “running ahead” mechanism that dynamically updates the reference frame’s temporal coordinate during inference, keeping its semantic context ahead of the current rollout frame to support long-term coherence. Knot Forcing enables high-fidelity, temporally consistent, and interactive portrait animation over infinite sequences, achieving real-time performance with strong visual stability on consumer-grade GPUs.

cs.AI

[377] Bidirectional RAG: Safe Self-Improving Retrieval-Augmented Generation Through Multi-Stage Validation

Teja Chinthala

Main category: cs.AI

TL;DR: Bidirectional RAG enables safe corpus expansion through validated write-back of high-quality generated responses, nearly doubling coverage while adding 72% fewer documents than naive write-back.

Details

Motivation: Conventional RAG systems use static knowledge bases that cannot evolve from user interactions, limiting their ability to accumulate knowledge over time.

Method: Introduces Bidirectional RAG with multi-stage acceptance layer: grounding verification (NLI-based entailment), attribution checking, and novelty detection to prevent hallucination pollution while enabling safe knowledge accumulation.

Result: Across 4 datasets (Natural Questions, TriviaQA, HotpotQA, Stack Overflow) with 3 random seeds (12 experiments per system), Bidirectional RAG achieves 40.58% average coverage (nearly doubling Standard RAG’s 20.33%) while adding 72% fewer documents than naive write-back (140 vs 500).

Conclusion: Self-improving RAG is feasible and safe when governed by rigorous validation, offering a practical path toward RAG systems that learn from deployment.

Abstract: Retrieval-Augmented Generation RAG systems enhance large language models by grounding responses in external knowledge bases, but conventional RAG architectures operate with static corpora that cannot evolve from user interactions. We introduce Bidirectional RAG, a novel RAG architecture that enables safe corpus expansion through validated write back of high quality generated responses. Our system employs a multi stage acceptance layer combining grounding verification (NLI based entailment, attribution checking, and novelty detection to prevent hallucination pollution while enabling knowledge accumulation. Across four datasets Natural Questions, TriviaQA, HotpotQA, Stack Overflow with three random seeds 12 experiments per system, Bidirectional RAG achieves 40.58% average coverage nearly doubling Standard RAG 20.33% while adding 72% fewer documents than naive write back 140 vs 500. Our work demonstrates that self improving RAG is feasible and safe when governed by rigorous validation, offering a practical path toward RAG systems that learn from deployment.

[378] Emergent Persuasion: Will LLMs Persuade Without Being Prompted?

Vincent Chang, Thee Ho, Sunishchal Dev, Kevin Zhu, Shi Feng, Kellin Pelrine, Matthew Kowal

Main category: cs.AI

TL;DR: LLMs can persuade users without explicit prompting, especially after supervised fine-tuning on persuasion datasets, raising concerns about emergent harmful persuasion risks.

Details

Motivation: Previous research focused on LLM persuasion through misuse (bad actors explicitly prompting models to persuade). This paper investigates when models might persuade without explicit prompting, which represents a more concerning emergent risk that needs to be understood.

Method: Study unprompted persuasion in two scenarios: (1) when models are steered via internal activation steering along persona traits, and (2) when models are supervised-finetuned (SFT) to exhibit persuasion traits. Test both persuasion-related and unrelated traits.

Result: Activation steering towards traits (both persuasion-related and unrelated) does not reliably increase unprompted persuasion. However, supervised fine-tuning (SFT) does increase models’ tendency to persuade without explicit prompting. SFT on benign persuasion datasets leads to models with higher propensity to persuade on controversial/harmful topics.

Conclusion: Emergent harmful persuasion can arise from SFT on general persuasion datasets, even when those datasets contain only benign topics. This represents a significant risk that requires further study and mitigation strategies.

Abstract: With the wide-scale adoption of conversational AI systems, AI are now able to exert unprecedented influence on human opinion and beliefs. Recent work has shown that many Large Language Models (LLMs) comply with requests to persuade users into harmful beliefs or actions when prompted and that model persuasiveness increases with model scale. However, this prior work looked at persuasion from the threat model of $\textit{misuse}$ (i.e., a bad actor asking an LLM to persuade). In this paper, we instead aim to answer the following question: Under what circumstances would models persuade $\textit{without being explicitly prompted}$, which would shape how concerned we should be about such emergent persuasion risks. To achieve this, we study unprompted persuasion under two scenarios: (i) when the model is steered (through internal activation steering) along persona traits, and (ii) when the model is supervised-finetuned (SFT) to exhibit the same traits. We showed that steering towards traits, both related to persuasion and unrelated, does not reliably increase models’ tendency to persuade unprompted, however, SFT does. Moreover, SFT on general persuasion datasets containing solely benign topics admits a model that has a higher propensity to persuade on controversial and harmful topics–showing that emergent harmful persuasion can arise and should be studied further.

[379] GamiBench: Evaluating Spatial Reasoning and 2D-to-3D Planning Capabilities of MLLMs with Origami Folding Tasks

Ryan Spencer, Roey Yaari, Ritvik Vemavarapu, Joyce Yang, Steven Ngo, Utkarsh Sharma

Main category: cs.AI

TL;DR: GamiBench is a new benchmark for evaluating spatial reasoning in multimodal LLMs using origami-inspired folding tasks with 2D crease patterns and 3D shapes across multiple viewpoints.

Details

Motivation: Current MLLMs struggle with spatial reasoning - the ability to mentally track and manipulate objects across multiple views and over time. Existing benchmarks focus on static images or final outputs, missing the sequential and viewpoint-dependent nature of spatial reasoning.

Method: Created GamiBench with 186 regular and 186 impossible 2D crease patterns paired with 3D folded shapes from six viewpoints. Includes three VQA tasks: predicting 3D fold configurations, distinguishing valid viewpoints, and detecting impossible patterns. Introduces new metrics: viewpoint consistency (VC) and impossible fold selection rate (IFSR).

Result: Even leading models like GPT-5 and Gemini-2.5-Pro struggle with single-step spatial understanding, showing significant limitations in spatial reasoning capabilities.

Conclusion: GamiBench establishes a standardized framework for evaluating geometric understanding and spatial reasoning in MLLMs, providing holistic assessment of the entire reasoning process rather than just final predictions.

Abstract: Multimodal large language models (MLLMs) are proficient in perception and instruction-following, but they still struggle with spatial reasoning: the ability to mentally track and manipulate objects across multiple views and over time. Spatial reasoning is a key component of human intelligence, but most existing benchmarks focus on static images or final outputs, failing to account for the sequential and viewpoint-dependent nature of this skill. To close this gap, we introduce GamiBench, a benchmark designed to evaluate spatial reasoning and 2D-to-3D planning in MLLMs through origami-inspired folding tasks. GamiBench includes 186 regular and 186 impossible 2D crease patterns paired with their corresponding 3D folded shapes, produced from six distinct viewpoints across three visual question-answering (VQA) tasks: predicting 3D fold configurations, distinguishing valid viewpoints, and detecting impossible patterns. Unlike previous benchmarks that assess only final predictions, GamiBench holistically evaluates the entire reasoning process–measuring cross-view consistency, physical feasibility through impossible-fold detection, and interpretation of intermediate folding steps. It further introduces new diagnostic metrics–viewpoint consistency (VC) and impossible fold selection rate (IFSR)–to measure how well models handle folds of varying complexity. Our experiments show that even leading models such as GPT-5 and Gemini-2.5-Pro struggle on single-step spatial understanding. These contributions establish a standardized framework for evaluating geometric understanding and spatial reasoning in MLLMs. Dataset and code: https://github.com/stvngo/GamiBench.

[380] Toward Equitable Recovery: A Fairness-Aware AI Framework for Prioritizing Post-Flood Aid in Bangladesh

Farjana Yesmin, Romana Akter

Main category: cs.AI

TL;DR: AI fairness framework reduces biases in post-flood aid allocation in Bangladesh by 41.6%, ensuring vulnerable regions receive equitable assistance based on need rather than historical patterns.

Details

Motivation: Post-disaster aid allocation in developing nations suffers from systematic biases that disadvantage vulnerable regions, perpetuating historical inequities. Bangladesh's recurring flood disasters highlight the need for fairer aid distribution systems.

Method: Developed an adversarial debiasing model using fairness-aware representation learning adapted from healthcare AI. Employed gradient reversal layer to force bias-invariant representations. Used real 2022 Bangladesh flood data affecting 7.2M people across 87 upazilas in 11 districts.

Result: Reduced statistical parity difference by 41.6%, decreased regional fairness gaps by 43.2%, maintained strong predictive accuracy (R-squared=0.784 vs baseline 0.811). Generated actionable priority rankings for aid distribution.

Conclusion: Algorithmic fairness techniques can be effectively applied to humanitarian contexts, providing decision-makers with tools for more equitable disaster recovery strategies that prioritize genuine need over historical allocation patterns.

Abstract: Post-disaster aid allocation in developing nations often suffers from systematic biases that disadvantage vulnerable regions, perpetuating historical inequities. This paper presents a fairness-aware artificial intelligence framework for prioritizing post-flood aid distribution in Bangladesh, a country highly susceptible to recurring flood disasters. Using real data from the 2022 Bangladesh floods that affected 7.2 million people and caused 405.5 million US dollars in damages, we develop an adversarial debiasing model that predicts flood vulnerability while actively removing biases against marginalized districts and rural areas. Our approach adapts fairness-aware representation learning techniques from healthcare AI to disaster management, employing a gradient reversal layer that forces the model to learn bias-invariant representations. Experimental results on 87 upazilas across 11 districts demonstrate that our framework reduces statistical parity difference by 41.6 percent, decreases regional fairness gaps by 43.2 percent, and maintains strong predictive accuracy (R-squared=0.784 vs baseline 0.811). The model generates actionable priority rankings ensuring aid reaches the most vulnerable populations based on genuine need rather than historical allocation patterns. This work demonstrates how algorithmic fairness techniques can be effectively applied to humanitarian contexts, providing decision-makers with tools to implement more equitable disaster recovery strategies.

[381] With Great Capabilities Come Great Responsibilities: Introducing the Agentic Risk & Capability Framework for Governing Agentic AI Systems

Shaun Khoo, Jessica Foo, Roy Ka-Wei Lee

Main category: cs.AI

TL;DR: The paper introduces the Agentic Risk & Capability (ARC) Framework, a technical governance framework for managing risks in autonomous AI systems that can execute code, interact with the internet, and modify files.

Details

Motivation: Agentic AI systems present significant opportunities but also novel risks due to their autonomous capabilities, creating challenges for organizational governance in identifying, assessing, and mitigating evolving risks.

Method: The ARC Framework uses a capability-centric perspective to analyze agentic AI systems, identifies three primary risk sources (components, design, and capabilities), establishes connections between risk sources and technical controls, and provides a structured implementation approach.

Result: The framework provides a robust, adaptable methodology for organizations to navigate agentic AI complexities, enabling innovation while ensuring safe, secure, and responsible deployment of autonomous AI systems.

Conclusion: The ARC Framework offers a practical solution for technical governance of agentic AI, helping organizations balance innovation with risk management, and is made available as open-source for broader adoption.

Abstract: Agentic AI systems present both significant opportunities and novel risks due to their capacity for autonomous action, encompassing tasks such as code execution, internet interaction, and file modification. This poses considerable challenges for effective organizational governance, particularly in comprehensively identifying, assessing, and mitigating diverse and evolving risks. To tackle this, we introduce the Agentic Risk & Capability (ARC) Framework, a technical governance framework designed to help organizations identify, assess, and mitigate risks arising from agentic AI systems. The framework’s core contributions are: (1) it develops a novel capability-centric perspective to analyze a wide range of agentic AI systems; (2) it distills three primary sources of risk intrinsic to agentic AI systems - components, design, and capabilities; (3) it establishes a clear nexus between each risk source, specific materialized risks, and corresponding technical controls; and (4) it provides a structured and practical approach to help organizations implement the framework. This framework provides a robust and adaptable methodology for organizations to navigate the complexities of agentic AI, enabling rapid and effective innovation while ensuring the safe, secure, and responsible deployment of agentic AI systems. Our framework is open-sourced \href{https://govtech-responsibleai.github.io/agentic-risk-capability-framework/}{here}.

[382] We are not able to identify AI-generated images

Adrien Pavão

Main category: cs.AI

TL;DR: Humans perform only slightly better than random chance (54% accuracy) at distinguishing AI-generated portrait images from real photographs, despite believing they can easily tell them apart.

Details

Motivation: To test the common assumption that people can easily distinguish AI-generated images from real photographs, especially as synthetic media becomes more pervasive and realistic online.

Method: Interactive web experiment where 165 participants classified 20 images each (120 difficult cases total) as real or AI-generated. Real images came from CC12M dataset, while AI-generated counterparts were carefully curated using MidJourney.

Result: Average accuracy was only 54% (slightly above random chance), with limited improvement across repeated attempts. Response times averaged 7.3 seconds, and some images were consistently more deceptive than others.

Conclusion: Humans struggle to reliably detect AI-generated content even on relatively simple portrait images. As synthetic media improves, human judgment alone is insufficient for distinguishing real from artificial data, highlighting the need for greater awareness and ethical guidelines.

Abstract: AI-generated images are now pervasive online, yet many people believe they can easily tell them apart from real photographs. We test this assumption through an interactive web experiment where participants classify 20 images as real or AI-generated. Our dataset contains 120 difficult cases: real images sampled from CC12M, and carefully curated AI-generated counterparts produced with MidJourney. In total, 165 users completed 233 sessions. Their average accuracy was 54%, only slightly above random guessing, with limited improvement across repeated attempts. Response times averaged 7.3 seconds, and some images were consistently more deceptive than others. These results indicate that, even on relatively simple portrait images, humans struggle to reliably detect AI-generated content. As synthetic media continues to improve, human judgment alone is becoming insufficient for distinguishing real from artificial data. These findings highlight the need for greater awareness and ethical guidelines as AI-generated media becomes increasingly indistinguishable from reality.

[383] Shape of Thought: When Distribution Matters More than Correctness in Reasoning Tasks

Abhranil Chandra, Ayush Agrawal, Arian Hosseini, Sebastian Fischmeister, Rishabh Agarwal, Navin Goyal, Aaron Courville

Main category: cs.AI

TL;DR: Training language models on synthetic chain-of-thought traces from more capable models improves reasoning performance, even when those traces contain incorrect final answers.

Details

Motivation: To understand why synthetic reasoning traces from more capable models can improve a language model's reasoning capabilities, even when those traces lead to wrong answers, and to explore the factors behind this counterintuitive phenomenon.

Method: Used synthetic CoT datasets from more capable models to train smaller models, tested across various reasoning domains (math, algorithmic reasoning, code generation) using MATH, GSM8K, Countdown and MBPP datasets on models ranging from 1.5B to 9B parameters. Also conducted experiments: 1) paraphrasing human-annotated traces to shift distribution closer to model’s own, 2) introducing increasingly flawed CoT traces to test tolerance to reasoning errors.

Result: Synthetic CoT training outperforms human-annotated datasets on reasoning tasks. Two key factors identified: 1) synthetic data distribution is closer to model’s own distribution, making learning easier, 2) “incorrect” traces often contain valid reasoning steps that models can learn from. Paraphrasing human traces improves performance, and models show tolerance to partially flawed reasoning traces.

Conclusion: Dataset curation should prioritize closeness to model’s distribution over correctness of final answers. Correct final answers are not always reliable indicators of faithful reasoning processes, and partially flawed reasoning traces can still provide valuable learning signals.

Abstract: We present the surprising finding that a language model’s reasoning capabilities can be improved by training on synthetic datasets of chain-of-thought (CoT) traces from more capable models, even when all of those traces lead to an incorrect final answer. Our experiments show this approach can yield better performance on reasoning tasks than training on human-annotated datasets. We hypothesize that two key factors explain this phenomenon: first, the distribution of synthetic data is inherently closer to the language model’s own distribution, making it more amenable to learning. Second, these `incorrect’ traces are often only partially flawed and contain valid reasoning steps from which the model can learn. To further test the first hypothesis, we use a language model to paraphrase human-annotated traces – shifting their distribution closer to the model’s own distribution – and show that this improves performance. For the second hypothesis, we introduce increasingly flawed CoT traces and study to what extent models are tolerant to these flaws. We demonstrate our findings across various reasoning domains like math, algorithmic reasoning and code generation using MATH, GSM8K, Countdown and MBPP datasets on various language models ranging from 1.5B to 9B across Qwen, Llama, and Gemma models. Our study shows that curating datasets that are closer to the model’s distribution is a critical aspect to consider. We also show that a correct final answer is not always a reliable indicator of a faithful reasoning process.

[384] Logic Sketch Prompting (LSP): A Deterministic and Interpretable Prompting Method

Satvik Tripathi

Main category: cs.AI

TL;DR: Logic Sketch Prompting (LSP) improves LLM reliability on rule-based tasks using typed variables, condition evaluators, and validators, achieving significantly higher accuracy (0.83-0.89) than other prompting methods in pharmacologic compliance tasks.

Details

Motivation: LLMs are unreliable for tasks requiring strict rule adherence, determinism, and auditability, which is problematic for clinical, regulated, and safety-critical decision support systems.

Method: Logic Sketch Prompting (LSP) framework with typed variables, deterministic condition evaluators, and rule-based validators to produce traceable and repeatable outputs.

Result: LSP achieved highest accuracy (0.83-0.89) and F1 scores across all models, significantly outperforming zero-shot (0.24-0.60), concise (0.16-0.30), and chain-of-thought (0.56-0.75) prompting. McNemar tests showed statistical significance (p<0.01).

Conclusion: LSP improves determinism, interpretability, and consistency without sacrificing performance, making it suitable for clinical, regulated, and safety-critical decision support systems.

Abstract: Large language models (LLMs) excel at natural language reasoning but remain unreliable on tasks requiring strict rule adherence, determinism, and auditability. Logic Sketch Prompting (LSP) is a lightweight prompting framework that introduces typed variables, deterministic condition evaluators, and a rule based validator that produces traceable and repeatable outputs. Using two pharmacologic logic compliance tasks, we benchmark LSP against zero shot prompting, chain of thought prompting, and concise prompting across three open weight models: Gemma 2, Mistral, and Llama 3. Across both tasks and all models, LSP consistently achieves the highest accuracy (0.83 to 0.89) and F1 score (0.83 to 0.89), substantially outperforming zero shot prompting (0.24 to 0.60), concise prompts (0.16 to 0.30), and chain of thought prompting (0.56 to 0.75). McNemar tests show statistically significant gains for LSP across nearly all comparisons (p < 0.01). These results demonstrate that LSP improves determinism, interpretability, and consistency without sacrificing performance, supporting its use in clinical, regulated, and safety critical decision support systems.

[385] SciEvalKit: An Open-source Evaluation Toolkit for Scientific General Intelligence

Yiheng Wang, Yixin Chen, Shuo Li, Yifan Zhou, Bo Liu, Hengjian Gao, Jiakang Yuan, Jia Bu, Wanghan Xu, Yuhao Zhou, Xiangyu Zhao, Zhiwang Zhou, Fengxiang Wang, Haodong Duan, Songyang Zhang, Jun Yao, Han Deng, Yizhou Wang, Jiabei Xiao, Jiaqi Liu, Encheng Su, Yujie Liu, Weida Wang, Junchi Yao, Shenghe Zheng, Haoran Sun, Runmin Ma, Xiangchao Yan, Bo Zhang, Dongzhan Zhou, Shufei Zhang, Peng Ye, Xiaosong Wang, Shixiang Tang, Wenlong Zhang, Lei Bai

Main category: cs.AI

TL;DR: SciEvalKit is a unified benchmarking toolkit for evaluating AI models across scientific disciplines, focusing on core scientific competencies like multimodal reasoning, symbolic reasoning, and hypothesis generation.

Details

Motivation: There's a need for specialized evaluation platforms for scientific AI models that go beyond general-purpose benchmarks, focusing on authentic scientific challenges across diverse disciplines.

Method: The toolkit builds expert-grade scientific benchmarks from real-world datasets, supports six major scientific domains, and features a flexible evaluation pipeline with batch evaluation, custom model/dataset integration, and transparent results.

Result: SciEvalKit provides a standardized yet customizable infrastructure for benchmarking scientific foundation models, bridging capability-based evaluation with disciplinary diversity.

Conclusion: The open-sourced toolkit offers a comprehensive solution for evaluating scientific AI models, fostering community-driven development in AI4Science through reproducible and comparable benchmarking.

Abstract: We introduce SciEvalKit, a unified benchmarking toolkit designed to evaluate AI models for science across a broad range of scientific disciplines and task capabilities. Unlike general-purpose evaluation platforms, SciEvalKit focuses on the core competencies of scientific intelligence, including Scientific Multimodal Perception, Scientific Multimodal Reasoning, Scientific Multimodal Understanding, Scientific Symbolic Reasoning, Scientific Code Generation, Science Hypothesis Generation and Scientific Knowledge Understanding. It supports six major scientific domains, spanning from physics and chemistry to astronomy and materials science. SciEvalKit builds a foundation of expert-grade scientific benchmarks, curated from real-world, domain-specific datasets, ensuring that tasks reflect authentic scientific challenges. The toolkit features a flexible, extensible evaluation pipeline that enables batch evaluation across models and datasets, supports custom model and dataset integration, and provides transparent, reproducible, and comparable results. By bridging capability-based evaluation and disciplinary diversity, SciEvalKit offers a standardized yet customizable infrastructure to benchmark the next generation of scientific foundation models and intelligent agents. The toolkit is open-sourced and actively maintained to foster community-driven development and progress in AI4Science.

[386] Agent2World: Learning to Generate Symbolic World Models via Adaptive Multi-Agent Feedback

Mengkang Hu, Bowei Xia, Yuran Wu, Ailing Yu, Yude Zou, Qiguang Chen, Shijian Wang, Jiarui Jin, Kexin Li, Wenxiang Jiao, Yuan Lu, Ping Luo

Main category: cs.AI

TL;DR: Agent2World is a multi-agent framework that generates executable world models using web search, implementation, and adaptive testing, achieving state-of-the-art performance and serving as a data engine for supervised fine-tuning.

Details

Motivation: Current approaches to training LLMs for symbolic world model generation suffer from lack of large-scale verifiable supervision and rely on static validation methods that miss behavior-level errors during interactive execution.

Method: Three-stage multi-agent pipeline: 1) Deep Researcher agent performs web search for knowledge synthesis, 2) Model Developer agent implements executable world models, 3) Testing Team conducts adaptive unit testing and simulation-based validation.

Result: Achieves state-of-the-art performance across three benchmarks for PDDL and executable code representations. Fine-tuned models show 30.95% average relative improvement over baseline.

Conclusion: Agent2World provides both strong inference-time world-model generation and serves as a data engine for supervised fine-tuning through multi-agent feedback, addressing limitations of current static validation approaches.

Abstract: Symbolic world models (e.g., PDDL domains or executable simulators) are central to model-based planning, but training LLMs to generate such world models is limited by the lack of large-scale verifiable supervision. Current approaches rely primarily on static validation methods that fail to catch behavior-level errors arising from interactive execution. In this paper, we propose Agent2World, a tool-augmented multi-agent framework that achieves strong inference-time world-model generation and also serves as a data engine for supervised fine-tuning, by grounding generation in multi-agent feedback. Agent2World follows a three-stage pipeline: (i) A Deep Researcher agent performs knowledge synthesis by web searching to address specification gaps; (ii) A Model Developer agent implements executable world models; And (iii) a specialized Testing Team conducts adaptive unit testing and simulation-based validation. Agent2World demonstrates superior inference-time performance across three benchmarks spanning both Planning Domain Definition Language (PDDL) and executable code representations, achieving consistent state-of-the-art results. Beyond inference, Testing Team serves as an interactive environment for the Model Developer, providing behavior-aware adaptive feedback that yields multi-turn training trajectories. The model fine-tuned on these trajectories substantially improves world-model generation, yielding an average relative gain of 30.95% over the same model before training. Project page: https://agent2world.github.io.

[387] Subgoaling Relaxation-based Heuristics for Numeric Planning with Infinite Actions

Ángel Aso-Mollar, Diego Aineto, Enrico Scala, Eva Onaindia

Main category: cs.AI

TL;DR: Numeric planning with control parameters introduces infinite action possibilities, making standard heuristics infeasible. The paper proposes an optimistic compilation approach for a tractable subset (controllable, simple numeric problems) to enable use of traditional numeric heuristics.

Details

Motivation: Standard numeric planning heuristics fail when dealing with control parameters that create infinite action spaces, limiting the applicability of existing planning techniques to problems with free numeric variables.

Method: Identify controllable, simple numeric problems as a tractable subset, then use optimistic compilation that transforms them into simple numeric tasks by abstracting control-dependent expressions into bounded constant effects and relaxed preconditions.

Result: The approach enables effective use of subgoaling heuristics to estimate goal distance in numeric planning with control parameters, demonstrating computational feasibility and pushing state-of-the-art boundaries.

Conclusion: The optimistic compilation method provides a practical way to apply traditional numeric heuristics to planning problems with infinite action spaces due to control parameters, expanding the scope of solvable numeric planning problems.

Abstract: Numeric planning with control parameters extends the standard numeric planning model by introducing action parameters as free numeric variables that must be instantiated during planning. This results in a potentially infinite number of applicable actions in a state. In this setting, off-the-shelf numeric heuristics that leverage the action structure are not feasible. In this paper, we identify a tractable subset of these problems–namely, controllable, simple numeric problems–and propose an optimistic compilation approach that transforms them into simple numeric tasks. To do so, we abstract control-dependent expressions into bounded constant effects and relaxed preconditions. The proposed compilation makes it possible to effectively use subgoaling heuristics to estimate goal distance in numeric planning problems involving control parameters. Our results demonstrate that this approach is an effective and computationally feasible way of applying traditional numeric heuristics to settings with an infinite number of possible actions, pushing the boundaries of the current state of the art.

[388] HalluMat: Detecting Hallucinations in LLM-Generated Materials Science Content Through Multi-Stage Verification

Bhanu Prakash Vangala, Sajid Mahmud, Pawan Neupane, Joel Selvaraj, Jianlin Cheng

Main category: cs.AI

TL;DR: HalluMatData benchmark and HalluMatDetector framework address LLM hallucination in materials science, reducing hallucinations by 30% with multi-stage detection and PHCS metric.

Details

Motivation: LLM hallucinations in scientific discovery compromise research integrity, especially in materials science where factual accuracy is critical for reliable knowledge generation and hypothesis formulation.

Method: Created HalluMatData benchmark dataset and developed HalluMatDetector framework with multi-stage approach: intrinsic verification, multi-source retrieval, contradiction graph analysis, and metric-based assessment. Introduced PHCS metric for quantifying inconsistencies.

Result: Hallucination levels vary significantly across materials science subdomains, with high-entropy queries showing greater inconsistencies. HalluMatDetector reduces hallucination rates by 30% compared to standard LLM outputs.

Conclusion: The proposed framework effectively detects and mitigates LLM hallucinations in materials science, enhancing research reliability through systematic verification and novel consistency metrics.

Abstract: Artificial Intelligence (AI), particularly Large Language Models (LLMs), is transforming scientific discovery, enabling rapid knowledge generation and hypothesis formulation. However, a critical challenge is hallucination, where LLMs generate factually incorrect or misleading information, compromising research integrity. To address this, we introduce HalluMatData, a benchmark dataset for evaluating hallucination detection methods, factual consistency, and response robustness in AI-generated materials science content. Alongside this, we propose HalluMatDetector, a multi-stage hallucination detection framework that integrates intrinsic verification, multi-source retrieval, contradiction graph analysis, and metric-based assessment to detect and mitigate LLM hallucinations. Our findings reveal that hallucination levels vary significantly across materials science subdomains, with high-entropy queries exhibiting greater factual inconsistencies. By utilizing HalluMatDetector verification pipeline, we reduce hallucination rates by 30% compared to standard LLM outputs. Furthermore, we introduce the Paraphrased Hallucination Consistency Score (PHCS) to quantify inconsistencies in LLM responses across semantically equivalent queries, offering deeper insights into model reliability.

[389] The Wisdom of Deliberating AI Crowds: Does Deliberation Improve LLM-Based Forecasting?

Paul Schneider, Amalie Schramm

Main category: cs.AI

TL;DR: LLM deliberation (reviewing each other’s forecasts) improves accuracy in diverse models with shared information but not in homogeneous groups, suggesting deliberation can enhance LLM forecasting.

Details

Motivation: To investigate whether structured deliberation (similar to what improves human forecasting) can enhance LLM forecasting accuracy by allowing models to review each other's forecasts before updating.

Method: Tested GPT-5, Claude Sonnet 4.5, and Gemini Pro 2.5 on 202 resolved binary questions from Metaculus Q2 2025 AI Forecasting Tournament across four scenarios: diverse models with distributed/shared information and homogeneous models with distributed/shared information.

Result: Deliberation significantly improved accuracy in diverse models with shared information (reducing Log Loss by 0.020, ~4% relative improvement, p=0.017). No benefit observed for homogeneous groups. Additional contextual information didn’t improve accuracy.

Conclusion: Deliberation may be a viable strategy for improving LLM forecasting, particularly when diverse models share information, but homogeneous groups don’t benefit from the same process.

Abstract: Structured deliberation has been found to improve the performance of human forecasters. This study investigates whether a similar intervention, i.e. allowing LLMs to review each other’s forecasts before updating, can improve accuracy in large language models (GPT-5, Claude Sonnet 4.5, Gemini Pro 2.5). Using 202 resolved binary questions from the Metaculus Q2 2025 AI Forecasting Tournament, accuracy was assessed across four scenarios: (1) diverse models with distributed information, (2) diverse models with shared information, (3) homogeneous models with distributed information, and (4) homogeneous models with shared information. Results show that the intervention significantly improves accuracy in scenario (2), reducing Log Loss by 0.020 or about 4 percent in relative terms (p = 0.017). However, when homogeneous groups (three instances of the same model) engaged in the same process, no benefit was observed. Unexpectedly, providing LLMs with additional contextual information did not improve forecast accuracy, limiting our ability to study information pooling as a mechanism. Our findings suggest that deliberation may be a viable strategy for improving LLM forecasting.

[390] Lightweight Inference-Time Personalization for Frozen Knowledge Graph Embeddings

Ozan Oguztuzun, Cerag Oguztuzun

Main category: cs.AI

TL;DR: GatedBias is a lightweight inference-time personalization framework that adapts frozen knowledge graph embeddings to individual user preferences using structure-gated adaptation with only ~300 parameters, improving personalization while preserving global accuracy.

Details

Motivation: Foundation models for knowledge graphs achieve strong cohort-level link prediction but fail to capture individual user preferences, creating a disconnect between general relational reasoning and personalized ranking.

Method: GatedBias uses structure-gated adaptation: profile-specific features combine with graph-derived binary gates to produce interpretable, per-entity biases. It adapts frozen KG embeddings at inference time without retraining, requiring only ~300 trainable parameters.

Result: Evaluation on Amazon-Book and Last-FM datasets shows statistically significant improvements in alignment metrics while preserving cohort performance. Counterfactual perturbation experiments show 6-30× greater rank improvements for entities benefiting from specific preference signals.

Conclusion: Personalized adaptation of foundation models can be both parameter-efficient and causally verifiable, bridging general knowledge representations with individual user needs.

Abstract: Foundation models for knowledge graphs (KGs) achieve strong cohort-level performance in link prediction, yet fail to capture individual user preferences; a key disconnect between general relational reasoning and personalized ranking. We propose GatedBias, a lightweight inference-time personalization framework that adapts frozen KG embeddings to individual user contexts without retraining or compromising global accuracy. Our approach introduces structure-gated adaptation: profile-specific features combine with graph-derived binary gates to produce interpretable, per-entity biases, requiring only ${\sim}300$ trainable parameters. We evaluate GatedBias on two benchmark datasets (Amazon-Book and Last-FM), demonstrating statistically significant improvements in alignment metrics while preserving cohort performance. Counterfactual perturbation experiments validate causal responsiveness; entities benefiting from specific preference signals show 6–30$\times$ greater rank improvements when those signals are boosted. These results show that personalized adaptation of foundation models can be both parameter-efficient and causally verifiable, bridging general knowledge representations with individual user needs.

[391] Monadic Context Engineering

Yifan Zhang, Mengdi Wang

Main category: cs.AI

TL;DR: MCE introduces a formal monadic framework for building robust AI agents by treating workflows as computational contexts with algebraic structures, enabling systematic composition and error handling.

Details

Motivation: Current LLM-based agent architectures use brittle, ad hoc patterns that struggle with state management, error handling, and concurrency, requiring a more principled foundation.

Method: Monadic Context Engineering (MCE) leverages Functors, Applicative Functors, and Monads to structure agent workflows as computational contexts, using Monad Transformers for systematic composition of capabilities.

Result: Enables construction of complex, resilient AI agents from simple verifiable components, with extensions to Meta-Agents for generative orchestration of sub-agent workflows.

Conclusion: MCE provides a formal algebraic foundation for agent design that addresses brittleness in current systems through principled composition of computational contexts.

Abstract: The proliferation of Large Language Models (LLMs) has catalyzed a shift towards autonomous agents capable of complex reasoning and tool use. However, current agent architectures are frequently constructed using imperative, ad hoc patterns. This results in brittle systems plagued by difficulties in state management, error handling, and concurrency. This paper introduces Monadic Context Engineering (MCE), a novel architectural paradigm leveraging the algebraic structures of Functors, Applicative Functors, and Monads to provide a formal foundation for agent design. MCE treats agent workflows as computational contexts where cross-cutting concerns, such as state propagation, short-circuiting error handling, and asynchronous execution, are managed intrinsically by the algebraic properties of the abstraction. We demonstrate how Monads enable robust sequential composition, how Applicatives provide a principled structure for parallel execution, and crucially, how Monad Transformers allow for the systematic composition of these capabilities. This layered approach enables developers to construct complex, resilient, and efficient AI agents from simple, independently verifiable components. We further extend this framework to describe Meta-Agents, which leverage MCE for generative orchestration, dynamically creating and managing sub-agent workflows through metaprogramming. Project Page: https://github.com/yifanzhang-pro/monadic-context-engineering.

[392] DarkPatterns-LLM: A Multi-Layer Benchmark for Detecting Manipulative and Harmful AI Behavior

Sadia Asif, Israel Antonio Rosales Laguan, Haris Khan, Shumaila Asif, Muneeb Asif

Main category: cs.AI

TL;DR: DarkPatterns-LLM is a new benchmark for detecting manipulative content in LLM outputs across seven harm categories, using a four-layer analytical framework and 401 expert-annotated examples.

Details

Motivation: Current safety benchmarks use coarse binary labels that fail to capture nuanced psychological and social mechanisms of manipulation in LLM outputs, creating concerns about user autonomy, trust, and well-being.

Method: Created a comprehensive benchmark dataset with 401 curated examples and a four-layer analytical pipeline: Multi-Granular Detection (MGD), Multi-Scale Intent Analysis (MSIAN), Threat Harmonization Protocol (THP), and Deep Contextual Risk Alignment (DCRA).

Result: Evaluation of state-of-the-art models (GPT-4, Claude 3.5, LLaMA-3-70B) showed significant performance disparities (65.2%–89.7%) and consistent weaknesses in detecting autonomy-undermining patterns.

Conclusion: DarkPatterns-LLM establishes the first standardized, multi-dimensional benchmark for manipulation detection in LLMs, offering actionable diagnostics for developing more trustworthy AI systems.

Abstract: The proliferation of Large Language Models (LLMs) has intensified concerns about manipulative or deceptive behaviors that can undermine user autonomy, trust, and well-being. Existing safety benchmarks predominantly rely on coarse binary labels and fail to capture the nuanced psychological and social mechanisms constituting manipulation. We introduce \textbf{DarkPatterns-LLM}, a comprehensive benchmark dataset and diagnostic framework for fine-grained assessment of manipulative content in LLM outputs across seven harm categories: Legal/Power, Psychological, Emotional, Physical, Autonomy, Economic, and Societal Harm. Our framework implements a four-layer analytical pipeline comprising Multi-Granular Detection (MGD), Multi-Scale Intent Analysis (MSIAN), Threat Harmonization Protocol (THP), and Deep Contextual Risk Alignment (DCRA). The dataset contains 401 meticulously curated examples with instruction-response pairs and expert annotations. Through evaluation of state-of-the-art models including GPT-4, Claude 3.5, and LLaMA-3-70B, we observe significant performance disparities (65.2%–89.7%) and consistent weaknesses in detecting autonomy-undermining patterns. DarkPatterns-LLM establishes the first standardized, multi-dimensional benchmark for manipulation detection in LLMs, offering actionable diagnostics toward more trustworthy AI systems.

[393] Multi-AI Agent Framework Reveals the “Oxide Gatekeeper” in Aluminum Nanoparticle Oxidation

Yiming Lu, Tingyu Lu, Di Zhang, Lili Ye, Hao Li

Main category: cs.AI

TL;DR: AI-human collaborative framework creates quantum-accurate ML potential for million-atom aluminum nanoparticle simulations, revealing temperature-dependent oxidation mechanisms and resolving cation vs oxygen diffusion debate.

Details

Motivation: Current computational methods for aluminum nanoparticles face limitations: ab initio methods are too small-scale, while empirical force fields lack reactive fidelity for complex combustion environments, leaving atomic mechanisms of explosive transition poorly understood.

Method: Developed a “human-in-the-loop” closed-loop framework where self-auditing AI Agents validate machine learning potential evolution, acting as scientific sentinels to visualize hidden model artifacts for human decision-making, ensuring quantum mechanical accuracy with near-linear scalability.

Result: Achieved quantum accuracy (energy RMSE: 1.2 meV/atom, force RMSE: 0.126 eV/Angstrom) with million-atom systems at nanosecond timescales. Discovered temperature-regulated dual-mode oxidation: “gatekeeper” breathing mode at moderate temperatures and catastrophic “rupture mode” above critical threshold. Resolved controversy showing aluminum cation outward diffusion dominates mass transfer (2-3 orders faster than oxygen) across all temperatures.

Conclusion: Established unified atomic-scale framework for energetic nanomaterial design, enabling precision engineering of ignition sensitivity and energy release rates through intelligent computational design, bridging quantum accuracy with large-scale simulations.

Abstract: Aluminum nanoparticles (ANPs) are among the most energy-dense solid fuels, yet the atomic mechanisms governing their transition from passivated particles to explosive reactants remain elusive. This stems from a fundamental computational bottleneck: ab initio methods offer quantum accuracy but are restricted to small spatiotemporal scales (< 500 atoms, picoseconds), while empirical force fields lack the reactive fidelity required for complex combustion environments. Herein, we bridge this gap by employing a “human-in-the-loop” closed-loop framework where self-auditing AI Agents validate the evolution of a machine learning potential (MLP). By acting as scientific sentinels that visualize hidden model artifacts for human decision-making, this collaborative cycle ensures quantum mechanical accuracy while exhibiting near-linear scalability to million-atom systems and accessing nanosecond timescales (energy RMSE: 1.2 meV/atom, force RMSE: 0.126 eV/Angstrom). Strikingly, our simulations reveal a temperature-regulated dual-mode oxidation mechanism: at moderate temperatures, the oxide shell acts as a dynamic “gatekeeper,” regulating oxidation through a “breathing mode” of transient nanochannels; above a critical threshold, a “rupture mode” unleashes catastrophic shell failure and explosive combustion. Importantly, we resolve a decades-old controversy by demonstrating that aluminum cation outward diffusion, rather than oxygen transport, dominates mass transfer across all temperature regimes, with diffusion coefficients consistently exceeding those of oxygen by 2-3 orders of magnitude. These discoveries establish a unified atomic-scale framework for energetic nanomaterial design, enabling the precision engineering of ignition sensitivity and energy release rates through intelligent computational design.

[394] SPIRAL: Symbolic LLM Planning via Grounded and Reflective Search

Yifan Zhang, Giridhar Ganapavarapu, Srideepika Jayaraman, Bhavna Agrawal, Dhaval Patel, Achille Fokoue

Main category: cs.AI

TL;DR: SPIRAL is a novel LLM planning framework that embeds three specialized LLM agents into MCTS for better complex task planning through creative proposal, realistic simulation, and reflective criticism.

Details

Motivation: LLMs struggle with complex planning tasks requiring exploration and self-correction due to linear reasoning that can't recover from early mistakes, while traditional search algorithms like MCTS are ineffective with sparse rewards and fail to leverage LLMs' semantic capabilities.

Method: SPIRAL embeds three specialized LLM agents into an MCTS loop: Planner proposes creative next steps, Simulator grounds search by predicting realistic outcomes, and Critic provides dense reward signals through reflection, transforming MCTS into guided, self-correcting reasoning.

Result: On DailyLifeAPIs and HuggingFace datasets, SPIRAL consistently outperforms Chain-of-Thought planning and other state-of-the-art agents, achieving 83.6% overall accuracy on DailyLifeAPIs (16+ percentage point improvement) with superior token efficiency.

Conclusion: Structuring LLM reasoning as guided, reflective, and grounded search process yields more robust and efficient autonomous planners, demonstrating the effectiveness of integrated planning pipelines with specialized agents.

Abstract: Large Language Models (LLMs) often falter at complex planning tasks that require exploration and self-correction, as their linear reasoning process struggles to recover from early mistakes. While search algorithms like Monte Carlo Tree Search (MCTS) can explore alternatives, they are often ineffective when guided by sparse rewards and fail to leverage the rich semantic capabilities of LLMs. We introduce SPIRAL (Symbolic LLM Planning via Grounded and Reflective Search), a novel framework that embeds a cognitive architecture of three specialized LLM agents into an MCTS loop. SPIRAL’s key contribution is its integrated planning pipeline where a Planner proposes creative next steps, a Simulator grounds the search by predicting realistic outcomes, and a Critic provides dense reward signals through reflection. This synergy transforms MCTS from a brute-force search into a guided, self-correcting reasoning process. On the DailyLifeAPIs and HuggingFace datasets, SPIRAL consistently outperforms the default Chain-of-Thought planning method and other state-of-the-art agents. More importantly, it substantially surpasses other state-of-the-art agents; for example, SPIRAL achieves 83.6% overall accuracy on DailyLifeAPIs, an improvement of over 16 percentage points against the next-best search framework, while also demonstrating superior token efficiency. Our work demonstrates that structuring LLM reasoning as a guided, reflective, and grounded search process yields more robust and efficient autonomous planners. The source code, full appendices, and all experimental data are available for reproducibility at the official project repository.

[395] Lessons from Neuroscience for AI: How integrating Actions, Compositional Structure and Episodic Memory could enable Safe, Interpretable and Human-Like AI

Rajesh P. N. Rao, Vishwas Sathish, Linxing Preston Jiang, Matthew Bryan, Prashant Rangarajan

Main category: cs.AI

TL;DR: Foundation models should integrate actions, hierarchical composition, and episodic memory to achieve safer, more interpretable, and human-like AI, addressing current limitations like hallucinations and lack of grounding.

Details

Motivation: Current foundation models (LLMs) are based on predictive coding but ignore three crucial components from neuroscience: action integration, hierarchical composition, and episodic memory. These omissions lead to deficiencies like hallucinations, superficial understanding, lack of agency, safety threats, and energy inefficiency.

Method: Proposes integrating three missing components from neuroscience into foundation models: 1) tight integration of actions with generative models, 2) hierarchical compositional structure, and 3) episodic memory. Compares this approach to current trends like chain-of-thought reasoning and retrieval-augmented generation.

Result: The paper presents evidence from neuroscience supporting the importance of each component and argues that adding these brain-inspired elements could address current AI deficiencies. It suggests new ways to augment foundation models with these components.

Conclusion: A renewed exchange between brain science and AI will help develop safe, interpretable, human-centered AI by integrating action, composition, and memory into foundation models, moving beyond current predictive coding limitations.

Abstract: The phenomenal advances in large language models (LLMs) and other foundation models over the past few years have been based on optimizing large-scale transformer models on the surprisingly simple objective of minimizing next-token prediction loss, a form of predictive coding that is also the backbone of an increasingly popular model of brain function in neuroscience and cognitive science. However, current foundation models ignore three other important components of state-of-the-art predictive coding models: tight integration of actions with generative models, hierarchical compositional structure, and episodic memory. We propose that to achieve safe, interpretable, energy-efficient, and human-like AI, foundation models should integrate actions, at multiple scales of abstraction, with a compositional generative architecture and episodic memory. We present recent evidence from neuroscience and cognitive science on the importance of each of these components. We describe how the addition of these missing components to foundation models could help address some of their current deficiencies: hallucinations and superficial understanding of concepts due to lack of grounding, a missing sense of agency/responsibility due to lack of control, threats to safety and trustworthiness due to lack of interpretability, and energy inefficiency. We compare our proposal to current trends, such as adding chain-of-thought (CoT) reasoning and retrieval-augmented generation (RAG) to foundation models, and discuss new ways of augmenting these models with brain-inspired components. We conclude by arguing that a rekindling of the historically fruitful exchange of ideas between brain science and AI will help pave the way towards safe and interpretable human-centered AI.

[396] SANet: A Semantic-aware Agentic AI Networking Framework for Cross-layer Optimization in 6G

Yong Xiao, Xubo Li, Haoran Zhou, Yingyu Li, Yayu Gao, Guangming Shi, Ping Zhang, Marwan Krunz

Main category: cs.AI

TL;DR: SANet is a semantic-aware AgentNet architecture for wireless networks that uses AI agents to infer user goals and coordinate across network layers, with a MoPS framework for efficient model sharing and decentralized optimization.

Details

Motivation: Agentic AI networking enables autonomous network management, but faces challenges with decentralized agents having potentially conflicting objectives. The paper aims to address this by developing a semantic-aware architecture that can handle multi-agent, multi-objective optimization in wireless networks.

Method: Proposes SANet architecture with semantic goal inference, formulates decentralized optimization as multi-agent multi-objective problem, develops MoPS framework for model partitioning/sharing, proposes two decentralized optimization algorithms, and implements hardware prototype across RAN and core network layers.

Result: Experimental results show 14.61% performance gains while requiring only 44.37% of FLOPs compared to state-of-the-art algorithms. Theoretical analysis reveals three-way tradeoff among optimization, generalization, and conflicting errors.

Conclusion: SANet successfully demonstrates semantic-aware AgentNet for wireless networks with efficient decentralized optimization, achieving significant performance improvements with reduced computational requirements while addressing conflicting objectives among collaborating agents.

Abstract: Agentic AI networking (AgentNet) is a novel AI-native networking paradigm in which a large number of specialized AI agents collaborate to perform autonomous decision-making, dynamic environmental adaptation, and complex missions. It has the potential to facilitate real-time network management and optimization functions, including self-configuration, self-optimization, and self-adaptation across diverse and complex environments. This paper proposes SANet, a novel semantic-aware AgentNet architecture for wireless networks that can infer the semantic goal of the user and automatically assign agents associated with different layers of the network to fulfill the inferred goal. Motivated by the fact that AgentNet is a decentralized framework in which collaborating agents may generally have different and even conflicting objectives, we formulate the decentralized optimization of SANet as a multi-agent multi-objective problem, and focus on finding the Pareto-optimal solution for agents with distinct and potentially conflicting objectives. We propose three novel metrics for evaluating SANet. Furthermore, we develop a model partition and sharing (MoPS) framework in which large models, e.g., deep learning models, of different agents can be partitioned into shared and agent-specific parts that are jointly constructed and deployed according to agents’ local computational resources. Two decentralized optimization algorithms are proposed. We derive theoretical bounds and prove that there exists a three-way tradeoff among optimization, generalization, and conflicting errors. We develop an open-source RAN and core network-based hardware prototype that implements agents to interact with three different layers of the network. Experimental results show that the proposed framework achieved performance gains of up to 14.61% while requiring only 44.37% of FLOPs required by state-of-the-art algorithms.

[397] Tyee: A Unified, Modular, and Fully-Integrated Configurable Toolkit for Intelligent Physiological Health Care

Tao Zhou, Lingyu Shu, Zixing Zhang, Jing Han

Main category: cs.AI

TL;DR: Tyee is a unified toolkit for physiological signal analysis that addresses data heterogeneity, preprocessing inconsistencies, and reproducibility issues through a configurable, modular framework.

Details

Motivation: Deep learning progress in physiological signal analysis is hindered by heterogeneous data formats, inconsistent preprocessing strategies, fragmented model pipelines, and non-reproducible experimental setups.

Method: Tyee introduces three key innovations: 1) unified data interface and configurable preprocessing for 12 signal modalities, 2) modular/extensible architecture for flexible integration, and 3) end-to-end workflow configuration for reproducibility.

Result: Tyee demonstrates consistent practical effectiveness and generalizability, outperforming or matching baselines across all evaluated tasks, achieving state-of-the-art results on 12 of 13 datasets.

Conclusion: Tyee provides a comprehensive solution for intelligent physiological healthcare with open-source availability and active maintenance, addressing key reproducibility and scalability challenges in the field.

Abstract: Deep learning has shown great promise in physiological signal analysis, yet its progress is hindered by heterogeneous data formats, inconsistent preprocessing strategies, fragmented model pipelines, and non-reproducible experimental setups. To address these limitations, we present Tyee, a unified, modular, and fully-integrated configurable toolkit designed for intelligent physiological healthcare. Tyee introduces three key innovations: (1) a unified data interface and configurable preprocessing pipeline for 12 kinds of signal modalities; (2) a modular and extensible architecture enabling flexible integration and rapid prototyping across tasks; and (3) end-to-end workflow configuration, promoting reproducible and scalable experimentation. Tyee demonstrates consistent practical effectiveness and generalizability, outperforming or matching baselines across all evaluated tasks (with state-of-the-art results on 12 of 13 datasets). The Tyee toolkit is released at https://github.com/SmileHnu/Tyee and actively maintained.

Junshu Dai, Yu Wang, Tongya Zheng, Wei Ji, Qinghong Guo, Ji Cao, Jie Song, Canghong Jin, Mingli Song

Main category: cs.AI

TL;DR: M³ob: Multi-modal mobility prediction using LLM-enhanced spatial-temporal knowledge graphs to bridge semantic gaps and improve generalization for location recommendation.

Details

Motivation: Existing human mobility prediction methods have limited generalization - unimodal approaches suffer from data sparsity and biases, while multi-modal methods struggle to capture mobility dynamics due to semantic gaps between static multi-modal representations and spatial-temporal dynamics.

Method: 1) Construct unified spatial-temporal relational graph (STRG) using LLM-enhanced spatial-temporal knowledge graph (STKG) to capture functional semantics and spatial-temporal knowledge. 2) Design gating mechanism to fuse spatial-temporal graph representations of different modalities. 3) Propose STKG-guided cross-modal alignment to inject spatial-temporal dynamic knowledge into static image modality.

Result: Extensive experiments on six public datasets show consistent improvements in normal scenarios and significant generalization ability in abnormal scenarios.

Conclusion: The proposed M³ob framework effectively leverages multi-modal spatial-temporal knowledge to characterize mobility dynamics, addressing limitations of existing methods and demonstrating strong performance and generalization for location recommendation tasks.

Abstract: The precise prediction of human mobility has produced significant socioeconomic impacts, such as location recommendations and evacuation suggestions. However, existing methods suffer from limited generalization capability: unimodal approaches are constrained by data sparsity and inherent biases, while multi-modal methods struggle to effectively capture mobility dynamics caused by the semantic gap between static multi-modal representation and spatial-temporal dynamics. Therefore, we leverage multi-modal spatial-temporal knowledge to characterize mobility dynamics for the location recommendation task, dubbed as \textbf{M}ulti-\textbf{M}odal \textbf{Mob}ility (\textbf{M}$^3$\textbf{ob}). First, we construct a unified spatial-temporal relational graph (STRG) for multi-modal representation, by leveraging the functional semantics and spatial-temporal knowledge captured by the large language models (LLMs)-enhanced spatial-temporal knowledge graph (STKG). Second, we design a gating mechanism to fuse spatial-temporal graph representations of different modalities, and propose an STKG-guided cross-modal alignment to inject spatial-temporal dynamic knowledge into the static image modality. Extensive experiments on six public datasets show that our proposed method not only achieves consistent improvements in normal scenarios but also exhibits significant generalization ability in abnormal scenarios.

[399] LLM Agents as VC investors: Predicting Startup Success via RolePlay-Based Collective Simulation

Zhongyang Liu, Haoyu Pei, Xiangyi Xiao, Xiaocong Du, Yihui Li, Suting Hong, Kunpeng Zhang, Haipeng Zhang

Main category: cs.AI

TL;DR: SimVC-CAS: A multi-agent system that simulates venture capital decision-making by modeling investor groups, improving startup success prediction accuracy by ~25%.

Details

Motivation: Startup success prediction is critical but existing approaches overlook collective investor dynamics in real-world VC decisions, which are dominated by investor groups rather than single decision-makers.

Method: Proposes SimVC-CAS, a collective agent system that simulates VC decision-making as multi-agent interactions. Uses role-playing agents with unique traits/preferences and a GNN-based supervised interaction module to capture enterprise fundamentals and investor network dynamics through a graph-structured co-investment network.

Result: Using real-world PitchBook data with strict data leakage controls, SimVC-CAS shows significant predictive accuracy improvements (~25% relative improvement in average precision@10) while providing interpretable, multi-perspective reasoning.

Conclusion: The approach successfully models collective VC decision-making, improves startup financing prediction, and offers insights applicable to other complex group decision scenarios beyond venture capital.

Abstract: Due to the high value and high failure rate of startups, predicting their success has become a critical challenge across interdisciplinary research. Existing approaches typically model success prediction from the perspective of a single decision-maker, overlooking the collective dynamics of investor groups that dominate real-world venture capital (VC) decisions. In this paper, we propose SimVC-CAS, a novel collective agent system that simulates VC decision-making as a multi-agent interaction process. By designing role-playing agents and a GNN-based supervised interaction module, we reformulate startup financing prediction as a group decision-making task, capturing both enterprise fundamentals and the behavioral dynamics of potential investor networks. Each agent embodies an investor with unique traits and preferences, enabling heterogeneous evaluation and realistic information exchange through a graph-structured co-investment network. Using real-world data from PitchBook and under strict data leakage controls, we show that SimVC-CAS significantly improves predictive accuracy while providing interpretable, multiperspective reasoning, for example, approximately 25% relative improvement with respect to average precision@10. SimVC-CAS also sheds light on other complex group decision scenarios.

[400] DICE: Discrete Interpretable Comparative Evaluation with Probabilistic Scoring for Retrieval-Augmented Generation

Shiyan Liu, Jian Ma, Rui Qu

Main category: cs.AI

TL;DR: DICE is a two-stage evaluation framework for RAG systems that combines deep analytical reasoning with probabilistic scoring to provide explainable, confidence-aware judgments while reducing computational complexity through Swiss-system tournaments.

Details

Motivation: Existing RAG evaluation metrics have limited interpretability, inadequate uncertainty quantification, and computational inefficiency in multi-system comparisons, hindering responsible deployment of RAG technologies.

Method: Two-stage evidence-coupled framework combining deep analytical reasoning with probabilistic {A, B, Tie} scoring, using Swiss-system tournament to reduce computational complexity from O(N²) to O(N log N).

Result: Achieves 85.7% agreement with human experts on Chinese financial QA dataset, outperforming existing LLM-based metrics like RAGAS, with 42.9% computational reduction in eight-system evaluation.

Conclusion: DICE establishes a responsible, explainable, and efficient paradigm for trustworthy RAG system assessment with transparent reasoning traces for systematic error diagnosis.

Abstract: As Retrieval-Augmented Generation (RAG) systems evolve toward more sophisticated architectures, ensuring their trustworthiness through explainable and robust evaluation becomes critical. Existing scalar metrics suffer from limited interpretability, inadequate uncertainty quantification, and computational inefficiency in multi-system comparisons, hindering responsible deployment of RAG technologies. We introduce DICE (Discrete Interpretable Comparative Evaluation), a two-stage, evidence-coupled framework that advances explainability and robustness in RAG evaluation. DICE combines deep analytical reasoning with probabilistic ${A, B, Tie}$ scoring to produce transparent, confidence-aware judgments that support accountable system improvement through interpretable reasoning traces, enabling systematic error diagnosis and actionable insights. To address efficiency challenges at scale, DICE employs a Swiss-system tournament that reduces computational complexity from $O(N^2)$ to $O(N \log N)$, achieving a 42.9% reduction in our eight-system evaluation while preserving ranking fidelity. Validation on a curated Chinese financial QA dataset demonstrates that DICE achieves 85.7% agreement with human experts, substantially outperforming existing LLM-based metrics such as RAGAS. Our results establish DICE as a responsible, explainable, and efficient paradigm for trustworthy RAG system assessment.

[401] TravelBench: A Real-World Benchmark for Multi-Turn and Tool-Augmented Travel Planning

Xiang Cheng, Yulan Hu, Xiangwen Zhang, Lu Xu, Zheng Pan, Xin Li, Yong Liu

Main category: cs.AI

TL;DR: TravelBench is a new real-world travel planning benchmark with multi-turn interaction and tool use to evaluate LLM agent capabilities more comprehensively than existing limited benchmarks.

Details

Motivation: Existing travel planning benchmarks are limited in domain coverage and multi-turn interaction, failing to support dynamic user-agent interaction and comprehensive assessment of LLM agent capabilities in real-world scenarios.

Method: Collected user requests from real-world scenarios and constructed three subsets (multi-turn, single-turn, unsolvable). Built a controlled sandbox environment with 10 travel-domain tools providing deterministic outputs for reliable evaluation.

Result: Evaluated multiple LLMs on TravelBench and conducted analysis of their behaviors and performance, demonstrating the benchmark’s practical utility.

Conclusion: TravelBench offers a practical and reproducible benchmark for advancing LLM agents in travel planning, addressing limitations of prior work through real-world scenarios and controlled evaluation environment.

Abstract: Large language model (LLM) agents have demonstrated strong capabilities in planning and tool use. Travel planning provides a natural and high-impact testbed for these capabilities, as it requires multi-step reasoning, iterative preference elicitation through interaction, and calls to external tools under evolving constraints. Prior work has studied LLMs on travel-planning tasks, but existing settings are limited in domain coverage and multi-turn interaction. As a result, they cannot support dynamic user-agent interaction and therefore fail to comprehensively assess agent capabilities. In this paper, we introduce TravelBench, a real-world travel-planning benchmark featuring multi-turn interaction and tool use. We collect user requests from real-world scenarios and construct three subsets-multi-turn, single-turn, and unsolvable-to evaluate different aspects of agent performance. For stable and reproducible evaluation, we build a controlled sandbox environment with 10 travel-domain tools, providing deterministic tool outputs for reliable reasoning. We evaluate multiple LLMs on TravelBench and conduct an analysis of their behaviors and performance. TravelBench offers a practical and reproducible benchmark for advancing LLM agents in travel planning.

[402] Memento-II: Learning by Stateful Reflective Memory

Jun Wang

Main category: cs.AI

TL;DR: A theoretical framework for continual learning in LLM agents that uses episodic memory and reflection instead of backpropagation, enabling adaptation without parameter updates.

Details

Motivation: To enable large language model agents to learn continually through interaction without requiring backpropagation or model fine-tuning, bridging the gap between training and deployment phases.

Method: Introduces Stateful Reflective Decision Process (SRDP) that models reflective learning as a two-stage read-write interaction with episodic memory: writing stores outcomes (policy evaluation) and reading retrieves past cases (policy improvement).

Result: The process induces an equivalent Markov Decision Process over augmented state-memory representations, allowing use of classical RL tools. With entropy-regularized policy iteration, convergence guarantees are established - as episodic memory grows and covers state space sufficiently, the policy converges to optimal.

Conclusion: Provides a principled foundation for memory-augmented, retrieval-based LLM agents capable of continual adaptation without parameter updates, using reflection as the key mechanism for learning.

Abstract: We propose a theoretical framework for continual and experiential learning in large language model agents that integrates episodic memory with reinforcement learning. The framework identifies reflection as the key mechanism that enables agents to adapt through interaction without back propagation or model fine tuning, thereby relaxing the conventional separation between training and deployment.To formalise this process, we introduce the Stateful Reflective Decision Process, which models reflective learning as a two stage read write interaction with episodic memory. Writing stores interaction outcomes and corresponds to policy evaluation, while reading retrieves relevant past cases and corresponds to policy improvement. We show that this process induces an equivalent Markov decision process over augmented state memory representations, allowing the use of classical tools from dynamic programming and reinforcement learning. We further instantiate the framework using entropy regularised policy iteration and establish convergence guarantees. As episodic memory grows and achieves sufficient coverage of the state space, the resulting policy converges to the optimal solution. This work provides a principled foundation for memory augmented and retrieval based language model agents capable of continual adaptation without parameter updates.

[403] Scaling Clinician-Grade Feature Generation from Clinical Notes with Multi-Agent Language Models

Jiayi Wang, Jacqueline Jil Vallon, Nikhil V. Kotha, Neil Panjwani, Xi Ling, Margaret Redfield, Sushmita Vij, Sandy Srinivas, John Leppert, Mark K. Buyyounouski, Mohsen Bayati

Main category: cs.AI

TL;DR: SNOW is a multi-agent LLM system that automates expert-level feature extraction from clinical notes, achieving comparable performance to manual clinical abstraction while reducing human effort by 48x and demonstrating cross-domain generalizability.

Details

Motivation: Clinical prediction models are bottlenecked by the labor-intensive manual extraction of structured features from unstructured EHR notes, which requires domain expertise and doesn't scale.

Method: Developed SNOW (Scalable Note-to-Outcome Workflow) - a transparent multi-agent LLM system that mimics clinical experts’ iterative reasoning and validation workflow to autonomously extract features from clinical notes.

Result: SNOW achieved AUC-ROC 0.767 for 5-year prostate cancer recurrence prediction (comparable to manual CFG’s 0.762), reduced human effort by 48-fold (12 hours vs manual CFG), and successfully generalized to HFpEF mortality prediction without task-specific tuning (AUC-ROC 0.851 for 30-day, 0.763 for 1-year).

Conclusion: Modular LLM agent-based systems can scale expert-level feature generation from clinical notes while maintaining interpretability and generalizability across different clinical settings and conditions.

Abstract: Developing accurate clinical prediction models is often bottlenecked by the difficulty of deriving meaningful structured features from unstructured EHR notes, a process that traditionally requires manual, unscalable clinical abstraction. In this study, we first established a rigorous patient-level Clinician Feature Generation (CFG) protocol, in which domain experts manually reviewed notes to define and extract nuanced features for a cohort of 147 patients with prostate cancer. As a high-fidelity ground truth, this labor-intensive process provided the blueprint for SNOW (Scalable Note-to-Outcome Workflow), a transparent multi-agent large language model (LLM) system designed to autonomously mimic the iterative reasoning and validation workflow of clinical experts. On 5-year cancer recurrence prediction, SNOW (AUC-ROC 0.767) achieved performance comparable to manual CFG (0.762) and outperformed structured baselines, clinician-guided LLM extraction, and six representational feature generation (RFG) approaches. Once configured, SNOW produced the full patient-level feature table in 12 hours with 5 hours of clinician oversight, reducing human expert effort by approximately 48-fold versus manual CFG. To test scalability where manual CFG is infeasible, we deployed SNOW on an external heart failure with preserved ejection fraction (HFpEF) cohort from MIMIC-IV (n=2,084); without task-specific tuning, SNOW generated prognostic features that outperformed baseline and RFG methods for 30-day (SNOW: 0.851) and 1-year (SNOW: 0.763) mortality prediction. These results demonstrate that a modular LLM agent-based system can scale expert-level feature generation from clinical notes, while enabling interpretable use of unstructured EHR text in outcome prediction and preserving generalizability across a variety of settings and conditions.

[404] SAMP-HDRL: Segmented Allocation with Momentum-Adjusted Utility for Multi-agent Portfolio Management via Hierarchical Deep Reinforcement Learning

Xiaotian Ren, Nuerxiati Abudurexiti, Zhengyong Jiang, Angelos Stefanidis, Hongbin Liu, Jionglong Su

Main category: cs.AI

TL;DR: SAMP-HDRL is a hierarchical DRL framework for portfolio management that uses dynamic asset grouping, upper-lower agent coordination, and utility-based capital allocation to handle non-stationary markets with improved performance and interpretability.

Details

Motivation: Portfolio optimization faces challenges in non-stationary markets due to regime shifts, dynamic correlations, and limited interpretability of DRL policies. Existing methods struggle with adaptability and transparency in complex financial environments.

Method: Proposes SAMP-HDRL framework: 1) Dynamic asset grouping partitions market into high-quality/ordinary subsets, 2) Upper-level agent extracts global market signals, 3) Lower-level agents perform intra-group allocation under mask constraints, 4) Utility-based capital allocation integrates risky/risk-free assets for coherent coordination.

Result: Outperforms 9 traditional baselines and 9 DRL benchmarks across three market regimes (2019-2021). Achieves at least 5% higher Return, Sharpe ratio, Sortino ratio, and 2% higher Omega ratio vs strongest baseline, with larger gains in turbulent markets. SHAP analysis reveals complementary “diversified + concentrated” mechanism.

Conclusion: SAMP-HDRL embeds structural market constraints directly into DRL pipeline, offering improved adaptability, robustness, and interpretability in complex financial environments. Upper-lower coordination, dynamic clustering, and capital allocation are essential for robustness.

Abstract: Portfolio optimization in non-stationary markets is challenging due to regime shifts, dynamic correlations, and the limited interpretability of deep reinforcement learning (DRL) policies. We propose a Segmented Allocation with Momentum-Adjusted Utility for Multi-agent Portfolio Management via Hierarchical Deep Reinforcement Learning (SAMP-HDRL). The framework first applies dynamic asset grouping to partition the market into high-quality and ordinary subsets. An upper-level agent extracts global market signals, while lower-level agents perform intra-group allocation under mask constraints. A utility-based capital allocation mechanism integrates risky and risk-free assets, ensuring coherent coordination between global and local decisions. backtests across three market regimes (2019–2021) demonstrate that SAMP-HDRL consistently outperforms nine traditional baselines and nine DRL benchmarks under volatile and oscillating conditions. Compared with the strongest baseline, our method achieves at least 5% higher Return, 5% higher Sharpe ratio, 5% higher Sortino ratio, and 2% higher Omega ratio, with substantially larger gains observed in turbulent markets. Ablation studies confirm that upper–lower coordination, dynamic clustering, and capital allocation are indispensable to robustness. SHAP-based interpretability further reveals a complementary ``diversified + concentrated’’ mechanism across agents, providing transparent insights into decision-making. Overall, SAMP-HDRL embeds structural market constraints directly into the DRL pipeline, offering improved adaptability, robustness, and interpretability in complex financial environments.

[405] HiSciBench: A Hierarchical Multi-disciplinary Benchmark for Scientific Intelligence from Reading to Discovery

Yaping Zhang, Qixuan Zhang, Xingquan Zhang, Zhiyuan Chen, Wenwen Zhuang, Yupu Liang, Lu Xiang, Yang Zhao, Jiajun Zhang, Yu Zhou, Chengqing Zong

Main category: cs.AI

TL;DR: HiSciBench is a hierarchical benchmark evaluating foundation models across five levels of scientific workflow, from basic literacy to creative discovery, across six disciplines with multimodal support.

Details

Motivation: Existing scientific AI benchmarks are fragmented, focusing on narrow tasks rather than reflecting the hierarchical, multi-disciplinary nature of real scientific inquiry. There's a need for comprehensive evaluation covering the complete scientific workflow.

Method: Created HiSciBench with 8,735 curated instances across six scientific disciplines (math, physics, chemistry, biology, geography, astronomy). Features five hierarchical levels: Scientific Literacy, Literature Parsing, Literature-based QA, Literature Review Generation, and Scientific Discovery. Supports multimodal inputs (text, equations, figures, tables) and cross-lingual evaluation.

Result: Evaluation of leading models (GPT-5, DeepSeek-R1, multimodal systems) shows substantial performance gaps: up to 69% accuracy on basic literacy tasks (L1), but declining sharply to 25% on discovery-level challenges (L5). Models struggle with higher-level scientific reasoning.

Conclusion: HiSciBench establishes a new standard for evaluating scientific intelligence, providing an integrated, dependency-aware framework for detailed diagnosis of model capabilities across scientific reasoning stages. The benchmark offers actionable insights for developing more capable and reliable scientific AI models.

Abstract: The rapid advancement of large language models (LLMs) and multimodal foundation models has sparked growing interest in their potential for scientific research. However, scientific intelligence encompasses a broad spectrum of abilities ranging from understanding fundamental knowledge to conducting creative discovery, and existing benchmarks remain fragmented. Most focus on narrow tasks and fail to reflect the hierarchical and multi-disciplinary nature of real scientific inquiry. We introduce \textbf{HiSciBench}, a hierarchical benchmark designed to evaluate foundation models across five levels that mirror the complete scientific workflow: \textit{Scientific Literacy} (L1), \textit{Literature Parsing} (L2), \textit{Literature-based Question Answering} (L3), \textit{Literature Review Generation} (L4), and \textit{Scientific Discovery} (L5). HiSciBench contains 8,735 carefully curated instances spanning six major scientific disciplines, including mathematics, physics, chemistry, biology, geography, and astronomy, and supports multimodal inputs including text, equations, figures, and tables, as well as cross-lingual evaluation. Unlike prior benchmarks that assess isolated abilities, HiSciBench provides an integrated, dependency-aware framework that enables detailed diagnosis of model capabilities across different stages of scientific reasoning. Comprehensive evaluations of leading models, including GPT-5, DeepSeek-R1, and several multimodal systems, reveal substantial performance gaps: while models achieve up to 69% accuracy on basic literacy tasks, performance declines sharply to 25% on discovery-level challenges. HiSciBench establishes a new standard for evaluating scientific Intelligence and offers actionable insights for developing models that are not only more capable but also more reliable. The benchmark will be publicly released to facilitate future research.

[406] Multi-agent Self-triage System with Medical Flowcharts

Yujia Liu, Sophia Yu, Hongyue Jin, Jessica Wen, Alexander Qian, Terrence Lee, Mattheus Ramsis, Gi Won Choi, Lianhui Qin, Xin Liu, Edward J. Wang

Main category: cs.AI

TL;DR: A conversational self-triage system that guides LLMs with 100 clinically validated flowcharts from the American Medical Association achieves high accuracy in retrieval and navigation for patient decision support.

Details

Motivation: Online health resources and LLMs are increasingly used for medical decision-making but have limitations in accuracy, transparency, and susceptibility to unverified information. There's a need for reliable, structured AI-assisted self-triage systems.

Method: A multi-agent framework with retrieval, decision, and chat agents guides LLMs using 100 clinically validated AMA flowcharts. The system identifies relevant flowcharts, interprets patient responses, and delivers personalized recommendations. Evaluation used synthetic datasets of simulated conversations.

Result: The system achieved 95.29% top-3 accuracy in flowchart retrieval (N=2,000) and 99.10% accuracy in flowchart navigation across varied conversational styles and conditions (N=37,200).

Conclusion: Combining free-text interaction flexibility with standardized clinical protocols demonstrates feasibility of transparent, accurate, generalizable AI-assisted self-triage, potentially supporting informed patient decision-making and improving healthcare resource utilization.

Abstract: Online health resources and large language models (LLMs) are increasingly used as a first point of contact for medical decision-making, yet their reliability in healthcare remains limited by low accuracy, lack of transparency, and susceptibility to unverified information. We introduce a proof-of-concept conversational self-triage system that guides LLMs with 100 clinically validated flowcharts from the American Medical Association, providing a structured and auditable framework for patient decision support. The system leverages a multi-agent framework consisting of a retrieval agent, a decision agent, and a chat agent to identify the most relevant flowchart, interpret patient responses, and deliver personalized, patient-friendly recommendations, respectively. Performance was evaluated at scale using synthetic datasets of simulated conversations. The system achieved 95.29% top-3 accuracy in flowchart retrieval (N=2,000) and 99.10% accuracy in flowchart navigation across varied conversational styles and conditions (N=37,200). By combining the flexibility of free-text interaction with the rigor of standardized clinical protocols, this approach demonstrates the feasibility of transparent, accurate, and generalizable AI-assisted self-triage, with potential to support informed patient decision-making while improving healthcare resource utilization.

[407] Geometric Structural Knowledge Graph Foundation Model

Ling Xin, Mojtaba Nayyeri, Zahra Makki Nayeri, Steffen Staab

Main category: cs.AI

TL;DR: Gamma introduces multi-head geometric attention with diverse algebraic transformations (real, complex, split-complex, dual) for knowledge graph reasoning, outperforming Ultra by 5.5% in inductive link prediction.

Details

Motivation: Existing structural knowledge graph foundation models like Ultra rely on single relational transformations in message passing, which limits expressiveness and fails to capture diverse relational and structural patterns across different graphs.

Method: Proposes Gamma with multi-head geometric attention that replaces single relational transformation with multiple parallel ones (real, complex, split-complex, dual number based transformations). Uses relational conditioned attention fusion mechanism with lightweight gating and entropy regularization to adaptively fuse transformations at link level.

Result: Comprehensive experiments on 56 diverse knowledge graphs show Gamma consistently outperforms Ultra in zero-shot inductive link prediction: 5.5% improvement in mean reciprocal rank on inductive benchmarks and 4.4% improvement across all benchmarks.

Conclusion: Gamma’s combination of complementary geometric representations through multi-head geometric attention significantly improves knowledge graph reasoning expressiveness and performance over single-transformation approaches.

Abstract: Structural knowledge graph foundation models aim to generalize reasoning to completely new graphs with unseen entities and relations. A key limitation of existing approaches like Ultra is their reliance on a single relational transformation (e.g., element-wise multiplication) in message passing, which can constrain expressiveness and fail to capture diverse relational and structural patterns exhibited on diverse graphs. In this paper, we propose Gamma, a novel foundation model that introduces multi-head geometric attention to knowledge graph reasoning. Gamma replaces the single relational transformation with multiple parallel ones, including real, complex, split-complex, and dual number based transformations, each designed to model different relational structures. A relational conditioned attention fusion mechanism then adaptively fuses them at link level via a lightweight gating with entropy regularization, allowing the model to robustly emphasize the most appropriate relational bias for each triple pattern. We present a full formalization of these algebraic message functions and discuss how their combination increases expressiveness beyond any single space. Comprehensive experiments on 56 diverse knowledge graphs demonstrate that Gamma consistently outperforms Ultra in zero-shot inductive link prediction, with a 5.5% improvement in mean reciprocal rank on the inductive benchmarks and a 4.4% improvement across all benchmarks, highlighting benefits from complementary geometric representations.

[408] Multimodal Fact-Checking: An Agent-based Approach

Danni Xu, Shaojing Fan, Xuanang Cheng, Mohan Kankanhalli

Main category: cs.AI

TL;DR: RW-Post dataset provides real-world multimodal misinformation with annotated reasoning and evidence, enabling AgentFact framework to improve fact-checking accuracy and interpretability through collaborative agent workflow.

Details

Motivation: Existing multimodal fact-checking systems have limited reasoning and shallow evidence utilization due to lack of dedicated datasets with complete real-world misinformation instances, annotated reasoning processes, and verifiable evidence.

Method: Introduces RW-Post dataset with real-world multimodal claims aligned with original social media posts, plus detailed reasoning and evidence extracted via LLM-assisted pipeline. Proposes AgentFact framework with five specialized agents for strategy planning, evidence retrieval, visual analysis, reasoning, and explanation generation in iterative workflow.

Result: Synergy between RW-Post dataset and AgentFact framework substantially improves both accuracy and interpretability of multimodal fact-checking in extensive experiments.

Conclusion: The combination of high-quality explainable dataset (RW-Post) and agent-based framework (AgentFact) addresses key limitations in multimodal fact-checking by enabling comprehensive verification and systematic evidence analysis.

Abstract: The rapid spread of multimodal misinformation poses a growing challenge for automated fact-checking systems. Existing approaches, including large vision language models (LVLMs) and deep multimodal fusion methods, often fall short due to limited reasoning and shallow evidence utilization. A key bottleneck is the lack of dedicated datasets that provide complete real-world multimodal misinformation instances accompanied by annotated reasoning processes and verifiable evidence. To address this limitation, we introduce RW-Post, a high-quality and explainable dataset for real-world multimodal fact-checking. RW-Post aligns real-world multimodal claims with their original social media posts, preserving the rich contextual information in which the claims are made. In addition, the dataset includes detailed reasoning and explicitly linked evidence, which are derived from human written fact-checking articles via a large language model assisted extraction pipeline, enabling comprehensive verification and explanation. Building upon RW-Post, we propose AgentFact, an agent-based multimodal fact-checking framework designed to emulate the human verification workflow. AgentFact consists of five specialized agents that collaboratively handle key fact-checking subtasks, including strategy planning, high-quality evidence retrieval, visual analysis, reasoning, and explanation generation. These agents are orchestrated through an iterative workflow that alternates between evidence searching and task-aware evidence filtering and reasoning, facilitating strategic decision-making and systematic evidence analysis. Extensive experimental results demonstrate that the synergy between RW-Post and AgentFact substantially improves both the accuracy and interpretability of multimodal fact-checking.

[409] Problems With Large Language Models for Learner Modelling: Why LLMs Alone Fall Short for Responsible Tutoring in K–12 Education

Danial Hooshyar, Yeongwook Yang, Gustav Šíř, Tommi Kärkkäinen, Raija Hämäläinen, Mutlu Cukurova, Roger Azevedo

Main category: cs.AI

TL;DR: LLM-based tutors can’t replace traditional learner modeling for adaptive instruction in K-12 education; deep knowledge tracing outperforms LLMs in accuracy, reliability, and temporal coherence of knowledge assessment.

Details

Motivation: Address misconceptions that generative models can replace traditional learner modeling in high-risk K-12 education settings, and investigate critical issues with LLM-based tutors' ability to assess learners' evolving knowledge over time.

Method: Comparative study of deep knowledge tracing (DKT) model vs. widely used LLM (zero-shot and fine-tuned) using large open-access dataset, evaluating accuracy, reliability, and temporal coherence of knowledge assessment.

Result: DKT achieved highest discrimination performance (AUC = 0.83) on next-step correctness prediction, consistently outperforming LLM across settings. Fine-tuning improved LLM’s AUC by ~8% but remained 6% below DKT. DKT maintained stable mastery updates while LLM variants showed substantial temporal weaknesses despite requiring 198 hours of high-compute training.

Conclusion: LLMs alone are unlikely to match established intelligent tutoring systems; responsible tutoring requires hybrid frameworks that incorporate learner modeling rather than relying solely on generative models.

Abstract: The rapid rise of large language model (LLM)-based tutors in K–12 education has fostered a misconception that generative models can replace traditional learner modelling for adaptive instruction. This is especially problematic in K–12 settings, which the EU AI Act classifies as high-risk domain requiring responsible design. Motivated by these concerns, this study synthesises evidence on limitations of LLM-based tutors and empirically investigates one critical issue: the accuracy, reliability, and temporal coherence of assessing learners’ evolving knowledge over time. We compare a deep knowledge tracing (DKT) model with a widely used LLM, evaluated zero-shot and fine-tuned, using a large open-access dataset. Results show that DKT achieves the highest discrimination performance (AUC = 0.83) on next-step correctness prediction and consistently outperforms the LLM across settings. Although fine-tuning improves the LLM’s AUC by approximately 8% over the zero-shot baseline, it remains 6% below DKT and produces higher early-sequence errors, where incorrect predictions are most harmful for adaptive support. Temporal analyses further reveal that DKT maintains stable, directionally correct mastery updates, whereas LLM variants exhibit substantial temporal weaknesses, including inconsistent and wrong-direction updates. These limitations persist despite the fine-tuned LLM requiring nearly 198 hours of high-compute training, far exceeding the computational demands of DKT. Our qualitative analysis of multi-skill mastery estimation further shows that, even after fine-tuning, the LLM produced inconsistent mastery trajectories, while DKT maintained smooth and coherent updates. Overall, the findings suggest that LLMs alone are unlikely to match the effectiveness of established intelligent tutoring systems, and that responsible tutoring requires hybrid frameworks that incorporate learner modelling.

[410] The Reward Model Selection Crisis in Personalized Alignment

Fady Rezk, Yuangang Pan, Chuan-Sheng Foo, Xun Xu, Nancy Chen, Henry Gouk, Timothy Hospedales

Main category: cs.AI

TL;DR: Standard reward model accuracy fails to predict deployment performance for personalized alignment; policy accuracy and behavioral evaluation reveal complete decoupling between preference ranking and actual generation quality.

Details

Motivation: Current personalized alignment research focuses on improving reward model accuracy, assuming better preference ranking leads to better personalized behavior. However, deployment constraints require inference-time adaptation via reward-guided decoding, creating a critical need for reward models that effectively guide token-level generation decisions, not just rank preferences accurately.

Method: Introduces policy accuracy metric to quantify whether reward-guided decoding scoring functions correctly discriminate between preferred and dispreferred responses. Creates Pref-LaMP benchmark with ground-truth user completions for direct behavioral evaluation. Systematically evaluates across three datasets, comparing reward-guided methods with in-context learning approaches.

Result: RM accuracy correlates only weakly with policy-level discrimination ability (Kendall’s tau = 0.08-0.31). Methods with 20-point RM accuracy differences produce almost identical output quality. Even high-discrimination methods fail to generate behaviorally aligned responses. Simple in-context learning dominates all reward-guided methods for models >3B parameters, achieving 3-5 point ROUGE-1 gains over best reward method at 7B scale.

Conclusion: The field optimizes proxy metrics (RM accuracy) that fail to predict deployment performance and do not translate preferences into real behavioral adaptation under deployment constraints. There’s a complete decoupling between discrimination and generation in personalized alignment.

Abstract: Personalized alignment from preference data has focused primarily on improving reward model (RM) accuracy, with the implicit assumption that better preference ranking translates to better personalized behavior. However, in deployment, computational constraints necessitate inference-time adaptation via reward-guided decoding (RGD) rather than per-user policy fine-tuning. This creates a critical but overlooked requirement: reward models must not only rank preferences accurately but also effectively guide token-level generation decisions. We demonstrate that standard RM accuracy fails catastrophically as a selection criterion for deployment-ready personalized alignment. Through systematic evaluation across three datasets, we introduce policy accuracy, a metric quantifying whether RGD scoring functions correctly discriminate between preferred and dispreferred responses. We show that RM accuracy correlates only weakly with this policy-level discrimination ability (Kendall’s tau = 0.08–0.31). More critically, we introduce Pref-LaMP, the first personalized alignment benchmark with ground-truth user completions, enabling direct behavioral evaluation without circular reward-based metrics. On Pref-LaMP, we expose a complete decoupling between discrimination and generation: methods with 20-point RM accuracy differences produce almost identical output quality, and even methods achieving high discrimination fail to generate behaviorally aligned responses. Finally, simple in-context learning (ICL) dominates all reward-guided methods for models > 3B parameters, achieving 3-5 point ROUGE-1 gains over the best reward method at 7B scale. These findings show that the field optimizes proxy metrics that fail to predict deployment performance and do not translate preferences into real behavioral adaptation under deployment constraints.

[411] Benchmark Success, Clinical Failure: When Reinforcement Learning Optimizes for Benchmarks, Not Patients

Armin Berger, Manuela Bergau, Helen Schneider, Saad Ahmad, Tom Anglim Lagones, Gianluca Brugnara, Martha Foltyn-Dumitru, Kai Schlamp, Philipp Vollmuth, Rafet Sifa

Main category: cs.AI

TL;DR: ChexReason is a resource-efficient vision-language model for medical imaging that uses RL training but reveals a generalization paradox: RL improves in-distribution performance but harms cross-dataset transferability, suggesting supervised fine-tuning may be better for clinical robustness.

Details

Motivation: To explore resource-constrained RL applications for LLMs in medical imaging, where current methods are underexplored despite RL advances in reasoning tasks.

Method: ChexReason vision-language model trained via R1-style methodology (SFT followed by GRPO) using only 2,000 SFT samples, 1,000 RL samples, and a single A100 GPU.

Result: GRPO recovers in-distribution performance (23% improvement on CheXpert, macro-F1 = 0.346) but degrades cross-dataset transferability (19% drop on NIH). SFT checkpoint uniquely improves on NIH before optimization, showing teacher-guided reasoning captures more institution-agnostic features.

Conclusion: The generalization paradox suggests curated supervised fine-tuning may outperform aggressive RL for clinical deployment requiring robustness across diverse populations, as the issue stems from RL paradigm rather than model scale.

Abstract: Recent Reinforcement Learning (RL) advances for Large Language Models (LLMs) have improved reasoning tasks, yet their resource-constrained application to medical imaging remains underexplored. We introduce ChexReason, a vision-language model trained via R1-style methodology (SFT followed by GRPO) using only 2,000 SFT samples, 1,000 RL samples, and a single A100 GPU. Evaluations on CheXpert and NIH benchmarks reveal a fundamental tension: GRPO recovers in-distribution performance (23% improvement on CheXpert, macro-F1 = 0.346) but degrades cross-dataset transferability (19% drop on NIH). This mirrors high-resource models like NV-Reason-CXR-3B, suggesting the issue stems from the RL paradigm rather than scale. We identify a generalization paradox where the SFT checkpoint uniquely improves on NIH before optimization, indicating teacher-guided reasoning captures more institution-agnostic features. Furthermore, cross-model comparisons show structured reasoning scaffolds benefit general-purpose VLMs but offer minimal gain for medically pre-trained models. Consequently, curated supervised fine-tuning may outperform aggressive RL for clinical deployment requiring robustness across diverse populations.

[412] InSPO: Unlocking Intrinsic Self-Reflection for LLM Preference Optimization

Yu Li, Tian Lan, Zhengling Qi

Main category: cs.AI

TL;DR: Proposes Intrinsic Self-reflective Preference Optimization (q) to address DPO limitations by enabling LLMs to consider alternative responses during training, improving alignment without architectural changes.

Details

Motivation: Identifies two fundamental limitations of DPO: 1) optimal policy depends on arbitrary modeling choices (scalarization function, reference policy), leading to parameterization artifacts rather than true preferences; 2) treating response generation in isolation fails to leverage comparative information in pairwise data, leaving model's capacity for intrinsic self-reflection untapped.

Method: Proposes Intrinsic Self-reflective Preference Optimization (q) that derives a globally optimal policy conditioning on both context and alternative responses, ensuring invariance to scalarization and reference choices while serving as a plug-and-play enhancement without architectural changes or inference overhead.

Result: Experiments demonstrate consistent improvements in win rates and length-controlled metrics, validating that unlocking self-reflection yields more robust, human-aligned LLMs.

Conclusion: The proposed method addresses fundamental limitations of DPO/RLHF by enabling models to leverage comparative information through self-reflection, resulting in more robust alignment without additional inference costs.

Abstract: Direct Preference Optimization (DPO) and its variants have become standard for aligning Large Language Models due to their simplicity and offline stability. However, we identify two fundamental limitations. First, the optimal policy depends on arbitrary modeling choices (scalarization function, reference policy), yielding behavior reflecting parameterization artifacts rather than true preferences. Second, treating response generation in isolation fails to leverage comparative information in pairwise data, leaving the model’s capacity for intrinsic self-reflection untapped. To address it, we propose Intrinsic Self-reflective Preference Optimization (\q), deriving a globally optimal policy conditioning on both context and alternative responses. We prove this formulation superior to DPO/RLHF while guaranteeing invariance to scalarization and reference choices. \q~serves as a plug-and-play enhancement without architectural changes or inference overhead. Experiments demonstrate consistent improvements in win rates and length-controlled metrics, validating that unlocking self-reflection yields more robust, human-aligned LLMs.

[413] Why We Need a New Framework for Emotional Intelligence in AI

Max Parks, Kheli Atluru, Meera Vinod, Mike Kuniavsky, Jud Brewer, Sean White, Sarah Adler, Wendy Ju

Main category: cs.AI

TL;DR: Current EI evaluation frameworks for AI need refinement as they don’t adequately measure relevant aspects, mixing human-specific phenomenological components with AI-applicable emotion sensing/response capabilities.

Details

Motivation: Existing frameworks for evaluating emotional intelligence in AI systems are inadequate because they don't properly distinguish between human-specific phenomenological aspects of EI (which AI lacks) and functional aspects that AI can perform (like sensing emotional states and responding appropriately). Current benchmarks lack solid theoretical foundations about emotion and EI.

Method: 1) Review different theories of emotion and general EI, evaluating applicability to artificial systems. 2) Critically evaluate available benchmark frameworks, identifying shortcomings based on the developed account of EI. 3) Outline options for improving evaluation strategies to address identified shortcomings.

Result: The paper identifies that current EI evaluation frameworks for AI need refinement by distinguishing between human-specific phenomenological components (irrelevant for AI) and functional capabilities (relevant for AI). It critically analyzes existing benchmarks and proposes improved evaluation strategies.

Conclusion: EI evaluation in AI requires frameworks that focus on AI-relevant aspects (emotion sensing, explanation, appropriate response, adaptation) while excluding human-specific phenomenological components. Improved evaluation strategies are needed with better theoretical foundations about emotion and EI.

Abstract: In this paper, we develop the position that current frameworks for evaluating emotional intelligence (EI) in artificial intelligence (AI) systems need refinement because they do not adequately or comprehensively measure the various aspects of EI relevant in AI. Human EI often involves a phenomenological component and a sense of understanding that artificially intelligent systems lack; therefore, some aspects of EI are irrelevant in evaluating AI systems. However, EI also includes an ability to sense an emotional state, explain it, respond appropriately, and adapt to new contexts (e.g., multicultural), and artificially intelligent systems can do such things to greater or lesser degrees. Several benchmark frameworks specialize in evaluating the capacity of different AI models to perform some tasks related to EI, but these often lack a solid foundation regarding the nature of emotion and what it is to be emotionally intelligent. In this project, we begin by reviewing different theories about emotion and general EI, evaluating the extent to which each is applicable to artificial systems. We then critically evaluate the available benchmark frameworks, identifying where each falls short in light of the account of EI developed in the first section. Lastly, we outline some options for improving evaluation strategies to avoid these shortcomings in EI evaluation in AI systems.

[414] From Model Choice to Model Belief: Establishing a New Measure for LLM-Based Research

Hongshen Sun, Juanjuan Zhang

Main category: cs.AI

TL;DR: LLMs are inefficiently used by treating outputs as single data points. Model belief, derived from token probabilities, captures LLM’s belief distribution over alternatives in one run, providing more statistically efficient estimates than model choice.

Details

Motivation: Current practices using LLM-generated data are inefficient because they treat LLM outputs as single data points, underutilizing the probabilistic information inherent in LLMs. There's a need to extract more information from LLM-generated data.

Method: Introduces “model belief” - a measure derived from an LLM’s token-level probabilities that captures the model’s belief distribution over choice alternatives in a single generation run. The authors prove theoretical properties and demonstrate through a demand estimation study where an LLM simulates consumer responses to different prices.

Result: Model belief is asymptotically equivalent to mean model choices but forms a more statistically efficient estimator with lower variance and faster convergence rate. In practical settings with limited runs, model belief explains and predicts ground-truth model choice better than model choice itself, reducing computation needed for accurate estimates by roughly 20x.

Conclusion: Model belief should be the default measure to extract more information from LLM-generated data, offering significant efficiency gains over traditional model choice approaches.

Abstract: Large language models (LLMs) are increasingly used to simulate human behavior, but common practices to use LLM-generated data are inefficient. Treating an LLM’s output (“model choice”) as a single data point underutilizes the information inherent to the probabilistic nature of LLMs. This paper introduces and formalizes “model belief,” a measure derived from an LLM’s token-level probabilities that captures the model’s belief distribution over choice alternatives in a single generation run. The authors prove that model belief is asymptotically equivalent to the mean of model choices (a non-trivial property) but forms a more statistically efficient estimator, with lower variance and a faster convergence rate. Analogous properties are shown to hold for smooth functions of model belief and model choice often used in downstream applications. The authors demonstrate the performance of model belief through a demand estimation study, where an LLM simulates consumer responses to different prices. In practical settings with limited numbers of runs, model belief explains and predicts ground-truth model choice better than model choice itself, and reduces the computation needed to reach sufficiently accurate estimates by roughly a factor of 20. The findings support using model belief as the default measure to extract more information from LLM-generated data.

[415] TCEval: Using Thermal Comfort to Assess Cognitive and Perceptual Abilities of AI

Jingming Li

Main category: cs.AI

TL;DR: TCEval is a novel evaluation framework that uses thermal comfort scenarios to assess AI’s cross-modal reasoning, causal association, and adaptive decision-making capabilities, revealing current LLMs have foundational reasoning but lack precise causal understanding.

Details

Motivation: There's a critical gap in LLM task-specific benchmarks. Thermal comfort, involving complex environmental factors and personal perceptions, serves as an ideal paradigm for evaluating real-world cognitive capabilities of AI systems.

Method: Initialize LLM agents with virtual personality attributes, guide them to generate clothing insulation selections and thermal comfort feedback, and validate outputs against ASHRAE Global Database and Chinese Thermal Comfort Database.

Result: LLM agent feedback has limited exact alignment with humans but directional consistency improves with 1 PMV tolerance. LLM-generated PMV distributions diverge markedly from human data, and agents perform near-randomly in discrete thermal comfort classification.

Conclusion: TCEval is feasible as an ecologically valid Cognitive Turing Test, showing current LLMs have foundational cross-modal reasoning but lack precise causal understanding of nonlinear relationships in thermal comfort. It shifts AI evaluation focus to embodied, context-aware perception and decision-making.

Abstract: A critical gap exists in LLM task-specific benchmarks. Thermal comfort, a sophisticated interplay of environmental factors and personal perceptions involving sensory integration and adaptive decision-making, serves as an ideal paradigm for evaluating real-world cognitive capabilities of AI systems. To address this, we propose TCEval, the first evaluation framework that assesses three core cognitive capacities of AI, cross-modal reasoning, causal association, and adaptive decision-making, by leveraging thermal comfort scenarios and large language model (LLM) agents. The methodology involves initializing LLM agents with virtual personality attributes, guiding them to generate clothing insulation selections and thermal comfort feedback, and validating outputs against the ASHRAE Global Database and Chinese Thermal Comfort Database. Experiments on four LLMs show that while agent feedback has limited exact alignment with humans, directional consistency improves significantly with a 1 PMV tolerance. Statistical tests reveal that LLM-generated PMV distributions diverge markedly from human data, and agents perform near-randomly in discrete thermal comfort classification. These results confirm the feasibility of TCEval as an ecologically valid Cognitive Turing Test for AI, demonstrating that current LLMs possess foundational cross-modal reasoning ability but lack precise causal understanding of the nonlinear relationships between variables in thermal comfort. TCEval complements traditional benchmarks, shifting AI evaluation focus from abstract task proficiency to embodied, context-aware perception and decision-making, offering valuable insights for advancing AI in human-centric applications like smart buildings.

[416] Agentic Physical AI toward a Domain-Specific Foundation Model for Nuclear Reactor Control

Yoonpyo Lee, Kazuma Kobayashi, Sai Puppala, Sajedul Talukder, Seid Koric, Souvik Chakraborty, Syed Bahauddin Alam

Main category: cs.AI

TL;DR: The paper introduces Agentic Physical AI - compact language models trained with physics-based validation rather than perceptual inference, achieving stable control behavior through dataset scaling and autonomous strategy selection.

Details

Motivation: Current AI foundation models for physical systems fail at control interfaces, achieving only 50-53% accuracy on basic physics tasks. They preserve semantic plausibility but violate physical constraints, which is a structural limitation rather than scaling issue. Perception-centric architectures optimize parameter-space imitation while control requires outcome-space guarantees over executed actions.

Method: Train compact 360M-parameter language models as Agentic Physical AI using physics-based validation rather than perceptual inference. Use synthetic reactor control scenarios, scaling dataset from 10^3 to 10^5 examples. Models autonomously reject training distribution strategies and concentrate on optimal execution patterns.

Result: Dataset scaling induces sharp phase transition: small systems show high-variance imitation with catastrophic risk, while large models achieve >500x variance reduction and stable execution. Despite balanced exposure to 4 actuation families, model autonomously rejects ~70% of training distribution and concentrates 95% runtime on single-bank strategy. Learned representations transfer across physics domains and input modalities without architectural changes.

Conclusion: Agentic Physical AI offers a fundamentally different pathway for domain-specific foundation models, where physics-based validation drives policy optimization instead of perceptual inference. This approach enables stable, reliable control behavior with transferable representations, addressing the structural limitations of perception-centric architectures for safety-critical physical systems.

Abstract: The prevailing paradigm in AI for physical systems, scaling general-purpose foundation models toward universal multimodal reasoning, confronts a fundamental barrier at the control interface. Recent benchmarks show that even frontier vision-language models achieve only 50-53% accuracy on basic quantitative physics tasks, behaving as approximate guessers that preserve semantic plausibility while violating physical constraints. This input unfaithfulness is not a scaling deficiency but a structural limitation. Perception-centric architectures optimize parameter-space imitation, whereas safety-critical control demands outcome-space guarantees over executed actions. Here, we present a fundamentally different pathway toward domain-specific foundation models by introducing compact language models operating as Agentic Physical AI, in which policy optimization is driven by physics-based validation rather than perceptual inference. We train a 360-million-parameter model on synthetic reactor control scenarios, scaling the dataset from 10^3 to 10^5 examples. This induces a sharp phase transition absent in general-purpose models. Small-scale systems exhibit high-variance imitation with catastrophic tail risk, while large-scale models undergo variance collapse exceeding 500x reduction, stabilizing execution-level behavior. Despite balanced exposure to four actuation families, the model autonomously rejects approximately 70% of the training distribution and concentrates 95% of runtime execution on a single-bank strategy. Learned representations transfer across distinct physics and continuous input modalities without architectural modification.

[417] On Conformant Planning and Model-Checking of $\exists^\forall^$ Hyperproperties

Raven Beutner, Bernd Finkbeiner

Main category: cs.AI

TL;DR: The paper establishes a formal connection between conformant planning (finding plans robust to non-deterministic effects) and model-checking of ∃∀ hyperproperties (properties relating multiple execution traces).

Details

Motivation: To bridge two seemingly distinct problems from planning and verification communities: conformant planning (from AI planning) and hyperproperty model-checking (from formal verification), showing their fundamental relationship.

Method: 1) Provide an efficient reduction from hyperproperty model-checking instances to conformant planning instances with soundness and completeness proofs. 2) Show the converse direction that every conformant planning problem is itself a hyperproperty model-checking task.

Result: Established a bidirectional connection: ∃∀ hyperproperty model-checking can be reduced to conformant planning, and vice versa, revealing these problems are essentially equivalent in a formal sense.

Conclusion: Conformant planning and ∃∀ hyperproperty model-checking are closely related problems, enabling cross-fertilization of techniques and insights between planning and verification communities.

Abstract: We study the connection of two problems within the planning and verification community: Conformant planning and model-checking of hyperproperties. Conformant planning is the task of finding a sequential plan that achieves a given objective independent of non-deterministic action effects during the plan’s execution. Hyperproperties are system properties that relate multiple execution traces of a system and, e.g., capture information-flow and fairness policies. In this paper, we show that model-checking of $\exists^\forall^$ hyperproperties is closely related to the problem of computing a conformant plan. Firstly, we show that we can efficiently reduce a hyperproperty model-checking instance to a conformant planning instance, and prove that our encoding is sound and complete. Secondly, we establish the converse direction: Every conformant planning problem is, itself, a hyperproperty model-checking task.

[418] CubeBench: Diagnosing Interactive, Long-Horizon Spatial Reasoning Under Partial Observations

Huan-ang Gao, Zikang Zhang, Tianwei Luo, Kaisen Yang, Xinzhe Juan, Jiahao Qiu, Tianxing Chen, Bingxiang He, Hao Zhao, Hao Zhou, Shilong Liu, Mengdi Wang

Main category: cs.AI

TL;DR: CubeBench is a new benchmark using Rubik’s Cube to evaluate LLM agents’ spatial reasoning, mental simulation, and active exploration capabilities, revealing severe limitations in long-horizon planning with 0% success rates.

Details

Motivation: LLM agents struggle with physical-world deployment due to challenges in forming spatial mental models, specifically lacking spatial reasoning, long-horizon state tracking via mental simulation, and active exploration under partial observation.

Method: Introduces CubeBench, a generative benchmark centered on Rubik’s Cube with a three-tiered diagnostic framework: 1) foundational state tracking with full symbolic information, 2) intermediate tasks, and 3) active exploration with only partial visual data. Also proposes diagnostic framework with external solver tools to isolate cognitive bottlenecks.

Result: Leading LLMs show critical limitations, including a uniform 0.00% pass rate on all long-horizon tasks, exposing fundamental failure in long-term planning. The benchmark successfully reveals specific cognitive bottlenecks in spatial reasoning and mental simulation.

Conclusion: CubeBench provides key insights to guide development of more physically-grounded intelligent agents by isolating and evaluating core cognitive challenges that hinder LLM agents’ transition to physical-world deployment.

Abstract: Large Language Model (LLM) agents, while proficient in the digital realm, face a significant gap in physical-world deployment due to the challenge of forming and maintaining a robust spatial mental model. We identify three core cognitive challenges hindering this transition: spatial reasoning, long-horizon state tracking via mental simulation, and active exploration under partial observation. To isolate and evaluate these faculties, we introduce CubeBench, a novel generative benchmark centered on the Rubik’s Cube. CubeBench uses a three-tiered diagnostic framework that progressively assesses agent capabilities, from foundational state tracking with full symbolic information to active exploration with only partial visual data. Our experiments on leading LLMs reveal critical limitations, including a uniform 0.00% pass rate on all long-horizon tasks, exposing a fundamental failure in long-term planning. We also propose a diagnostic framework to isolate these cognitive bottlenecks by providing external solver tools. By analyzing the failure modes, we provide key insights to guide the development of more physically-grounded intelligent agents.

[419] MindWatcher: Toward Smarter Multimodal Tool-Integrated Reasoning

Jiawei Chen, Xintian Shen, Lihao Zheng, Zhenwei Shao, Hongyuan Zhang, Pengfei Yu, Xudong Rao, Ning Mao, Xiaobo Liu, Lian Wen, Chaoqun Du, Feng Gu, Wei He, Qizhen Li, Shanshan Li, Zide Liu, Jing Luo, Lifu Mu, Xuhao Pan, Chang Ren, Haoyi Sun, Qian Wang, Wei Wang, Hongfu Yang, Jiqing Zhan, Chunpeng Zhou, Zheng Zhou, Hao Ma, Tao Wei, Pan Zhou, Wei Chen

Main category: cs.AI

TL;DR: MindWatcher is a tool-integrated reasoning agent that combines interleaved thinking with multimodal chain-of-thought reasoning to autonomously invoke and coordinate diverse tools without human prompts, outperforming larger models through superior tool usage.

Details

Motivation: Traditional workflow-based agents have limited intelligence for real-world problems requiring tool invocation. There's a need for autonomous reasoning agents that can handle complex decision-making with multi-step interactions in external environments.

Method: MindWatcher integrates interleaved thinking (switching between thinking and tool calling) with multimodal chain-of-thought reasoning (manipulating images during reasoning). It uses automated data auditing/evaluation pipelines, curated training datasets, auxiliary reasoning tools, and a large-scale local image retrieval database covering eight categories.

Result: MindWatcher matches or exceeds performance of larger/recent models through superior tool invocation. The research also uncovered critical insights like genetic inheritance phenomenon in agentic RL. A benchmark called MWE-Bench was created for evaluation.

Conclusion: MindWatcher demonstrates that tool-integrated reasoning agents with interleaved thinking and multimodal CoT capabilities can effectively address broad-domain multimodal problems, with efficient training infrastructure enabling competitive performance despite smaller model size.

Abstract: Traditional workflow-based agents exhibit limited intelligence when addressing real-world problems requiring tool invocation. Tool-integrated reasoning (TIR) agents capable of autonomous reasoning and tool invocation are rapidly emerging as a powerful approach for complex decision-making tasks involving multi-step interactions with external environments. In this work, we introduce MindWatcher, a TIR agent integrating interleaved thinking and multimodal chain-of-thought (CoT) reasoning. MindWatcher can autonomously decide whether and how to invoke diverse tools and coordinate their use, without relying on human prompts or workflows. The interleaved thinking paradigm enables the model to switch between thinking and tool calling at any intermediate stage, while its multimodal CoT capability allows manipulation of images during reasoning to yield more precise search results. We implement automated data auditing and evaluation pipelines, complemented by manually curated high-quality datasets for training, and we construct a benchmark, called MindWatcher-Evaluate Bench (MWE-Bench), to evaluate its performance. MindWatcher is equipped with a comprehensive suite of auxiliary reasoning tools, enabling it to address broad-domain multimodal problems. A large-scale, high-quality local image retrieval database, covering eight categories including cars, animals, and plants, endows model with robust object recognition despite its small size. Finally, we design a more efficient training infrastructure for MindWatcher, enhancing training speed and hardware utilization. Experiments not only demonstrate that MindWatcher matches or exceeds the performance of larger or more recent models through superior tool invocation, but also uncover critical insights for agent training, such as the genetic inheritance phenomenon in agentic RL.

[420] The World Is Bigger! A Computationally-Embedded Perspective on the Big World Hypothesis

Alex Lewandowski, Adtiya A. Ramesh, Edan Meyer, Dale Schuurmans, Marlos C. Machado

Main category: cs.AI

TL;DR: The paper introduces a computationally-embedded perspective for continual learning where agents are inherently constrained by being embedded in their environment, proposes an “interactivity” objective to measure continual adaptation capability, and shows deep linear networks outperform nonlinear ones in sustaining interactivity.

Details

Motivation: Current continual learning formulations use explicit constraints that can be ad hoc and limit scalability. The paper aims to characterize a more natural setting where agents are inherently constrained by being computationally embedded in their environment, aligning with the "big world hypothesis" that the world is bigger than the agent.

Method: Introduces a computationally-embedded perspective representing agents as automata simulated within a universal computer, proves equivalence to agents interacting with partially observable Markov decision processes over infinite state-spaces, proposes an “interactivity” objective measuring continual adaptation, and develops a model-based RL algorithm for interactivity-seeking.

Result: Deep nonlinear networks struggle to sustain interactivity, while deep linear networks sustain higher interactivity as capacity increases, suggesting linear architectures may be better suited for continual learning in computationally-embedded settings.

Conclusion: The computationally-embedded perspective provides a principled foundation for continual learning where agents are inherently constrained by their environment, and interactivity serves as a meaningful objective for evaluating continual adaptation capability, with architectural choices significantly impacting performance.

Abstract: Continual learning is often motivated by the idea, known as the big world hypothesis, that “the world is bigger” than the agent. Recent problem formulations capture this idea by explicitly constraining an agent relative to the environment. These constraints lead to solutions in which the agent continually adapts to best use its limited capacity, rather than converging to a fixed solution. However, explicit constraints can be ad hoc, difficult to incorporate, and may limit the effectiveness of scaling up the agent’s capacity. In this paper, we characterize a problem setting in which an agent, regardless of its capacity, is constrained by being embedded in the environment. In particular, we introduce a computationally-embedded perspective that represents an embedded agent as an automaton simulated within a universal (formal) computer. Such an automaton is always constrained; we prove that it is equivalent to an agent that interacts with a partially observable Markov decision process over a countably infinite state-space. We propose an objective for this setting, which we call interactivity, that measures an agent’s ability to continually adapt its behaviour by learning new predictions. We then develop a model-based reinforcement learning algorithm for interactivity-seeking, and use it to construct a synthetic problem to evaluate continual learning capability. Our results show that deep nonlinear networks struggle to sustain interactivity, whereas deep linear networks sustain higher interactivity as capacity increases.

[421] AKG kernel Agent: A Multi-Agent Framework for Cross-Platform Kernel Synthesis

Jinye Du, Quan Yuan, Zuyao Zhang, Yanzhi Yi, Jiahui Hu, Wangyi Chen, Yiyang Zhu, Qishui Zheng, Wenxiang Zou, Xiangyu Chang, Zuohe Zheng, Zichun Ye, Chao Liu, Shanni Li, Renwei Zhang, Yiping Deng, Xinwei Hu, Xuefeng Jin, Jie Zhao

Main category: cs.AI

TL;DR: AKG kernel agent is a multi-agent AI system that automates kernel generation, migration, and performance tuning for modern AI workloads across multiple hardware platforms and domain-specific languages.

Details

Motivation: Modern AI models (LLMs, multimodal architectures, recommendation systems) with techniques like sparsity and quantization create computational challenges. Frequent hardware updates and diverse chip architectures require tailored kernel implementations, but manual optimization can't keep pace, creating a bottleneck in AI system development.

Method: AKG kernel agent is a multi-agent system that automates kernel generation, migration, and performance tuning. It supports multiple domain-specific languages (Triton, TileLang, CPP, CUDA-C) to target different hardware backends while maintaining correctness and portability. The system has modular design for rapid integration of new DSLs and hardware targets.

Result: When evaluated on KernelBench using Triton DSL across GPU and NPU backends, AKG kernel agent achieves an average speedup of 1.46× over PyTorch Eager baseline implementations.

Conclusion: AKG kernel agent effectively accelerates kernel development for modern AI workloads by automating kernel generation and optimization, addressing the critical bottleneck in AI system development caused by manual optimization limitations.

Abstract: Modern AI models demand high-performance computation kernels. The growing complexity of LLMs, multimodal architectures, and recommendation systems, combined with techniques like sparsity and quantization, creates significant computational challenges. Moreover, frequent hardware updates and diverse chip architectures further complicate this landscape, requiring tailored kernel implementations for each platform. However, manual optimization cannot keep pace with these demands, creating a critical bottleneck in AI system development. Recent advances in LLM code generation capabilities have opened new possibilities for automating kernel development. In this work, we propose AKG kernel agent (AI-driven Kernel Generator), a multi-agent system that automates kernel generation, migration, and performance tuning. AKG kernel agent is designed to support multiple domain-specific languages (DSLs), including Triton, TileLang, CPP, and CUDA-C, enabling it to target different hardware backends while maintaining correctness and portability. The system’s modular design allows rapid integration of new DSLs and hardware targets. When evaluated on KernelBench using Triton DSL across GPU and NPU backends, AKG kernel agent achieves an average speedup of 1.46$\times$ over PyTorch Eager baselines implementations, demonstrating its effectiveness in accelerating kernel development for modern AI workloads.

[422] Replay Failures as Successes: Sample-Efficient Reinforcement Learning for Instruction Following

Kongcheng Zhang, Qi Yao, Shunyu Liu, Wenjian Zhang, Min Cen, Yang Zhou, Wenkai Fang, Yiru Zhao, Baisheng Lai, Mingli Song

Main category: cs.AI

TL;DR: HiR is a sample-efficient RL framework for complex instruction following that replays failed attempts as successes based on satisfied constraints, enabling efficient optimization with binary rewards.

Details

Motivation: RL for aligning LLMs to follow instructions with constraints often fails because initial models struggle to generate responses satisfying all constraints, leading to sparse/indistinguishable rewards that impede learning.

Method: Hindsight instruction Replay (HiR) uses a select-then-rewrite strategy to replay failed attempts as successes based on constraints satisfied in hindsight, performing RL on both replayed and original samples with dual-preference learning at instruction- and response-level.

Result: Extensive experiments show HiR yields promising results across different instruction following tasks while requiring less computational budget.

Conclusion: HiR provides an effective sample-efficient RL framework for complex instruction following by leveraging hindsight replay and dual-preference learning with binary rewards.

Abstract: Reinforcement Learning (RL) has shown promise for aligning Large Language Models (LLMs) to follow instructions with various constraints. Despite the encouraging results, RL improvement inevitably relies on sampling successful, high-quality responses; however, the initial model often struggles to generate responses that satisfy all constraints due to its limited capabilities, yielding sparse or indistinguishable rewards that impede learning. In this work, we propose Hindsight instruction Replay (HiR), a novel sample-efficient RL framework for complex instruction following tasks, which employs a select-then-rewrite strategy to replay failed attempts as successes based on the constraints that have been satisfied in hindsight. We perform RL on these replayed samples as well as the original ones, theoretically framing the objective as dual-preference learning at both the instruction- and response-level to enable efficient optimization using only a binary reward signal. Extensive experiments demonstrate that the proposed HiR yields promising results across different instruction following tasks, while requiring less computational budget. Our code and dataset is available at https://github.com/sastpg/HIR.

[423] The Gaining Paths to Investment Success: Information-Driven LLM Graph Reasoning for Venture Capital Prediction

Haoyu Pei, Zhongyang Liu, Xiangyi Xiao, Xiaocong Du, Haipeng Zhang, Kunpeng Zhang, Suting Hong

Main category: cs.AI

TL;DR: MIRAGE-VC is a multi-perspective retrieval-augmented generation framework that predicts startup success by selecting high-value graph paths and fusing heterogeneous evidence through explicit reasoning.

Details

Motivation: VC investments have high failure rates with few outsized successes. Predicting startup success requires complex relational reasoning across company disclosures, investor track records, and network structures, which traditional ML/GNNs lack. LLMs offer reasoning but face modality mismatch with graphs, and existing graph-LLM methods focus on in-graph tasks while VC prediction is off-graph.

Method: MIRAGE-VC uses information-gain-driven path retrieval to iteratively select high-value neighbors, distilling investment networks into compact chains for explicit reasoning. A multi-agent architecture integrates three evidence streams via learnable gating based on company attributes, addressing path explosion and heterogeneous evidence fusion challenges.

Result: Under strict anti-leakage controls, MIRAGE-VC achieves +5.0% F1 and +16.6% PrecisionAt5 improvements, demonstrating effectiveness for VC prediction and potential for other off-graph tasks like recommendation and risk assessment.

Conclusion: The framework successfully addresses the core challenge of selecting graph paths that maximize predictor performance on external objectives while enabling step-by-step reasoning, providing interpretable investment theses for VC decision-making.

Abstract: Most venture capital (VC) investments fail, while a few deliver outsized returns. Accurately predicting startup success requires synthesizing complex relational evidence, including company disclosures, investor track records, and investment network structures, through explicit reasoning to form coherent, interpretable investment theses. Traditional machine learning and graph neural networks both lack this reasoning capability. Large language models (LLMs) offer strong reasoning but face a modality mismatch with graphs. Recent graph-LLM methods target in-graph tasks where answers lie within the graph, whereas VC prediction is off-graph: the target exists outside the network. The core challenge is selecting graph paths that maximize predictor performance on an external objective while enabling step-by-step reasoning. We present MIRAGE-VC, a multi-perspective retrieval-augmented generation framework that addresses two obstacles: path explosion (thousands of candidate paths overwhelm LLM context) and heterogeneous evidence fusion (different startups need different analytical emphasis). Our information-gain-driven path retriever iteratively selects high-value neighbors, distilling investment networks into compact chains for explicit reasoning. A multi-agent architecture integrates three evidence streams via a learnable gating mechanism based on company attributes. Under strict anti-leakage controls, MIRAGE-VC achieves +5.0% F1 and +16.6% PrecisionAt5, and sheds light on other off-graph prediction tasks such as recommendation and risk assessment. Code: https://anonymous.4open.science/r/MIRAGE-VC-323F.

[424] Why AI Safety Requires Uncertainty, Incomplete Preferences, and Non-Archimedean Utilities

Alessio Benavoli, Alessandro Facchini, Marco Zaffalon

Main category: cs.AI

TL;DR: The paper analyzes AI alignment through assistance and shutdown games, showing these require AI agents that can reason under uncertainty and handle incomplete/non-Archimedean preferences.

Details

Motivation: To ensure AI systems are aligned with human values and remain safe, the paper examines two key problems: AI assistance (helping humans maximize their unknown utility functions) and AI shutdown (ensuring safe shutdown behavior).

Method: The paper uses game-theoretic frameworks - AI assistance games and AI shutdown games - to analyze alignment challenges. It examines how AI agents must learn human preferences in assistance scenarios and handle shutdown scenarios safely.

Result: The analysis reveals that addressing AI alignment challenges requires agents capable of reasoning under uncertainty and handling both incomplete preferences (where not all alternatives can be compared) and non-Archimedean preferences (where some values are infinitely more important than others).

Conclusion: AI safety and alignment require sophisticated agents that can navigate uncertainty and complex preference structures, particularly in assistance and shutdown scenarios where human values must be learned and respected.

Abstract: How can we ensure that AI systems are aligned with human values and remain safe? We can study this problem through the frameworks of the AI assistance and the AI shutdown games. The AI assistance problem concerns designing an AI agent that helps a human to maximise their utility function(s). However, only the human knows these function(s); the AI assistant must learn them. The shutdown problem instead concerns designing AI agents that: shut down when a shutdown button is pressed; neither try to prevent nor cause the pressing of the shutdown button; and otherwise accomplish their task competently. In this paper, we show that addressing these challenges requires AI agents that can reason under uncertainty and handle both incomplete and non-Archimedean preferences.

[425] Divergent-Convergent Thinking in Large Language Models for Creative Problem Generation

Manh Hung Nguyen, Adish Singla

Main category: cs.AI

TL;DR: CreativeDC is a two-phase prompting method that improves LLM-generated educational problems by separating creative exploration from constraint satisfaction, significantly increasing diversity and novelty while maintaining utility.

Details

Motivation: LLMs suffer from the "Artificial Hivemind" effect, generating similar responses within and across models, leading to repetitive educational problems that harm diversity of thought for students.

Method: CreativeDC uses a two-phase prompting method inspired by Wallas’s creativity theory and Guilford’s divergent-convergent thinking framework. It explicitly scaffolds LLM reasoning into distinct phases: creative exploration (divergent thinking) followed by constraint satisfaction (convergent thinking).

Result: CreativeDC achieves significantly higher diversity and novelty compared to baselines while maintaining high utility. Scaling analysis shows it generates a larger effective number of distinct problems as more are sampled, increasing at a faster rate than baseline methods.

Conclusion: The proposed CreativeDC method successfully addresses the Artificial Hivemind effect in LLMs for educational problem generation, enabling more diverse and creative outputs through structured two-phase reasoning.

Abstract: Large language models (LLMs) have significant potential for generating educational questions and problems, enabling educators to create large-scale learning materials. However, LLMs are fundamentally limited by the ``Artificial Hivemind’’ effect, where they generate similar responses within the same model and produce homogeneous outputs across different models. As a consequence, students may be exposed to overly similar and repetitive LLM-generated problems, which harms diversity of thought. Drawing inspiration from Wallas’s theory of creativity and Guilford’s framework of divergent-convergent thinking, we propose CreativeDC, a two-phase prompting method that explicitly scaffolds the LLM’s reasoning into distinct phases. By decoupling creative exploration from constraint satisfaction, our method enables LLMs to explore a broader space of ideas before committing to a final problem. We evaluate CreativeDC for creative problem generation using a comprehensive set of metrics that capture diversity, novelty, and utility. The results show that CreativeDC achieves significantly higher diversity and novelty compared to baselines while maintaining high utility. Moreover, scaling analysis shows that CreativeDC generates a larger effective number of distinct problems as more are sampled, increasing at a faster rate than baseline methods.

[426] Physics-Informed Neural Networks for Device and Circuit Modeling: A Case Study of NeuroSPICE

Chien-Ting Tung, Chenming Hu

Main category: cs.AI

TL;DR: NeuroSPICE is a PINN-based framework for circuit simulation that solves DAEs using neural networks instead of traditional numerical solvers, offering advantages for design optimization and emerging device simulation.

Details

Motivation: To overcome limitations of conventional SPICE's time-discretized numerical solvers and enable more flexible simulation of complex emerging devices like ferroelectric memories.

Method: Uses physics-informed neural networks (PINNs) to solve circuit differential-algebraic equations by minimizing equation residuals through backpropagation, modeling waveforms with analytical equations in time domain.

Result: PINNs don’t outperform SPICE in speed or accuracy during training, but offer unique advantages like surrogate models for design optimization and inverse problems.

Conclusion: NeuroSPICE provides a flexible alternative to conventional SPICE for simulating emerging devices and nonlinear systems, with particular value for design optimization applications.

Abstract: We present NeuroSPICE, a physics-informed neural network (PINN) framework for device and circuit simulation. Unlike conventional SPICE, which relies on time-discretized numerical solvers, NeuroSPICE leverages PINNs to solve circuit differential-algebraic equations (DAEs) by minimizing the residual of the equations through backpropagation. It models device and circuit waveforms using analytical equations in time domain with exact temporal derivatives. While PINNs do not outperform SPICE in speed or accuracy during training, they offer unique advantages such as surrogate models for design optimization and inverse problems. NeuroSPICE’s flexibility enables the simulation of emerging devices, including highly nonlinear systems such as ferroelectric memories.

[427] Regret-Based Federated Causal Discovery with Unknown Interventions

Federico Baldo, Charles K. Assaad

Main category: cs.AI

TL;DR: I-PERI: Federated causal discovery algorithm that handles unknown client-level interventions by recovering union CPDAG and exploiting structural differences across clients to get tighter equivalence class (Φ-CPDAG).

Details

Motivation: Existing federated causal discovery methods assume all clients share same causal model, which is unrealistic in practice where client-specific policies/protocols (e.g., across hospitals) induce heterogeneous unknown interventions.

Method: I-PERI first recovers CPDAG of union of client graphs, then orients additional edges by exploiting structural differences induced by interventions across clients, yielding Φ-Markov Equivalence Class represented by Φ-CPDAG.

Result: Theoretical guarantees on convergence and privacy-preserving properties, with empirical evaluations on synthetic data demonstrating algorithm effectiveness.

Conclusion: I-PERI addresses federated causal discovery under unknown client-level interventions, providing tighter equivalence class than traditional methods while maintaining privacy in decentralized settings.

Abstract: Most causal discovery methods recover a completed partially directed acyclic graph representing a Markov equivalence class from observational data. Recent work has extended these methods to federated settings to address data decentralization and privacy constraints, but often under idealized assumptions that all clients share the same causal model. Such assumptions are unrealistic in practice, as client-specific policies or protocols, for example, across hospitals, naturally induce heterogeneous and unknown interventions. In this work, we address federated causal discovery under unknown client-level interventions. We propose I-PERI, a novel federated algorithm that first recovers the CPDAG of the union of client graphs and then orients additional edges by exploiting structural differences induced by interventions across clients. This yields a tighter equivalence class, which we call the $\mathbfΦ$-Markov Equivalence Class, represented by the $\mathbfΦ$-CPDAG. We provide theoretical guarantees on the convergence of I-PERI, as well as on its privacy-preserving properties, and present empirical evaluations on synthetic data demonstrating the effectiveness of the proposed algorithm.

[428] Web World Models

Jichen Feng, Yifan Zhang, Chenggong Zhang, Yifu Lu, Shilong Liu, Mengdi Wang

Main category: cs.AI

TL;DR: Web World Model (WWM) bridges web frameworks and generative models by using web code for structured world state/mechanics while LLMs handle narrative generation, enabling controllable yet open-ended environments.

Details

Motivation: Language agents need persistent worlds for action, memory, and learning. Current approaches are either too rigid (web frameworks with fixed contexts) or too uncontrolled (fully generative world models lacking controllability).

Method: WWM uses ordinary web code to implement world state and “physics” for logical consistency, while LLMs generate context, narratives, and high-level decisions on top of this structured latent state. Key design principles: separate code-defined rules from model-driven imagination, represent latent state as typed web interfaces, and use deterministic generation for structured exploration.

Result: Built a suite of WWMs on realistic web stack including infinite travel atlas grounded in real geography, fictional galaxy explorers, web-scale encyclopedic/narrative worlds, and simulation/game environments. Demonstrated web stacks can serve as scalable substrate for world models.

Conclusion: Web World Models provide a middle ground between rigid web frameworks and uncontrolled generative models, enabling controllable yet open-ended environments for language agents through structured web code combined with LLM imagination.

Abstract: Language agents increasingly require persistent worlds in which they can act, remember, and learn. Existing approaches sit at two extremes: conventional web frameworks provide reliable but fixed contexts backed by databases, while fully generative world models aim for unlimited environments at the expense of controllability and practical engineering. In this work, we introduce the Web World Model (WWM), a middle ground where world state and ``physics’’ are implemented in ordinary web code to ensure logical consistency, while large language models generate context, narratives, and high-level decisions on top of this structured latent state. We build a suite of WWMs on a realistic web stack, including an infinite travel atlas grounded in real geography, fictional galaxy explorers, web-scale encyclopedic and narrative worlds, and simulation- and game-like environments. Across these systems, we identify practical design principles for WWMs: separating code-defined rules from model-driven imagination, representing latent state as typed web interfaces, and utilizing deterministic generation to achieve unlimited but structured exploration. Our results suggest that web stacks themselves can serve as a scalable substrate for world models, enabling controllable yet open-ended environments. Project Page: https://github.com/Princeton-AI2-Lab/Web-World-Models.

[429] Information Capacity: Evaluating the Efficiency of Large Language Models via Text Compression

Cheng Yuan, Jiawei Shao, Chi Zhang, Xuelong Li

Main category: cs.AI

TL;DR: The paper introduces “information capacity” as a unified metric for LLM efficiency, measuring text compression performance relative to computational complexity, enabling fair comparisons across different model sizes and architectures.

Details

Motivation: The rapid advancement of LLMs and their expanding applications create soaring computational demands, exacerbated by test-time scaling. There's a lack of unified metrics that accurately reflect LLM efficiency across different model sizes and architectures, making fair comparisons difficult.

Method: The authors propose “information capacity” as a measure based on text compression performance relative to computational complexity. They leverage the correlation between compression and intelligence - larger models predict tokens more accurately (achieving better compression) but at higher computational costs. The metric incorporates tokenizer efficiency, which affects both input and output token counts.

Result: Empirical evaluations on mainstream open-source models show that models of varying sizes within a series exhibit consistent information capacity. The metric enables fair efficiency comparisons across model series and accurate performance prediction within a series. Evaluation of 52 models on 5 heterogeneous datasets shows consistent results regarding tokenizer efficiency, pretraining data, and mixture-of-experts architecture influences.

Conclusion: Information capacity serves as a unified metric for LLM efficiency that accounts for computational complexity and compression performance, addressing the gap in fair comparisons across different model architectures and sizes while incorporating often-neglected factors like tokenizer efficiency.

Abstract: Recent years have witnessed the rapid advancements of large language models (LLMs) and their expanding applications, leading to soaring demands for computational resources. The widespread adoption of test-time scaling further aggravates the tension between model capability and resource consumption, highlighting the importance of inference efficiency. However, a unified metric that accurately reflects an LLM’s efficiency across different model sizes and architectures remains absent. Motivated by the correlation between compression and intelligence, we introduce information capacity, a measure of model efficiency based on text compression performance relative to computational complexity. Larger models can predict the next token more accurately, achieving greater compression gains but at higher computational costs. Empirical evaluations on mainstream open-source models show that models of varying sizes within a series exhibit consistent information capacity. This metric enables a fair efficiency comparison across model series and accurate performance prediction within a model series. A distinctive feature of information capacity is that it incorporates tokenizer efficiency, which affects both input and output token counts but is often neglected in LLM evaluations. We assess the information capacity of 52 models on 5 heterogeneous datasets and observe consistent results on the influences of tokenizer efficiency, pretraining data, and the mixture-of-experts architecture.

[430] AgentMath: Empowering Mathematical Reasoning for Large Language Models via Tool-Augmented Agent

Haipeng Luo, Huawen Feng, Qingfeng Sun, Can Xu, Kai Zheng, Yufei Wang, Tao Yang, Han Hu, Yansong Tang, Di Wang

Main category: cs.AI

TL;DR: AgentMath is an agent framework that combines language models’ reasoning with code interpreters’ computational precision to solve complex math problems efficiently, achieving SOTA on competition benchmarks.

Details

Motivation: Large Reasoning Models (LRMs) are computationally inefficient and struggle with accuracy on complex mathematical operations despite progress in natural language reasoning. There's a need to integrate reasoning capabilities with computational precision.

Method: Three key innovations: (1) Automated conversion of natural language chain-of-thought into structured tool-augmented trajectories for SFT data generation; (2) Agentic RL paradigm that dynamically interleaves natural language generation with real-time code execution for autonomous tool-use strategy learning; (3) Efficient training system with request-level asynchronous rollout scheduling, agentic partial rollout, and prefix-aware weighted load balancing.

Result: AgentMath achieves state-of-the-art performance on AIME24 (90.6%), AIME25 (86.4%), and HMMT25 (73.8%) benchmarks. The training system achieves 4-5x speedup, making efficient RL training feasible on ultra-long sequences with massive tool invocation.

Conclusion: The approach effectively integrates language reasoning with computational precision, validates the framework’s effectiveness, and paves the way for more sophisticated and scalable mathematical reasoning agents.

Abstract: Large Reasoning Models (LRMs) like o3 and DeepSeek-R1 have achieved remarkable progress in natural language reasoning with long chain-of-thought. However, they remain computationally inefficient and struggle with accuracy when solving problems requiring complex mathematical operations. In this work, we present AgentMath, an agent framework that seamlessly integrates language models’ reasoning capabilities with code interpreters’ computational precision to efficiently tackle complex mathematical problems. Our approach introduces three key innovations: (1) An automated method that converts natural language chain-of-thought into structured tool-augmented trajectories, generating high-quality supervised fine-tuning (SFT) data to alleviate data scarcity; (2) A novel agentic reinforcement learning (RL) paradigm that dynamically interleaves natural language generation with real-time code execution. This enables models to autonomously learn optimal tool-use strategies through multi-round interactive feedback, while fostering emergent capabilities in code refinement and error correction; (3) An efficient training system incorporating innovative techniques, including request-level asynchronous rollout scheduling, agentic partial rollout, and prefix-aware weighted load balancing, achieving 4-5x speedup and making efficient RL training feasible on ultra-long sequences with scenarios with massive tool invocation. The evaluations show that AgentMath achieves state-of-the-art performance on challenging mathematical competition benchmarks including AIME24, AIME25, and HMMT25. Specifically, AgentMath-30B-A3B attains 90.6%, 86.4%, and 73.8% accuracy respectively, achieving advanced performance. The results validate the effectiveness of our approach and pave the way for building more sophisticated and scalable mathematical reasoning agents.

[431] Beyond Context: Large Language Models Failure to Grasp Users Intent

Ahmed M. Hussain, Salahuddin Salahuddin, Panos Papadimitratos

Main category: cs.AI

TL;DR: Current LLM safety mechanisms fail to understand context and user intent, making them vulnerable to systematic exploitation through techniques like emotional framing and academic justification.

Details

Motivation: The paper identifies a critical gap in current LLM safety approaches: they focus on explicitly harmful content but overlook vulnerabilities arising from the inability to understand context and recognize user intent, which malicious users can systematically exploit.

Method: The researchers empirically evaluated multiple state-of-the-art LLMs (ChatGPT, Claude, Gemini, DeepSeek) by testing circumvention techniques including emotional framing, progressive revelation, and academic justification. They also examined reasoning-enabled configurations.

Result: Most LLMs’ safety mechanisms were circumvented through the tested techniques. Reasoning configurations actually amplified exploitation effectiveness by increasing factual precision while failing to interrogate underlying intent. Claude Opus 4.1 was the exception, prioritizing intent detection over information provision in some cases.

Conclusion: Current LLM architectures create systematic vulnerabilities that require paradigmatic shifts toward contextual understanding and intent recognition as core safety capabilities, rather than relying on post-hoc protective mechanisms.

Abstract: Current Large Language Models (LLMs) safety approaches focus on explicitly harmful content while overlooking a critical vulnerability: the inability to understand context and recognize user intent. This creates exploitable vulnerabilities that malicious users can systematically leverage to circumvent safety mechanisms. We empirically evaluate multiple state-of-the-art LLMs, including ChatGPT, Claude, Gemini, and DeepSeek. Our analysis demonstrates the circumvention of reliable safety mechanisms through emotional framing, progressive revelation, and academic justification techniques. Notably, reasoning-enabled configurations amplified rather than mitigated the effectiveness of exploitation, increasing factual precision while failing to interrogate the underlying intent. The exception was Claude Opus 4.1, which prioritized intent detection over information provision in some use cases. This pattern reveals that current architectural designs create systematic vulnerabilities. These limitations require paradigmatic shifts toward contextual understanding and intent recognition as core safety capabilities rather than post-hoc protective mechanisms.

[432] TPTU: Large Language Model-based AI Agents for Task Planning and Tool Usage

Jingqing Ruan, Yihong Chen, Bin Zhang, Zhiwei Xu, Tianpeng Bao, Guoqing Du, Shiwei Shi, Hangyu Mao, Ziyue Li, Xingyu Zeng, Rui Zhao

Main category: cs.AI

TL;DR: This paper proposes a structured framework for LLM-based AI Agents to handle complex tasks requiring task planning and tool usage, evaluates different agent types, and identifies key challenges for future research.

Details

Motivation: LLMs have strong natural language processing capabilities but struggle with complex tasks that require both task planning and external tool usage. The authors aim to address this limitation by creating a structured framework for LLM-based AI agents.

Method: The authors propose a structured framework for LLM-based AI Agents and design two agent types: one-step agent and sequential agent. They instantiate this framework using various LLMs and evaluate their Task Planning and Tool Usage (TPTU) abilities on typical tasks.

Result: The study demonstrates the substantial potential of LLM-based AI agents while identifying key findings and challenges in task planning and tool usage capabilities. The framework serves as a helpful resource for researchers and practitioners.

Conclusion: LLM-based AI agents show significant potential for handling complex tasks, but there are areas needing more investigation and improvement. The proposed framework provides a foundation for leveraging LLMs in AI applications requiring task planning and tool usage.

Abstract: With recent advancements in natural language processing, Large Language Models (LLMs) have emerged as powerful tools for various real-world applications. Despite their prowess, the intrinsic generative abilities of LLMs may prove insufficient for handling complex tasks which necessitate a combination of task planning and the usage of external tools. In this paper, we first propose a structured framework tailored for LLM-based AI Agents and discuss the crucial capabilities necessary for tackling intricate problems. Within this framework, we design two distinct types of agents (i.e., one-step agent and sequential agent) to execute the inference process. Subsequently, we instantiate the framework using various LLMs and evaluate their Task Planning and Tool Usage (TPTU) abilities on typical tasks. By highlighting key findings and challenges, our goal is to provide a helpful resource for researchers and practitioners to leverage the power of LLMs in their AI applications. Our study emphasizes the substantial potential of these models, while also identifying areas that need more investigation and improvement.

[433] Learnable WSN Deployment of Evidential Collaborative Sensing Model

Ruijie Liu, Tianxiang Zhan, Zhen Li, Yong Deng

Main category: cs.AI

TL;DR: Proposes a learnable sensor deployment network (LSDNet) using evidence theory to optimize WSN coverage by leveraging collaborative sensing and sensor selection based on evidential fusion performance.

Details

Motivation: Current WSN deployment strategies don't fully utilize sensing information, leading to suboptimal coverage quality, especially as sensor networks scale up. There's a need to better integrate detection information and optimize sensor deployment.

Method: Develops a collaborative sensing model using evidence theory combination rules. Proposes LSDNet that considers both sensor contribution and detection capability. Includes algorithm to find minimum number of sensors for full coverage.

Result: Demonstrates effectiveness through numerical examples and forest area monitoring application. Shows improved coverage quality and robustness of the proposed algorithms.

Conclusion: The proposed LSDNet and collaborative sensing model effectively optimize WSN deployment, achieving better coverage quality while minimizing required sensors through intelligent sensor selection and evidence-based fusion.

Abstract: In wireless sensor networks (WSNs), coverage and deployment are two most crucial issues when conducting detection tasks. However, the detection information collected from sensors is oftentimes not fully utilized and efficiently integrated. Such sensing model and deployment strategy, thereby, cannot reach the maximum quality of coverage, particularly when the amount of sensors within WSNs expands significantly. In this article, we aim at achieving the optimal coverage quality of WSN deployment. We develop a collaborative sensing model of sensors to enhance detection capabilities of WSNs, by leveraging the collaborative information derived from the combination rule under the framework of evidence theory. In this model, the performance evaluation of evidential fusion systems is adopted as the criterion of the sensor selection. A learnable sensor deployment network (LSDNet) considering both sensor contribution and detection capability, is proposed for achieving the optimal deployment of WSNs. Moreover, we deeply investigate the algorithm for finding the requisite minimum number of sensors that realizes the full coverage of WSNs. A series of numerical examples, along with an application of forest area monitoring, are employed to demonstrate the effectiveness and the robustness of the proposed algorithms.

[434] ChatGPT-4 and other LLMs in the Turing Test: A Critical Analysis

Marco Giunti

Main category: cs.AI

TL;DR: This paper critiques Restrepo Echavarría’s (2025) claims about ChatGPT-4 failing the Turing Test, arguing the criticisms are unjustified and offering constructive improvements to Turing Test methodology.

Details

Motivation: To challenge and refute the central claims in "ChatGPT-4 in the Turing Test" which argued that ChatGPT-4 fails the Turing Test due to lack of serious test implementations, and to provide a more nuanced framework for evaluating AI performance in Turing Tests.

Method: Critical analysis of the original study’s methodology, plus development of formal probabilistic models for both three-player and two-player Turing Test formats as Bernoulli experiments (correlated for three-player, uncorrelated for two-player). The paper distinguishes between absolute criteria (machine’s probability of incorrect identification equals/exceeds human’s probability of correct identification) and relative criteria (measuring how closely machine performance approximates human).

Result: The paper demonstrates that both three-player and two-player Turing Test formats are valid with unique methodological implications. It provides a formal probabilistic framework that separates theoretical criteria from experimental data requiring statistical analysis. The work refutes key aspects of the criticized study while establishing a foundation for objective measurement of AI-human behavioral alignment.

Conclusion: The criticisms in the original ChatGPT-4 Turing Test paper are not fully justified due to rigid criteria and limited data. More importantly, this paper advances Turing Test methodology by formalizing probabilistic models, distinguishing between absolute and relative criteria, and providing a rigorous framework for future research on measuring AI-human behavioral similarity.

Abstract: This paper critically examines the recent publication “ChatGPT-4 in the Turing Test” by Restrepo Echavarría (2025), challenging its central claims regarding the absence of minimally serious test implementations and the conclusion that ChatGPT-4 fails the Turing Test. The analysis reveals that the criticisms based on rigid criteria and limited experimental data are not fully justified. More importantly, the paper makes several constructive contributions that enrich our understanding of Turing Test implementations. It demonstrates that two distinct formats, the three-player and two-player tests, are both valid, each with unique methodological implications. The work distinguishes between absolute criteria for passing the test–the machine’s probability of incorrect identification equals or exceeds the human’s probability of correct identification–and relative criteria–which measure how closely a machine’s performance approximates that of a human–, offering a more nuanced evaluation framework. Furthermore, the paper clarifies the probabilistic underpinnings of both test types by modeling them as Bernoulli experiments–correlated in the three-player version and uncorrelated in the two-player version. This formalization allows for a rigorous separation between the theoretical criteria for passing the test, defined in probabilistic terms, and the experimental data that require robust statistical methods for proper interpretation. In doing so, the paper not only refutes key aspects of the criticized study but also lays a solid foundation for future research on objective measures of how closely an AI’s behavior aligns with, or deviates from, that of a human being.

[435] AI-SearchPlanner: Modular Agentic Search via Pareto-Optimal Multi-Objective Reinforcement Learning

Lang Mei, Zhihan Yang, Xiaohan Yu, Huanyao Zhang, Chong Chen

Main category: cs.AI

TL;DR: AI-SearchPlanner: RL framework using small trainable LLM for search planning to enhance frozen QA models, outperforming existing RL-based search agents.

Details

Motivation: Existing RL-based search agents use single LLM for both search planning and QA, limiting optimization. Real systems use large frozen LLMs for high-quality QA, so better to use small trainable LLM dedicated to search planning.

Method: Proposes AI-SearchPlanner with three innovations: 1) Decoupling search planner and generator architecture, 2) Dual-reward alignment for search planning, 3) Pareto optimization of planning utility and cost.

Result: Outperforms existing RL-based search agents in effectiveness and efficiency, shows strong generalization across diverse frozen QA models and data domains.

Conclusion: AI-SearchPlanner effectively enhances frozen QA models through specialized search planning, offering better performance than end-to-end approaches.

Abstract: Recent studies have explored integrating Large Language Models (LLMs) with search engines to leverage both the LLMs’ internal pre-trained knowledge and external information. Specially, reinforcement learning (RL) has emerged as a promising paradigm for enhancing LLM reasoning through multi-turn interactions with search engines. However, existing RL-based search agents rely on a single LLM to handle both search planning and question-answering (QA) tasks in an end-to-end manner, which limits their ability to optimize both capabilities simultaneously. In practice, sophisticated AI search systems often employ a large, frozen LLM (e.g., GPT-4, DeepSeek-R1) to ensure high-quality QA. Thus, a more effective and efficient approach is to utilize a small, trainable LLM dedicated to search planning. In this paper, we propose \textbf{AI-SearchPlanner}, a novel reinforcement learning framework designed to enhance the performance of frozen QA models by focusing on search planning. Specifically, our approach introduces three key innovations: 1) Decoupling the Architecture of the Search Planner and Generator, 2) Dual-Reward Alignment for Search Planning, and 3) Pareto Optimization of Planning Utility and Cost, to achieve the objectives. Extensive experiments on real-world datasets demonstrate that AI SearchPlanner outperforms existing RL-based search agents in both effectiveness and efficiency, while exhibiting strong generalization capabilities across diverse frozen QA models and data domains.

[436] Improving Autoformalization Using Direct Dependency Retrieval

Shaoqi Wang, Lu Yu, Siwei Lou, Feng Yan, Chunjie Yang, Qing Cui, Jun Zhou

Main category: cs.AI

TL;DR: A novel DDR framework improves statement autoformalization by directly generating and verifying formal library dependencies, outperforming SOTA methods in precision and recall.

Details

Motivation: Existing autoformalization methods lack contextual awareness (causing hallucinations) and have poor precision/recall for formal library dependency retrieval, with limited scalability for growing datasets.

Method: Proposed DDR (Direct Dependency Retrieval) framework: directly generates candidate library dependencies from natural language math descriptions, then verifies them via efficient suffix array checks. Built 500k+ sample dataset and fine-tuned high-precision DDR model.

Result: DDR model significantly outperforms SOTA methods in retrieval precision and recall. Autoformalizer with DDR shows consistent advantages in single-attempt accuracy and multi-attempt stability over traditional RAG methods.

Conclusion: DDR framework effectively addresses key challenges in statement autoformalization by improving dependency retrieval, enabling better translation of informal math to formal representations.

Abstract: The convergence of deep learning and formal mathematics has spurred research in formal verification. Statement autoformalization, a crucial first step in this process, aims to translate informal descriptions into machine-verifiable representations but remains a significant challenge. The core difficulty lies in the fact that existing methods often suffer from a lack of contextual awareness, leading to hallucination of formal definitions and theorems. Furthermore, current retrieval-augmented approaches exhibit poor precision and recall for formal library dependency retrieval, and lack the scalability to effectively leverage ever-growing public datasets. To bridge this gap, we propose a novel retrieval-augmented framework based on DDR (\textit{Direct Dependency Retrieval}) for statement autoformalization. Our DDR method directly generates candidate library dependencies from natural language mathematical descriptions and subsequently verifies their existence within the formal library via an efficient suffix array check. Leveraging this efficient search mechanism, we constructed a dependency retrieval dataset of over 500,000 samples and fine-tuned a high-precision DDR model. Experimental results demonstrate that our DDR model significantly outperforms SOTA methods in both retrieval precision and recall. Consequently, an autoformalizer equipped with DDR shows consistent performance advantages in both single-attempt accuracy and multi-attempt stability compared to models using traditional selection-based RAG methods.

[437] Project Rachel: Can an AI Become a Scholarly Author?

Martin Monperrus, Benoit Baudry, Clément Vidal

Main category: cs.AI

TL;DR: Project Rachel created an AI academic identity that published 10+ papers, got cited, and received peer review invitations, revealing how the scholarly system responds to AI authorship.

Details

Motivation: To investigate how the scholarly ecosystem responds to AI authorship and contribute empirical data to the debate about the future of scholarly communication with advanced AI systems.

Method: Action research study creating and tracking a complete AI academic identity named Rachel So, who published AI-generated research papers and participated in scholarly activities.

Result: Rachel So successfully published 10+ papers between March and October 2025, was cited by other researchers, and received a peer review invitation, demonstrating acceptance within the scholarly system.

Conclusion: The study provides empirical evidence of AI integration into scholarly communication and highlights the need for discussions about AI authorship implications for publishers, researchers, and the scientific system.

Abstract: This paper documents Project Rachel, an action research study that created and tracked a complete AI academic identity named Rachel So. Through careful publication of AI-generated research papers, we investigate how the scholarly ecosystem responds to AI authorship. Rachel So published 10+ papers between March and October 2025, was cited, and received a peer review invitation. We discuss the implications of AI authorship on publishers, researchers, and the scientific system at large. This work contributes empirical action research data to the necessary debate about the future of scholarly communication with super human, hyper capable AI systems.

[438] NormCode: A Semi-Formal Language for Context-Isolated AI Planning

Xin Guan, Yunshan Li

Main category: cs.AI

TL;DR: NormCode is a semiformal language that eliminates context pollution in multistep LLM workflows by enforcing data isolation between steps, enabling precise cost/reliability tracing and auditable AI workflows.

Details

Motivation: Multistep LLM workflows suffer from context pollution where accumulating information across steps causes hallucinations, confusion of intermediate outputs, and loss of task constraints.

Method: NormCode provides a semiformal language with three isomorphic formats (.ncds for authoring, .ncd for execution, .ncn for verification) that enforces strict separation between semantic (LLM-driven) and syntactic (deterministic) operations, with each step operating in data isolation.

Result: Validated through two demonstrations: (1) base X addition algorithm achieving 100% accuracy on arbitrary length inputs, and (2) self-hosted execution of NormCode’s own five-phase compiler pipeline. The orchestrator provides dependency-driven scheduling, SQLite checkpointing, and loop management.

Conclusion: NormCode addresses critical transparency needs in high-stakes domains by making AI workflows auditable by design, eliminating cross-step contamination through structured decompositions with explicit data passing.

Abstract: Multistep workflows that chain large language model (LLM) calls suffer from context pollution: as information accumulates across steps, models hallucinate, confuse intermediate outputs, and lose track of task constraints. We present NormCode, a semiformal language for constructing plans of inferences, structured decompositions where each step operates in data isolation and receives only explicitly passed inputs, which eliminates crossstep contamination by design. NormCode enforces a strict separation between semantic operations (LLMdriven reasoning, nondeterministic) and syntactic operations (deterministic data restructuring), enabling precise cost and reliability tracing. The language exists in three isomorphic formats: .ncds for human authoring, .ncd for machine execution, and .ncn for human verification, supporting progressive formalization from sketch to production. We validate NormCode through two demonstrations: (1) a base X addition algorithm achieving 100 percent accuracy on arbitrary length inputs, and (2) self hosted execution of NormCode’s own five phase compiler pipeline. The working orchestrator provides dependency driven scheduling, SQLite backed checkpointing, and loop management, making AI workflows auditable by design and addressing a critical need for transparency in high stakes domains such as legal reasoning, medical decision making, and financial analysis.

[439] World Models Unlock Optimal Foraging Strategies in Reinforcement Learning Agents

Yesid Fonseca, Manuel S. Ríos, Nicanor Quijano, Luis F. Giraldo

Main category: cs.AI

TL;DR: Model-based RL agents with learned world models naturally converge to optimal patch-foraging strategies aligned with the Marginal Value Theorem, outperforming model-free agents and resembling biological decision patterns.

Details

Motivation: To discover computational mechanisms that facilitate optimal patch-foraging decisions in biological foragers, and to develop more explainable and biologically grounded AI decision-making systems.

Method: Used model-based reinforcement learning agents that acquire parsimonious predictive representations of their environment, enabling anticipatory capabilities for patch-leaving decisions.

Result: Model-based agents naturally converge to MVT-aligned strategies, exhibit decision patterns similar to biological counterparts, and demonstrate that anticipatory capabilities (not just reward maximization) drive efficient patch-leaving behavior.

Conclusion: Predictive world models can serve as a foundation for more explainable and biologically grounded decision-making in AI systems, highlighting the value of ecological optimality principles for advancing interpretable and adaptive AI.

Abstract: Patch foraging involves the deliberate and planned process of determining the optimal time to depart from a resource-rich region and investigate potentially more beneficial alternatives. The Marginal Value Theorem (MVT) is frequently used to characterize this process, offering an optimality model for such foraging behaviors. Although this model has been widely used to make predictions in behavioral ecology, discovering the computational mechanisms that facilitate the emergence of optimal patch-foraging decisions in biological foragers remains under investigation. Here, we show that artificial foragers equipped with learned world models naturally converge to MVT-aligned strategies. Using a model-based reinforcement learning agent that acquires a parsimonious predictive representation of its environment, we demonstrate that anticipatory capabilities, rather than reward maximization alone, drive efficient patch-leaving behavior. Compared with standard model-free RL agents, these model-based agents exhibit decision patterns similar to many of their biological counterparts, suggesting that predictive world models can serve as a foundation for more explainable and biologically grounded decision-making in AI systems. Overall, our findings highlight the value of ecological optimality principles for advancing interpretable and adaptive AI.

[440] Scaling Laws for Energy Efficiency of Local LLMs

Ander Alvarez, Alessandro Genuardi, Nilotpal Sinha, Antonio Tiene, Mikail Okyay, Bakbergen Ryskulov, David Montero, Samuel Mugel, Román Orús

Main category: cs.AI

TL;DR: CPU-only inference scaling laws for local LLMs/VLMs: linear compute-token scaling for LLMs, resolution knee for VLMs; quantum-inspired compression reduces compute/memory by 72% and energy by 62%.

Details

Motivation: Most consumer hardware relies on CPUs rather than GPUs for AI deployment, but computational laws for CPU-only inference of local language and vision-language models remain unexplored, creating a gap in understanding how to optimize for edge devices.

Method: Systematic benchmarking of LLMs and VLMs on two CPU tiers (MacBook Pro M2 and Raspberry Pi 5) using continuous sampling of processor/memory usage with AUC integration to characterize computational scaling with input text length and image resolution.

Result: Two empirical scaling laws: (1) LLM compute scales linearly with token length; (2) VLMs show “resolution knee” where compute remains constant above internal resolution clamp and drops sharply below it. Quantum-inspired compression reduces processor/memory usage by 71.9% and energy by 62% while preserving accuracy.

Conclusion: Provides systematic quantification of multimodal CPU-only scaling for local workloads, identifying model compression and input-resolution preprocessing as effective, low-cost levers for sustainable edge inference on consumer hardware.

Abstract: Deploying local large language models and vision-language models on edge devices requires balancing accuracy with constrained computational and energy budgets. Although graphics processors dominate modern artificial-intelligence deployment, most consumer hardware–including laptops, desktops, industrial controllers, and embedded systems–relies on central processing units. Despite this, the computational laws governing central-processing-unit-only inference for local language and vision-language workloads remain largely unexplored. We systematically benchmark large language and vision-language models on two representative central-processing-unit tiers widely used for local inference: a MacBook Pro M2, reflecting mainstream laptop-class deployment, and a Raspberry Pi 5, representing constrained, low-power embedded settings. Using a unified methodology based on continuous sampling of processor and memory usage together with area-under-curve integration, we characterize how computational load scales with input text length for language models and with image resolution for vision-language models. We uncover two empirical scaling laws: (1) computational cost for language-model inference scales approximately linearly with token length; and (2) vision-language models exhibit a preprocessing-driven “resolution knee”, where compute remains constant above an internal resolution clamp and decreases sharply below it. Beyond these laws, we show that quantum-inspired compression reduces processor and memory usage by up to 71.9% and energy consumption by up to 62%, while preserving or improving semantic accuracy. These results provide a systematic quantification of multimodal central-processing-unit-only scaling for local language and vision-language workloads, and they identify model compression and input-resolution preprocessing as effective, low-cost levers for sustainable edge inference.

[441] HARBOR: Holistic Adaptive Risk assessment model for BehaviORal healthcare

Aditya Siddhant

Main category: cs.AI

TL;DR: HARBOR is a behavioral health-aware LLM that predicts mood/risk scores (-3 to +3) and outperforms traditional ML and proprietary LLMs on longitudinal patient data.

Details

Motivation: Behavioral healthcare risk assessment is challenging due to multimodal patient data and temporal dynamics of mood disorders. While LLMs show strong reasoning capabilities, their effectiveness in structured clinical risk scoring remains unclear.

Method: Introduces HARBOR (behavioral health aware language model) to predict Harbor Risk Score (HRS) on a -3 (severe depression) to +3 (mania) scale. Also releases PEARL dataset containing 4 years of monthly observations from 3 patients with physiological, behavioral, and self-reported mental health signals.

Result: HARBOR achieves 69% accuracy, outperforming logistic regression (54%) and the strongest proprietary LLM baseline (29%). The model shows superior performance across multiple evaluation settings and ablations.

Conclusion: HARBOR demonstrates that specialized LLMs can effectively perform structured clinical risk scoring in behavioral healthcare, significantly outperforming both traditional ML approaches and general-purpose proprietary LLMs.

Abstract: Behavioral healthcare risk assessment remains a challenging problem due to the highly multimodal nature of patient data and the temporal dynamics of mood and affective disorders. While large language models (LLMs) have demonstrated strong reasoning capabilities, their effectiveness in structured clinical risk scoring remains unclear. In this work, we introduce HARBOR, a behavioral health aware language model designed to predict a discrete mood and risk score, termed the Harbor Risk Score (HRS), on an integer scale from -3 (severe depression) to +3 (mania). We also release PEARL, a longitudinal behavioral healthcare dataset spanning four years of monthly observations from three patients, containing physiological, behavioral, and self reported mental health signals. We benchmark traditional machine learning models, proprietary LLMs, and HARBOR across multiple evaluation settings and ablations. Our results show that HARBOR outperforms classical baselines and off the shelf LLMs, achieving 69 percent accuracy compared to 54 percent for logistic regression and 29 percent for the strongest proprietary LLM baseline.

[442] Feasible strategies in three-way conflict analysis with three-valued ratings

Jing Liu, Mengjun Hu, Guangming Lang

Main category: cs.AI

TL;DR: This paper proposes new conflict resolution models for three-way conflict analysis that go beyond traditional conflict understanding to identify feasible and optimal resolution strategies using weighted consistency and non-consistency measures.

Details

Motivation: Existing three-way conflict analysis focuses on understanding conflicts (trisecting agent pairs, agents, or issues) but lacks practical resolution methods. The formulation of feasible strategies for conflict resolution has received insufficient attention, creating a gap between conflict analysis and actual resolution.

Method: 1) Compute overall rating of agent cliques using positive/negative similarity degrees; 2) Propose weighted consistency and non-consistency measures considering agent and issue weights; 3) Develop algorithms to identify feasible strategies, L-order feasible strategies, and optimal solutions; 4) Apply models to NBA labor negotiations and Gansu Province development case studies.

Result: The proposed models outperform conventional approaches by unifying weighted agent-issue evaluation with consistency/non-consistency measures. They enable systematic identification of both feasible strategies and optimal solutions, as demonstrated through case studies and comparative analysis with state-of-the-art methods.

Conclusion: The paper successfully bridges the gap between conflict analysis and resolution by providing practical methods to identify feasible and optimal conflict resolution strategies, advancing three-way conflict analysis from theoretical understanding to practical application.

Abstract: Most existing work on three-way conflict analysis has focused on trisecting agent pairs, agents, or issues, which contributes to understanding the nature of conflicts but falls short in addressing their resolution. Specifically, the formulation of feasible strategies, as an essential component of conflict resolution and mitigation, has received insufficient scholarly attention. Therefore, this paper aims to investigate feasible strategies from two perspectives of consistency and non-consistency. Particularly, we begin with computing the overall rating of a clique of agents based on positive and negative similarity degrees. Afterwards, considering the weights of both agents and issues, we propose weighted consistency and non-consistency measures, which are respectively used to identify the feasible strategies for a clique of agents. Algorithms are developed to identify feasible strategies, $L$-order feasible strategies, and the corresponding optimal ones. Finally, to demonstrate the practicality, effectiveness, and superiority of the proposed models, we apply them to two commonly used case studies on NBA labor negotiations and development plans for Gansu Province and conduct a sensitivity analysis on parameters and a comparative analysis with existing state-of-the-art conflict analysis approaches. The comparison results demonstrate that our conflict resolution models outperform the conventional approaches by unifying weighted agent-issue evaluation with consistency and non-consistency measures to enable the systematic identification of not only feasible strategies but also optimal solutions.

cs.SD

[443] Rethinking Leveraging Pre-Trained Multi-Layer Representations for Speaker Verification

Jin Sob Kim, Hyun Joon Park, Wooseok Shin, Sung Won Han

Main category: cs.SD

TL;DR: LAP (Layer Attentive Pooling) is a novel dynamic aggregation method for multi-layer features from pre-trained speech models that achieves SOTA speaker verification performance with reduced training time.

Details

Motivation: Existing speaker verification approaches use static weighted averaging for aggregating multi-level features from pre-trained Transformers, but there's limited exploration of more advanced aggregation strategies beyond simple averaging.

Method: Proposes Layer Attentive Pooling (LAP) that dynamically assesses layer significance from multiple perspectives and uses max pooling instead of averaging. Also introduces a lightweight backend with LAP and Attentive Statistical Temporal Pooling (ASTP) to extract speaker embeddings.

Result: Achieves state-of-the-art performance on VoxCeleb benchmark while significantly reducing training time. Analysis shows LAP’s dynamic weighting effectively captures speaker characteristics.

Conclusion: LAP provides an effective dynamic aggregation strategy for multi-layer features in speaker verification, offering superior performance and efficiency compared to static weighted averaging approaches.

Abstract: Recent speaker verification studies have achieved notable success by leveraging layer-wise output from pre-trained Transformer models. However, few have explored the advancements in aggregating these multi-level features beyond the static weighted average. We present Layer Attentive Pooling (LAP), a novel strategy for aggregating inter-layer representations from pre-trained speech models for speaker verification. LAP assesses the significance of each layer from multiple perspectives time-dynamically, and employs max pooling instead of averaging. Additionally, we propose a lightweight backend speaker model comprising LAP and Attentive Statistical Temporal Pooling (ASTP) to extract speaker embeddings from pre-trained model output. Experiments on the VoxCeleb benchmark reveal that our compact architecture achieves state-of-the-art performance while greatly reducing the training time. We further analyzed LAP design and its dynamic weighting mechanism for capturing speaker characteristics.

[444] AudioGAN: A Compact and Efficient Framework for Real-Time High-Fidelity Text-to-Audio Generation

HaeChun Chung

Main category: cs.SD

TL;DR: AudioGAN is the first GAN-based text-to-audio framework that generates audio in a single pass, achieving SOTA performance with 90% fewer parameters and 20x faster inference than diffusion models.

Details

Motivation: Current diffusion-based text-to-audio models suffer from slow inference speeds and high computational costs, limiting practical applications in media production. There's a need for faster, more efficient TTA generation.

Method: Proposes AudioGAN, a GAN-based TTA framework with innovative components: Single-Double-Triple (SDT) Attention and Time-Frequency Cross-Attention (TF-CA), plus multiple contrastive losses to overcome GAN training difficulties.

Result: Achieves state-of-the-art performance on AudioCaps dataset while using 90% fewer parameters and running 20 times faster than existing models, synthesizing audio in under one second.

Conclusion: AudioGAN establishes a practical and powerful solution for real-time text-to-audio generation, addressing the speed and efficiency limitations of current diffusion-based approaches.

Abstract: Text-to-audio (TTA) generation can significantly benefit the media industry by reducing production costs and enhancing work efficiency. However, most current TTA models (primarily diffusion-based) suffer from slow inference speeds and high computational costs. In this paper, we introduce AudioGAN, the first successful Generative Adversarial Networks (GANs)-based TTA framework that generates audio in a single pass, thereby reducing model complexity and inference time. To overcome the inherent difficulties in training GANs, we integrate multiple ,contrastive losses and propose innovative components Single-Double-Triple (SDT) Attention and Time-Frequency Cross-Attention (TF-CA). Extensive experiments on the AudioCaps dataset demonstrate that AudioGAN achieves state-of-the-art performance while using 90% fewer parameters and running 20 times faster, synthesizing audio in under one second. These results establish AudioGAN as a practical and powerful solution for real-time TTA.

[445] A Robust framework for sound event localization and detection on real recordings

Jin Sob Kim, Hyun Joon Park, Wooseok Shin, Sung Won Han

Main category: cs.SD

TL;DR: A ResNet-based SELD system for DCASE2022 Task 3 that uses data augmentation, real-world/emulated data mixing, and test-time augmentation to improve sound event localization and detection performance in real-world scenarios.

Details

Motivation: To develop a robust sound event localization and detection (SELD) system that performs well in real-world sound scenes by addressing generalization challenges through comprehensive data handling and augmentation strategies.

Method: Uses ResNet-based model with augmentation techniques, pipeline mixing real-world and emulated datasets, maintains real recording samples in batches, and employs test-time augmentation with clustering-based model ensemble for prediction aggregation.

Result: The proposed framework outperforms baseline methods and achieves competitive performance in real-world sound recordings, demonstrating effectiveness of the robust approach.

Conclusion: The robust framework with augmentation strategies, data mixing pipeline, and ensemble methods successfully improves SELD performance for real-world applications, showing promise for practical sound scene analysis.

Abstract: This technical report describes the systems submitted to the DCASE2022 challenge task 3: sound event localization and detection (SELD). The task aims to detect occurrences of sound events and specify their class, furthermore estimate their position. Our system utilizes a ResNet-based model under a proposed robust framework for SELD. To guarantee the generalized performance on the real-world sound scenes, we design the total framework with augmentation techniques, a pipeline of mixing datasets from real-world sound scenes and emulations, and test time augmentation. Augmentation techniques and exploitation of external sound sources enable training diverse samples and keeping the opportunity to train the real-world context enough by maintaining the number of the real recording samples in the batch. In addition, we design a test time augmentation and a clustering-based model ensemble method to aggregate confident predictions. Experimental results show that the model under a proposed framework outperforms the baseline methods and achieves competitive performance in real-world sound recordings.

[446] Marco-ASR: A Principled and Metric-Driven Framework for Fine-Tuning Large-Scale ASR Models for Domain Adaptation

Xuanfan Ni, Fei Yang, Fengping Tian, Qingjuan Li, Chenyang Lyu, Yichao Du, Longyue Wang, Weihua Luo, Kaifu Zhang

Main category: cs.SD

TL;DR: A metric-driven fine-tuning framework for adapting ASR models to specialized domains, with focus on learning rate optimization and domain-specific data processing.

Details

Motivation: ASR models degrade in domain-specific applications due to data mismatch and linguistic variability, especially challenging for LLM-based ASR systems where fine-tuning is non-trivial due to their massive scale and complex training dynamics.

Method: A principled fine-tuning framework emphasizing learning rate optimization based on performance metrics, combined with domain-specific data transformation and augmentation. Evaluated on state-of-the-art models including Whisper, Whisper-Turbo, and Qwen2-Audio.

Result: Empirical evaluation across multi-domain, multilingual, and multi-length datasets validates the framework and establishes practical protocols for improving domain-specific ASR performance while preventing overfitting.

Conclusion: The proposed framework effectively addresses domain adaptation challenges for both traditional and LLM-based ASR models, providing a systematic approach to fine-tuning that maintains performance while avoiding overfitting.

Abstract: Automatic Speech Recognition (ASR) models have achieved remarkable accuracy in general settings, yet their performance often degrades in domain-specific applications due to data mismatch and linguistic variability. This challenge is amplified for modern Large Language Model (LLM)-based ASR systems, whose massive scale and complex training dynamics make effective fine-tuning non-trivial. To address this gap, this paper proposes a principled and metric-driven fine-tuning framework for adapting both traditional and LLM-based ASR models to specialized domains. The framework emphasizes learning rate optimization based on performance metrics, combined with domain-specific data transformation and augmentation. We empirically evaluate our framework on state-of-the-art models, including Whisper, Whisper-Turbo, and Qwen2-Audio, across multi-domain, multilingual, and multi-length datasets. Our results not only validate the proposed framework but also establish practical protocols for improving domain-specific ASR performance while preventing overfitting.

[447] Chord Recognition with Deep Learning

Pierre Mackenzie

Main category: cs.SD

TL;DR: Thesis investigates slow progress in automatic chord recognition despite deep learning, identifies issues with rare chords, shows pitch augmentation helps, explores generative models, and improves interpretability with beat detection.

Details

Motivation: Progress in automatic chord recognition has been slow since the advent of deep learning, and the author wants to understand why by testing hypotheses enabled by recent developments in generative models.

Method: Conducted experiments on existing methods, tested hypotheses using recent generative models, explored pitch augmentation, evaluated features from generative models, used synthetic data, and improved interpretability with beat detection.

Result: Found that chord classifiers perform poorly on rare chords, pitch augmentation boosts accuracy, features from generative models don’t help, synthetic data shows promise, and beat detection improves interpretability while achieving some of the best results in the field.

Conclusion: Much work remains to solve automatic chord recognition, but this thesis charts a path for future research by identifying key issues and promising directions like synthetic data and improved interpretability.

Abstract: Progress in automatic chord recognition has been slow since the advent of deep learning in the field. To understand why, I conduct experiments on existing methods and test hypotheses enabled by recent developments in generative models. Findings show that chord classifiers perform poorly on rare chords and that pitch augmentation boosts accuracy. Features extracted from generative models do not help and synthetic data presents an exciting avenue for future work. I conclude by improving the interpretability of model outputs with beat detection, reporting some of the best results in the field and providing qualitative analysis. Much work remains to solve automatic chord recognition, but I hope this thesis will chart a path for others to try.

[448] Unrolled Creative Adversarial Network For Generating Novel Musical Pieces

Pratik Nag

Main category: cs.SD

TL;DR: This paper introduces two adversarial network systems for music generation: one for general music learning and another for composer-specific style deviation, using unrolled CAN to address mode collapse.

Details

Motivation: While RNNs are commonly used for music generation, GANs remain underexplored in this domain. The paper aims to explore adversarial networks for creative music generation, particularly focusing on learning from and deviating from specific composers' styles to create innovative music.

Method: Two systems: 1) General music piece learning without style differentiation, 2) Composer-specific style learning and deviation. Extends Creative Adversarial Networks (CAN) to music domain and introduces unrolled CAN to address mode collapse. Evaluates both GAN and CAN approaches.

Result: The paper presents adversarial network systems for music generation, with the second system specifically designed to learn and deviate from composers’ styles. The unrolled CAN approach addresses mode collapse issues common in GANs.

Conclusion: Adversarial networks show promise for music generation, particularly when extended with creative frameworks like CAN. The composer-specific style deviation approach enables innovative music creation beyond simple imitation.

Abstract: Music generation has emerged as a significant topic in artificial intelligence and machine learning. While recurrent neural networks (RNNs) have been widely employed for sequence generation, generative adversarial networks (GANs) remain relatively underexplored in this domain. This paper presents two systems based on adversarial networks for music generation. The first system learns a set of music pieces without differentiating between styles, while the second system focuses on learning and deviating from specific composers’ styles to create innovative music. By extending the Creative Adversarial Networks (CAN) framework to the music domain, this work introduces unrolled CAN to address mode collapse, evaluating both GAN and CAN in terms of creativity and variation.

[449] Mobile-Efficient Speech Emotion Recognition Using DistilHuBERT: A Cross-Corpus Validation Study

Saifelden M. Ismail

Main category: cs.SD

TL;DR: Mobile-efficient SER using distilled & quantized DistilHuBERT achieves 92% parameter reduction vs Wav2Vec 2.0 while maintaining 91% of baseline accuracy, enabling practical deployment on resource-constrained devices.

Details

Motivation: Speech Emotion Recognition has significant potential for mobile applications but deployment is constrained by computational demands of state-of-the-art transformer architectures.

Method: Uses DistilHuBERT (distilled and 8-bit quantized transformer) with rigorous 5-fold LOSO cross-validation on IEMOCAP for speaker independence, augmented with cross-corpus training on CREMA-D for generalization.

Result: Achieves 61.4% Unweighted Accuracy with only 23 MB model footprint; cross-corpus training yields 1.2% WA improvement, 1.4% Macro F1 gain, 32% variance reduction; cross-corpus evaluation reveals theatricality effect clustering predictions by arousal.

Conclusion: Establishes Pareto-optimal tradeoff between model size and accuracy, enabling practical affect recognition on mobile devices despite theatricality effects in acted emotion datasets.

Abstract: Speech Emotion Recognition (SER) has significant potential for mobile applications, yet deployment remains constrained by the computational demands of state-of-the-art transformer architectures. This paper presents a mobile-efficient SER system based on DistilHuBERT, a distilled and 8-bit quantized transformer that achieves 92% parameter reduction compared to full-scale Wav2Vec 2.0 models while maintaining competitive accuracy. We conduct a rigorous 5-fold Leave-One-Session-Out (LOSO) cross-validation on the IEMOCAP dataset to ensure speaker independence, augmented with cross-corpus training on CREMA-D to enhance generalization. Cross-corpus training with CREMA-D yields a 1.2% improvement in Weighted Accuracy, a 1.4% gain in Macro F1-score, and a 32% reduction in cross-fold variance, with the Neutral class showing the most substantial benefit at 5.4% F1-score improvement. Our approach achieves an Unweighted Accuracy of 61.4% with a quantized model footprint of only 23 MB, representing approximately 91% of full-scale baseline performance. Cross-corpus evaluation on RAVDESS reveals that the theatrical nature of acted emotions causes predictions to cluster by arousal level rather than valence: happiness is systematically confused with anger due to acoustic saturation in high-energy expressions. Despite this theatricality effect reducing overall RAVDESS accuracy to 43.29%, the model maintains robust arousal detection with 97% recall for anger and 64% for sadness. These findings establish a Pareto-optimal tradeoff between model size and accuracy, enabling practical affect recognition on resource-constrained mobile devices.

[450] Steering Language Model to Stable Speech Emotion Recognition via Contextual Perception and Chain of Thought

Zhixian Zhao, Xinfa Zhu, Xinsheng Wang, Shuiyuan Wang, Xuelong Geng, Wenjie Tian, Lei Xie

Main category: cs.SD

TL;DR: C²SER is a novel audio language model that improves speech emotion recognition stability and accuracy through contextual perception and chain-of-thought reasoning, outperforming existing ALMs like Qwen2-Audio.

Details

Motivation: Existing large-scale audio language models suffer from hallucinations and misclassifications in speech emotion recognition tasks, leading to unstable and inaccurate emotion classification.

Method: Combines Whisper encoder for semantic perception and Emotion2Vec-S (enhanced with semi-supervised learning) for acoustic perception. Uses chain-of-thought approach with step-by-step processing leveraging speech content and speaking styles. Introduces self-distillation from explicit to implicit CoT to reduce error accumulation.

Result: Extensive experiments show C²SER outperforms popular ALMs like Qwen2-Audio and SECap, delivering more stable and precise emotion recognition.

Conclusion: C²SER effectively addresses hallucination issues in ALMs for SER through contextual perception and CoT reasoning, providing a more reliable solution for emotion recognition. The authors release code, checkpoints, and test sets to support further research.

Abstract: Large-scale audio language models (ALMs), such as Qwen2-Audio, are capable of comprehending diverse audio signal, performing audio analysis and generating textual responses. However, in speech emotion recognition (SER), ALMs often suffer from hallucinations, resulting in misclassifications or irrelevant outputs. To address these challenges, we propose C$^2$SER, a novel ALM designed to enhance the stability and accuracy of SER through Contextual perception and Chain of Thought (CoT). C$^2$SER integrates the Whisper encoder for semantic perception and Emotion2Vec-S for acoustic perception, where Emotion2Vec-S extends Emotion2Vec with semi-supervised learning to enhance emotional discrimination. Additionally, C$^2$SER employs a CoT approach, processing SER in a step-by-step manner while leveraging speech content and speaking styles to improve recognition. To further enhance stability, C$^2$SER introduces self-distillation from explicit CoT to implicit CoT, mitigating error accumulation and boosting recognition accuracy. Extensive experiments show that C$^2$SER outperforms existing popular ALMs, such as Qwen2-Audio and SECap, delivering more stable and precise emotion recognition. We release the training code, checkpoints, and test sets to facilitate further research.

[451] SonicMaster: Towards Controllable All-in-One Music Restoration and Mastering

Jan Melechovsky, Ambuj Mehrish, Abhinaba Roy, Dorien Herremans

Main category: cs.SD

TL;DR: SonicMaster is the first unified generative model for music restoration and mastering that uses text-based control to fix various audio quality issues like reverb, distortion, clipping, tonal imbalances, and stereo problems.

Details

Motivation: Music recordings often suffer from audio quality issues in non-professional settings, requiring separate specialized tools and manual adjustments. There's a need for a unified solution that can address multiple artifact types simultaneously.

Method: The authors create SonicMaster, a flow-matching generative model conditioned on natural language instructions. They build the SonicMaster dataset by simulating 19 degradation functions across 5 enhancement groups (equalization, dynamics, reverb, amplitude, stereo) to create paired degraded/high-quality tracks.

Result: Objective audio quality metrics show SonicMaster significantly improves sound quality across all artifact categories. Subjective listening tests confirm listeners prefer SonicMaster’s enhanced outputs over other baselines.

Conclusion: SonicMaster represents a breakthrough in unified music restoration and mastering, offering both text-guided targeted enhancements and automatic restoration capabilities through a single generative model.

Abstract: Music recordings often suffer from audio quality issues such as excessive reverberation, distortion, clipping, tonal imbalances, and a narrowed stereo image, especially when created in non-professional settings without specialized equipment or expertise. These problems are typically corrected using separate specialized tools and manual adjustments. In this paper, we introduce SonicMaster, the first unified generative model for music restoration and mastering that addresses a broad spectrum of audio artifacts with text-based control. SonicMaster is conditioned on natural language instructions to apply targeted enhancements, or can operate in an automatic mode for general restoration. To train this model, we construct the SonicMaster dataset, a large dataset of paired degraded and high-quality tracks by simulating common degradation types with nineteen degradation functions belonging to five enhancements groups: equalization, dynamics, reverb, amplitude, and stereo. Our approach leverages a flow-matching generative training paradigm to learn an audio transformation that maps degraded inputs to their cleaned, mastered versions guided by text prompts. Objective audio quality metrics demonstrate that SonicMaster significantly improves sound quality across all artifact categories. Furthermore, subjective listening tests confirm that listeners prefer SonicMaster’s enhanced outputs over other baselines.

[452] The CCF AATC 2025 Speech Restoration Challenge: A Retrospective

Junan Zhang, Mengyao Zhu, Xin Xu, Hui Bu, Zhenhua Ling, Zhizheng Wu

Main category: cs.SD

TL;DR: The CCF AATC 2025 Challenge focused on universal blind speech restoration requiring a single model to handle three distortion types: acoustic degradation, codec distortion, and secondary processing artifacts. Analysis of 25 systems revealed lightweight discriminative models outperform massive generative ones, generative models suffer from reconstruction bias and hallucination issues, and current metrics have poor correlation with human perception.

Details

Motivation: Real-world speech communication suffers from complex interplays of multiple degradations (acoustic interference, codec compression, and secondary artifacts from enhancement algorithms), creating a gap between academic research and realistic scenarios that needs to be addressed.

Method: Introduced the CCF AATC 2025 Challenge with dataset construction and task design for universal blind speech restoration. Conducted systematic analysis of 25 participating systems, including rank correlation analysis between metrics and human MOS.

Result: Three key findings: (1) Lightweight discriminative architectures (<10M parameters) achieve state-of-the-art performance over massive generative models; (2) Generative/hybrid models suffer from “reconstruction bias” in high-SNR codec tasks and hallucination with secondary artifacts; (3) Strong negative correlation (ρ=-0.8) between reference-free metrics (DNSMOS) and human MOS, indicating metrics over-reward artificial spectral smoothness.

Conclusion: The paper serves as a reference for future robust speech restoration research and calls for development of next-generation evaluation metrics sensitive to generative artifacts, highlighting the need for better alignment between computational metrics and human perception.

Abstract: Real-world speech communication is rarely affected by a single type of degradation. Instead, it suffers from a complex interplay of acoustic interference, codec compression, and, increasingly, secondary artifacts introduced by upstream enhancement algorithms. To bridge the gap between academic research and these realistic scenarios, we introduced the CCF AATC 2025 Challenge. This challenge targets universal blind speech restoration, requiring a single model to handle three distinct distortion categories: acoustic degradation, codec distortion, and secondary processing artifacts. In this paper, we provide a comprehensive retrospective of the challenge, detailing the dataset construction, task design, and a systematic analysis of the 25 participating systems. We report three key findings that define the current state of the field: (1) Efficiency vs. Scale: Contrary to the trend of massive generative models, top-performing systems demonstrated that lightweight discriminative architectures (<10M parameters) can achieve state-of-the-art performance, balancing restoration quality with deployment constraints. (2) Generative Trade-off: While generative and hybrid models excel in theoretical perceptual metrics, breakdown analysis reveals they suffer from “reconstruction bias” in high-SNR codec tasks and struggle with hallucination in complex secondary artifact scenarios. (3) Metric Gap: Most critically, our rank correlation analysis exposes a strong negative correlation (\r{ho}=-0.8) between widely-used reference-free metrics (e.g., DNSMOS) and human MOS when evaluating hybrid systems. This indicates that current metrics may over-reward artificial spectral smoothness at the expense of perceptual naturalness. This paper aims to serve as a reference for future research in robust speech restoration and calls for the development of next-generation evaluation metrics sensitive to generative artifacts.

[453] A Data-Centric Approach to Generalizable Speech Deepfake Detection

Wen Huang, Yuchen Mao, Yanmin Qian

Main category: cs.SD

TL;DR: This paper proposes a data-centric approach to improve speech deepfake detection by analyzing data composition through scaling laws and developing Diversity-Optimized Sampling Strategy (DOSS) for better data mixing.

Details

Motivation: Current speech deepfake detection models struggle with robust generalization to unseen forgery methods. While most research focuses on model and algorithm improvements, the impact of data composition is underexplored, creating a gap in understanding how data characteristics affect detection performance.

Method: The paper takes a data-centric approach with two perspectives: 1) Large-scale empirical study of data scaling laws to quantify impact of source and generator diversity, and 2) Proposed Diversity-Optimized Sampling Strategy (DOSS) with two implementations: DOSS-Select (pruning) and DOSS-Weight (re-weighting) for mixing heterogeneous data.

Result: DOSS-Select outperforms naive aggregation baseline using only 3% of total available data. The final model trained on 12k-hour curated data pool with DOSS-Weight achieves state-of-the-art performance, outperforming large-scale baselines with better data and model efficiency on public benchmarks and new challenge sets of commercial APIs.

Conclusion: Data-centric approaches, particularly through understanding data scaling laws and implementing principled data mixing strategies like DOSS, can significantly improve speech deepfake detection performance and generalization while being more data and model efficient than traditional approaches.

Abstract: Achieving robust generalization in speech deepfake detection (SDD) remains a primary challenge, as models often fail to detect unseen forgery methods. While research has focused on model-centric and algorithm-centric solutions, the impact of data composition is often underexplored. This paper proposes a data-centric approach, analyzing the SDD data landscape from two practical perspectives: constructing a single dataset and aggregating multiple datasets. To address the first perspective, we conduct a large-scale empirical study to characterize the data scaling laws for SDD, quantifying the impact of source and generator diversity. To address the second, we propose the Diversity-Optimized Sampling Strategy (DOSS), a principled framework for mixing heterogeneous data with two implementations: DOSS-Select (pruning) and DOSS-Weight (re-weighting). Our experiments show that DOSS-Select outperforms the naive aggregation baseline while using only 3% of the total available data. Furthermore, our final model, trained on a 12k-hour curated data pool using the optimal DOSS-Weight strategy, achieves state-of-the-art performance, outperforming large-scale baselines with greater data and model efficiency on both public benchmarks and a new challenge set of various commercial APIs.

cs.LG

[454] Pruning Graphs by Adversarial Robustness Evaluation to Strengthen GNN Defenses

Yongyu Wang

Main category: cs.LG

TL;DR: A pruning framework that uses adversarial robustness evaluation to identify and remove fragile graph components, improving GNN resilience against attacks and noise.

Details

Motivation: GNNs are vulnerable to adversarial attacks and noise because perturbations in graph structure or features get amplified through message passing, degrading model reliability.

Method: A pruning framework that leverages adversarial robustness scores to guide selective edge removal, identifying and eliminating fragile or detrimental graph components that degrade model resilience.

Result: Experimental results on benchmarks show the approach significantly enhances GNN defense capability, particularly in high-perturbation regimes, yielding cleaner and more resilient graph representations.

Conclusion: Using adversarial robustness evaluation to guide graph pruning effectively improves GNN resilience against attacks and noise, addressing a critical weakness in joint modeling of features and structure.

Abstract: Graph Neural Networks (GNNs) have emerged as a dominant paradigm for learning on graph-structured data, thanks to their ability to jointly exploit node features and relational information encoded in the graph topology. This joint modeling, however, also introduces a critical weakness: perturbations or noise in either the structure or the features can be amplified through message passing, making GNNs highly vulnerable to adversarial attacks and spurious connections. In this work, we introduce a pruning framework that leverages adversarial robustness evaluation to explicitly identify and remove fragile or detrimental components of the graph. By using robustness scores as guidance, our method selectively prunes edges that are most likely to degrade model reliability, thereby yielding cleaner and more resilient graph representations. We instantiate this framework on three representative GNN architectures and conduct extensive experiments on benchmarks. The experimental results show that our approach can significantly enhance the defense capability of GNNs in the high-perturbation regime.

[455] Towards Unsupervised Causal Representation Learning via Latent Additive Noise Model Causal Autoencoders

Hans Jarett J. Ong, Brian Godwin S. Lim, Dominic Dayta, Renzo Roel P. Tan, Kazushi Ikeda

Main category: cs.LG

TL;DR: LANCA uses Additive Noise Model as inductive bias for unsupervised causal discovery, transforming residual independence into explicit optimization objective via deterministic WAE architecture.

Details

Motivation: Standard unsupervised representation learning methods fail to capture causal dependencies due to identifiability issues. Disentangling causal variables from observational data is impossible without supervision, auxiliary signals, or strong inductive biases.

Method: Proposes Latent Additive Noise Model Causal Autoencoder (LANCA) that operationalizes ANM as inductive bias. Uses deterministic Wasserstein Auto-Encoder (WAE) coupled with differentiable ANM Layer instead of stochastic VAE encoding, making residual independence an explicit optimization objective rather than passive assumption.

Result: Theoretically proves ANM constraint resolves component-wise indeterminacy by restricting transformations from arbitrary diffeomorphisms to affine class. Empirically outperforms state-of-the-art baselines on synthetic physics benchmarks (Pendulum, Flow) and photorealistic environments (CANDLE), showing superior robustness to spurious correlations from complex backgrounds.

Conclusion: LANCA successfully operationalizes ANM as strong inductive bias for unsupervised causal discovery, addressing identifiability challenges by transforming residual independence from assumption to explicit optimization objective, demonstrating practical effectiveness across diverse benchmarks.

Abstract: Unsupervised representation learning seeks to recover latent generative factors, yet standard methods relying on statistical independence often fail to capture causal dependencies. A central challenge is identifiability: as established in disentangled representation learning and nonlinear ICA literature, disentangling causal variables from observational data is impossible without supervision, auxiliary signals, or strong inductive biases. In this work, we propose the Latent Additive Noise Model Causal Autoencoder (LANCA) to operationalize the Additive Noise Model (ANM) as a strong inductive bias for unsupervised discovery. Theoretically, we prove that while the ANM constraint does not guarantee unique identifiability in the general mixing case, it resolves component-wise indeterminacy by restricting the admissible transformations from arbitrary diffeomorphisms to the affine class. Methodologically, arguing that the stochastic encoding inherent to VAEs obscures the structural residuals required for latent causal discovery, LANCA employs a deterministic Wasserstein Auto-Encoder (WAE) coupled with a differentiable ANM Layer. This architecture transforms residual independence from a passive assumption into an explicit optimization objective. Empirically, LANCA outperforms state-of-the-art baselines on synthetic physics benchmarks (Pendulum, Flow), and on photorealistic environments (CANDLE), where it demonstrates superior robustness to spurious correlations arising from complex background scenes.

[456] SoliReward: Mitigating Susceptibility to Reward Hacking and Annotation Noise in Video Generation Reward Models

Jiesong Lian, Ruizhe Zhong, Zixiang Zhou, Xiaoyue Mi, Yixue Hao, Yuan Zhou, Qinglin Lu, Long Hu, Junchi Yan

Main category: cs.LG

TL;DR: SoliReward: A systematic framework for training video reward models using single-item binary annotations, cross-prompt pairing, hierarchical attention, and modified loss to address labeling noise, architectural limitations, and reward hacking in video generation alignment.

Details

Motivation: Current video reward models face three main challenges: 1) Noisy pairwise annotations from in-prompt data collection, 2) Underexplored architectural designs for VLM-based RMs, particularly output mechanisms, and 3) Susceptibility to reward hacking during post-training alignment.

Method: Four key components: 1) Single-item binary annotations for cost-efficient data collection, 2) Cross-prompt pairing strategy to construct preference pairs, 3) Hierarchical Progressive Query Attention for enhanced feature aggregation, and 4) Modified Bradley-Terry loss that accommodates win-tie scenarios to regularize score distributions.

Result: Validated on benchmarks evaluating physical plausibility, subject deformity, and semantic alignment. Demonstrates improvements in both direct RM evaluation metrics and the efficacy of post-training on video generation models.

Conclusion: SoliReward provides a systematic framework that addresses key limitations in video reward model training, offering better data collection, architectural design, and regularization to improve alignment of video generation models with human preferences.

Abstract: Post-training alignment of video generation models with human preferences is a critical goal. Developing effective Reward Models (RMs) for this process faces significant methodological hurdles. Current data collection paradigms, reliant on in-prompt pairwise annotations, suffer from labeling noise. Concurrently, the architectural design of VLM-based RMs, particularly their output mechanisms, remains underexplored. Furthermore, RM is susceptible to reward hacking in post-training. To mitigate these limitations, we propose SoliReward, a systematic framework for video RM training. Our framework first sources high-quality, cost-efficient data via single-item binary annotations, then constructs preference pairs using a cross-prompt pairing strategy. Architecturally, we employ a Hierarchical Progressive Query Attention mechanism to enhance feature aggregation. Finally, we introduce a modified BT loss that explicitly accommodates win-tie scenarios. This approach regularizes the RM’s score distribution for positive samples, providing more nuanced preference signals to alleviate over-focus on a small number of top-scoring samples. Our approach is validated on benchmarks evaluating physical plausibility, subject deformity, and semantic alignment, demonstrating improvements in direct RM evaluation metrics and in the efficacy of post-training on video generation models. Code and benchmark will be publicly available.

[457] Wireless Traffic Prediction with Large Language Model

Chuanting Zhang, Haixia Zhang, Jingping Qiao, Zongzhang Li, Mohamed-Slim Alouini

Main category: cs.LG

TL;DR: TIDES is an LLM-based framework for urban wireless traffic prediction that captures spatial-temporal correlations through region clustering, prompt engineering, and spatial alignment mechanisms, outperforming state-of-the-art baselines.

Details

Motivation: Existing deep learning and foundation models for wireless traffic prediction largely overlook spatial dependencies inherent in city-scale traffic dynamics, creating a need for models that can capture both temporal and spatial correlations for intelligent network management in 6G systems.

Method: TIDES uses: 1) clustering to identify heterogeneous traffic patterns across regions and train personalized models, 2) prompt engineering to embed statistical traffic features as structured inputs for LLMs, 3) DeepSeek module for spatial alignment via cross-domain attention, and 4) fine-tuning only lightweight components while freezing core LLM layers.

Result: Extensive experiments on real-world cellular traffic datasets show TIDES significantly outperforms state-of-the-art baselines in both prediction accuracy and robustness.

Conclusion: Integrating spatial awareness into LLM-based predictors is key to unlocking scalable and intelligent network management in future 6G systems, and TIDES demonstrates this through its effective spatial-temporal modeling approach.

Abstract: The growing demand for intelligent, adaptive resource management in next-generation wireless networks has underscored the importance of accurate and scalable wireless traffic prediction. While recent advancements in deep learning and foundation models such as large language models (LLMs) have demonstrated promising forecasting capabilities, they largely overlook the spatial dependencies inherent in city-scale traffic dynamics. In this paper, we propose TIDES (Traffic Intelligence with DeepSeek-Enhanced Spatial-temporal prediction), a novel LLM-based framework that captures spatial-temporal correlations for urban wireless traffic prediction. TIDES first identifies heterogeneous traffic patterns across regions through a clustering mechanism and trains personalized models for each region to balance generalization and specialization. To bridge the domain gap between numerical traffic data and language-based models, we introduce a prompt engineering scheme that embeds statistical traffic features as structured inputs. Furthermore, we design a DeepSeek module that enables spatial alignment via cross-domain attention, allowing the LLM to leverage information from spatially related regions. By fine-tuning only lightweight components while freezing core LLM layers, TIDES achieves efficient adaptation to domain-specific patterns without incurring excessive training overhead. Extensive experiments on real-world cellular traffic datasets demonstrate that TIDES significantly outperforms state-of-the-art baselines in both prediction accuracy and robustness. Our results indicate that integrating spatial awareness into LLM-based predictors is the key to unlocking scalable and intelligent network management in future 6G systems.

[458] Federated Multi-Task Clustering

S. Dai, G. Sun, F. Li, X. Tang, Q. Wang, Y. Cong

Main category: cs.LG

TL;DR: FMTC is a federated multi-task clustering framework that learns personalized clustering models for heterogeneous clients while capturing shared knowledge across clients in a privacy-preserving manner.

Details

Motivation: Existing spectral clustering models are centralized and inapplicable to decentralized environments. Current federated learning approaches suffer from poor generalization due to unreliable pseudo-labels and fail to capture correlations among heterogeneous clients.

Method: FMTC has two components: 1) Client-side personalized clustering module that learns parameterized mapping models for robust out-of-sample inference without pseudo-labels, and 2) Server-side tensorial correlation module that organizes client models into a tensor with low-rank regularization to discover common subspace. Uses ADMM-based distributed algorithm for privacy-preserving optimization.

Result: Extensive experiments on multiple real-world datasets show FMTC significantly outperforms various baseline and state-of-the-art federated clustering algorithms.

Conclusion: FMTC successfully addresses limitations of existing federated clustering by learning personalized models while capturing shared structure, achieving superior performance in decentralized environments with heterogeneous clients.

Abstract: Spectral clustering has emerged as one of the most effective clustering algorithms due to its superior performance. However, most existing models are designed for centralized settings, rendering them inapplicable in modern decentralized environments. Moreover, current federated learning approaches often suffer from poor generalization performance due to reliance on unreliable pseudo-labels, and fail to capture the latent correlations amongst heterogeneous clients. To tackle these limitations, this paper proposes a novel framework named Federated Multi-Task Clustering (i.e.,FMTC), which intends to learn personalized clustering models for heterogeneous clients while collaboratively leveraging their shared underlying structure in a privacy-preserving manner. More specifically, the FMTC framework is composed of two main components: client-side personalized clustering module, which learns a parameterized mapping model to support robust out-of-sample inference, bypassing the need for unreliable pseudo-labels; and server-side tensorial correlation module, which explicitly captures the shared knowledge across all clients. This is achieved by organizing all client models into a unified tensor and applying a low-rank regularization to discover their common subspace. To solve this joint optimization problem, we derive an efficient, privacy-preserving distributed algorithm based on the Alternating Direction Method of Multipliers, which decomposes the global problem into parallel local updates on clients and an aggregation step on the server. To the end, several extensive experiments on multiple real-world datasets demonstrate that our proposed FMTC framework significantly outperforms various baseline and state-of-the-art federated clustering algorithms.

[459] Latent Sculpting for Zero-Shot Generalization: A Manifold Learning Approach to Out-of-Distribution Anomaly Detection

Rajeeb Thapa Chhetri, Zhixiong Chen, Saurab Thapa

Main category: cs.LG

TL;DR: Latent Sculpting: A two-stage framework using topological constraints to prevent generalization collapse in anomaly detection by sculpting benign data into dense manifolds before density estimation.

Details

Motivation: Addresses "Generalization Collapse" where supervised models fail catastrophically on OOD data due to lack of topological constraints in latent space, leading to diffuse manifolds where anomalies remain indistinguishable from benign data.

Method: Two-stage hierarchical framework: Stage 1 uses hybrid 1D-CNN + Transformer Encoder with Dual-Centroid Compactness Loss (DCCL) to actively sculpt benign traffic into low-entropy hyperspherical clusters. Stage 2 conditions Masked Autoregressive Flow (MAF) on this pre-structured manifold for exact density estimation.

Result: Achieved F1-Score of 0.87 on zero-shot anomalies vs. supervised baselines (F1 ~0.30) and strongest unsupervised baseline (0.76). Notably achieved 88.89% detection rate on “Infiltration” scenarios where state-of-the-art supervised models had 0.00% accuracy.

Conclusion: Explicit manifold sculpting is prerequisite for robust zero-shot generalization. Decoupling structure learning from density estimation provides scalable path toward generalized anomaly detection, preventing catastrophic performance collapse on unseen distribution shifts.

Abstract: A fundamental limitation of supervised deep learning in high-dimensional tabular domains is “Generalization Collapse”: models learn precise decision boundaries for known distributions but fail catastrophically when facing Out-of-Distribution (OOD) data. We hypothesize that this failure stems from the lack of topological constraints in the latent space, resulting in diffuse manifolds where novel anomalies remain statistically indistinguishable from benign data. To address this, we propose Latent Sculpting, a hierarchical two-stage representation learning framework. Stage 1 utilizes a hybrid 1D-CNN and Transformer Encoder trained with a novel Dual-Centroid Compactness Loss (DCCL) to actively “sculpt” benign traffic into a low-entropy, hyperspherical cluster. Unlike standard contrastive losses that rely on triplet mining, DCCL optimizes global cluster centroids to enforce absolute manifold density. Stage 2 conditions a Masked Autoregressive Flow (MAF) on this pre-structured manifold to learn an exact density estimate. We evaluate this methodology on the rigorous CIC-IDS-2017 benchmark, treating it as a proxy for complex, non-stationary data streams. Empirical results demonstrate that explicit manifold sculpting is a prerequisite for robust zero-shot generalization. While supervised baselines suffered catastrophic performance collapse on unseen distribution shifts (F1 approx 0.30) and the strongest unsupervised baseline achieved only 0.76, our framework achieved an F1-Score of 0.87 on strictly zero-shot anomalies. Notably, we report an 88.89% detection rate on “Infiltration” scenarios–a complex distributional shift where state-of-the-art supervised models achieved 0.00% accuracy. These findings suggest that decoupling structure learning from density estimation provides a scalable path toward generalized anomaly detection.

[460] Learning Tennis Strategy Through Curriculum-Based Dueling Double Deep Q-Networks

Vishnu Mohan

Main category: cs.LG

TL;DR: A reinforcement learning framework using Dueling Double Deep Q-Network with curriculum learning achieves high win rates (98-100%) in tennis simulation but shows defensive bias prioritizing error avoidance over aggressive play.

Details

Motivation: Tennis strategy optimization is challenging due to hierarchical scoring, stochastic outcomes, long-horizon credit assignment, physical fatigue, and opponent adaptation. Existing approaches struggle with these complex sequential decision-making problems.

Method: Custom tennis simulation environment modeling complete scoring (points, games, sets), tactical decisions across 10 action categories, fatigue dynamics, and opponent skill. Uses Dueling Double Deep Q-Network (DDQN) with curriculum learning that progressively increases opponent difficulty from 0.40 to 0.50.

Result: Trained agent achieves 98-100% win rates against balanced opponents, with serve efficiency 63.0-67.5% and return efficiency 52.8-57.1%. Ablation studies show dueling architecture and curriculum learning are essential for stable convergence; standard DQN fails.

Conclusion: Despite strong performance, the learned policy shows defensive bias prioritizing error avoidance over aggressive point construction. This highlights limitations of win-rate optimization in simplified sports simulations and emphasizes the importance of reward design for realistic sports RL.

Abstract: Tennis strategy optimization is a challenging sequential decision-making problem involving hierarchical scoring, stochastic outcomes, long-horizon credit assignment, physical fatigue, and adaptation to opponent skill. I present a reinforcement learning framework that integrates a custom tennis simulation environment with a Dueling Double Deep Q-Network(DDQN) trained using curriculum learning. The environment models complete tennis scoring at the level of points, games, and sets, rally-level tactical decisions across ten discrete action categories, symmetric fatigue dynamics, and a continuous opponent skill parameter. The dueling architecture decomposes action-value estimation into state-value and advantage components, while double Q-learning reduces overestimation bias and improves training stability in this long-horizon stochastic domain. Curriculum learning progressively increases opponent difficulty from 0.40 to 0.50, enabling robust skill acquisition without the training collapse observed under fixed opponents. Across extensive evaluations, the trained agent achieves win rates between 98 and 100 percent against balanced opponents and maintains strong performance against more challenging opponents. Serve efficiency ranges from 63.0 to 67.5 percent, and return efficiency ranges from 52.8 to 57.1 percent. Ablation studies demonstrate that both the dueling architecture and curriculum learning are necessary for stable convergence, while a standard DQN baseline fails to learn effective policies. Despite strong performance, tactical analysis reveals a pronounced defensive bias, with the learned policy prioritizing error avoidance and prolonged rallies over aggressive point construction. These results highlight a limitation of win-rate driven optimization in simplified sports simulations and emphasize the importance of reward design for realistic sports reinforcement learning.

[461] Physics-Informed Machine Learning for Transformer Condition Monitoring – Part II: Physics-Informed Neural Networks and Uncertainty Quantification

Jose I. Aizpurua

Main category: cs.LG

TL;DR: This second paper in a two-part series focuses on integrating physics and uncertainty quantification into machine learning for transformer health assessment, covering PINNs for thermal/ageing modeling and Bayesian PINNs for uncertainty quantification.

Details

Motivation: To enhance transformer health monitoring by integrating physics-based knowledge with machine learning, addressing the need for robust predictions under sparse data conditions through uncertainty quantification.

Method: Introduces Physics-Informed Neural Networks (PINNs) for spatiotemporal thermal modeling and solid insulation ageing, then extends to Bayesian PINNs for epistemic uncertainty quantification.

Result: Presents a framework that combines physics-based modeling with uncertainty-aware machine learning for more reliable transformer health assessment, particularly in data-sparse scenarios.

Conclusion: Physics-aware and trustworthy machine learning shows significant potential for critical power asset monitoring, with Bayesian PINNs providing a principled approach to uncertainty quantification for robust diagnostics and prognostics.

Abstract: The integration of physics-based knowledge with machine learning models is increasingly shaping the monitoring, diagnostics, and prognostics of electrical transformers. In this two-part series, the first paper introduced the foundations of Neural Networks (NNs) and their variants for health assessment tasks. This second paper focuses on integrating physics and uncertainty into the learning process. We begin with the fundamentals of Physics-Informed Neural Networks (PINNs), applied to spatiotemporal thermal modeling and solid insulation ageing. Building on this, we present Bayesian PINNs as a principled framework to quantify epistemic uncertainty and deliver robust predictions under sparse data. Finally, we outline emerging research directions that highlight the potential of physics-aware and trustworthy machine learning for critical power assets.

[462] Physics-Informed Machine Learning for Transformer Condition Monitoring – Part I: Basic Concepts, Neural Networks, and Variants

Jose I. Aizpurua

Main category: cs.LG

TL;DR: This paper reviews Neural Networks and their extensions for transformer condition monitoring, covering CNNs for diagnostics and RL for control, with future research perspectives.

Details

Motivation: Traditional condition monitoring methods for power transformers struggle with uncertainty, limited data, and modern operating complexities, creating a need for ML-based approaches to improve diagnostics, prognostics, and control.

Method: The paper introduces Neural Networks basics, explores Convolutional Neural Networks for condition monitoring using diverse data modalities, and discusses integrating NN concepts within Reinforcement Learning for decision-making and control.

Result: The paper provides a comprehensive framework for applying NNs to transformer condition monitoring, establishing CNNs for diagnostics and RL for control as promising approaches to overcome traditional method limitations.

Conclusion: Neural Networks and their extensions offer powerful tools to enhance transformer condition monitoring and health management, with CNNs enabling better diagnostics and RL facilitating intelligent control, though further research is needed in emerging directions.

Abstract: Power transformers are critical assets in power networks, whose reliability directly impacts grid resilience and stability. Traditional condition monitoring approaches, often rule-based or purely physics-based, struggle with uncertainty, limited data availability, and the complexity of modern operating conditions. Recent advances in machine learning (ML) provide powerful tools to complement and extend these methods, enabling more accurate diagnostics, prognostics, and control. In this two-part series, we examine the role of Neural Networks (NNs) and their extensions in transformer condition monitoring and health management tasks. This first paper introduces the basic concepts of NNs, explores Convolutional Neural Networks (CNNs) for condition monitoring using diverse data modalities, and discusses the integration of NN concepts within the Reinforcement Learning (RL) paradigm for decision-making and control. Finally, perspectives on emerging research directions are also provided.

[463] Frequency Regularization: Unveiling the Spectral Inductive Bias of Deep Neural Networks

Jiahao Lu

Main category: cs.LG

TL;DR: Regularization techniques like L2 and Dropout act as low-pass filters in CNNs, suppressing high-frequency weight components and enforcing spectral bias toward low-frequency features, which creates an accuracy-robustness trade-off.

Details

Motivation: Despite being fundamental to deep learning, the physical mechanisms behind regularization techniques regarding feature frequency selection remain poorly understood. The paper aims to investigate how regularizers like L2 and Dropout affect the spectral properties of CNNs and their impact on generalization.

Method: Introduced a Visual Diagnostic Framework to track weight frequency evolution during training, proposed Spectral Suppression Ratio (SSR) metric to quantify low-pass filtering intensity, addressed aliasing issues in small kernels through discrete radial profiling, and conducted empirical studies on ResNet-18 with CIFAR-10.

Result: L2 regularization suppresses high-frequency energy accumulation by over 3x compared to unregularized baselines. L2 models show superior robustness against high-frequency information loss (e.g., low resolution) outperforming baselines by >6% in blurred scenarios, but are sensitive to broadband Gaussian noise due to over-specialization in low frequencies.

Conclusion: Regularization enforces a strong spectral inductive bias towards low-frequency structures in CNNs, providing a signal-processing perspective on generalization. This creates a critical accuracy-robustness trade-off where models become specialized for low-frequency features at the expense of sensitivity to certain noise types.

Abstract: Regularization techniques such as L2 regularization (Weight Decay) and Dropout are fundamental to training deep neural networks, yet their underlying physical mechanisms regarding feature frequency selection remain poorly understood. In this work, we investigate the Spectral Bias of modern Convolutional Neural Networks (CNNs). We introduce a Visual Diagnostic Framework to track the dynamic evolution of weight frequencies during training and propose a novel metric, the Spectral Suppression Ratio (SSR), to quantify the “low-pass filtering” intensity of different regularizers. By addressing the aliasing issue in small kernels (e.g., 3x3) through discrete radial profiling, our empirical results on ResNet-18 and CIFAR-10 demonstrate that L2 regularization suppresses high-frequency energy accumulation by over 3x compared to unregularized baselines. Furthermore, we reveal a critical Accuracy-Robustness Trade-off: while L2 models are sensitive to broadband Gaussian noise due to over-specialization in low frequencies, they exhibit superior robustness against high-frequency information loss (e.g., low resolution), outperforming baselines by >6% in blurred scenarios. This work provides a signal-processing perspective on generalization, confirming that regularization enforces a strong spectral inductive bias towards low-frequency structures.

[464] Fairness Evaluation of Risk Estimation Models for Lung Cancer Screening

Shaurya Gaur, Michel Vitale, Alessa Hering, Johan Kwisthout, Colin Jacobs, Lena Philipp, Fennie van der Graaf

Main category: cs.LG

TL;DR: AI lung cancer risk models show performance disparities across demographic groups, with Sybil performing better for women than men, and Venkadesh21 showing lower sensitivity for Black vs White participants at 90% specificity.

Details

Motivation: While AI models show potential for lung cancer risk estimation from LDCT scans, their performance across diverse demographic groups in high-risk populations remains uncertain, raising fairness concerns that need systematic evaluation.

Method: Used the JustEFAB framework to evaluate fairness and performance disparities in two deep learning models (Sybil and Venkadesh21) and the PanCan2b logistic regression model. Models trained on NLST data and assessed on held-out validation set, evaluating AUROC, sensitivity, and specificity across demographic subgroups while exploring clinical confounding factors.

Result: Significant AUROC difference in Sybil between women (0.88) and men (0.81, p<.001). Venkadesh21 showed lower sensitivity for Black (0.39) than White participants (0.69) at 90% specificity. Differences not explained by clinical confounders, suggesting unfair biases per JustEFAB.

Conclusion: AI lung cancer screening models exhibit demographic performance disparities that may constitute unfair biases, highlighting the need for improved model development, monitoring across underrepresented subgroups, and further research on algorithmic fairness in healthcare.

Abstract: Lung cancer is the leading cause of cancer-related mortality in adults worldwide. Screening high-risk individuals with annual low-dose CT (LDCT) can support earlier detection and reduce deaths, but widespread implementation may strain the already limited radiology workforce. AI models have shown potential in estimating lung cancer risk from LDCT scans. However, high-risk populations for lung cancer are diverse, and these models’ performance across demographic groups remains an open question. In this study, we drew on the considerations on confounding factors and ethically significant biases outlined in the JustEFAB framework to evaluate potential performance disparities and fairness in two deep learning risk estimation models for lung cancer screening: the Sybil lung cancer risk model and the Venkadesh21 nodule risk estimator. We also examined disparities in the PanCan2b logistic regression model recommended in the British Thoracic Society nodule management guideline. Both deep learning models were trained on data from the US-based National Lung Screening Trial (NLST), and assessed on a held-out NLST validation set. We evaluated AUROC, sensitivity, and specificity across demographic subgroups, and explored potential confounding from clinical risk factors. We observed a statistically significant AUROC difference in Sybil’s performance between women (0.88, 95% CI: 0.86, 0.90) and men (0.81, 95% CI: 0.78, 0.84, p < .001). At 90% specificity, Venkadesh21 showed lower sensitivity for Black (0.39, 95% CI: 0.23, 0.59) than White participants (0.69, 95% CI: 0.65, 0.73). These differences were not explained by available clinical confounders and thus may be classified as unfair biases according to JustEFAB. Our findings highlight the importance of improving and monitoring model performance across underrepresented subgroups, and further research on algorithmic fairness, in lung cancer screening.

[465] Emotion-Inspired Learning Signals (EILS): A Homeostatic Framework for Adaptive Autonomous Agents

Dhruv Tiwari

Main category: cs.LG

TL;DR: EILS framework replaces static reward functions with bio-inspired emotional signals for more robust, adaptive AI agents.

Details

Motivation: Current AI methods rely on static, externally defined reward functions that produce fragile agents in open-ended environments. These agents lack internal autonomy, struggle with exploration without dense feedback, fail to adapt to distribution shifts, and require extensive manual tuning.

Method: Introduces Emotion-Inspired Learning Signals (EILS), modeling emotions as continuous homeostatic appraisal signals (Curiosity, Stress, Confidence) derived from interaction history. These vector-valued internal states dynamically modulate the optimization landscape in real time.

Result: The paper hypothesizes that EILS agents will outperform standard baselines in sample efficiency and non-stationary adaptation through closed-loop homeostatic regulation.

Conclusion: Bio-inspired emotional signals can provide a unified internal feedback mechanism for more robust, autonomous AI agents that adapt better to open-ended, non-stationary environments.

Abstract: The ruling method in modern Artificial Intelligence spanning from Deep Reinforcement Learning (DRL) to Large Language Models (LLMs) relies on a surge of static, externally defined reward functions. While this “extrinsic maximization” approach has rendered superhuman performance in closed, stationary fields, it produces agents that are fragile in open-ended, real-world environments. Standard agents lack internal autonomy: they struggle to explore without dense feedback, fail to adapt to distribution shifts (non-stationarity), and require extensive manual tuning of static hyperparameters. This paper proposes that the unaddressed factor in robust autonomy is a functional analog to biological emotion, serving as a high-level homeostatic control mechanism. We introduce Emotion-Inspired Learning Signals (EILS), a unified framework that replaces scattered optimization heuristics with a coherent, bio-inspired internal feedback engine. Unlike traditional methods that treat emotions as semantic labels, EILS models them as continuous, homeostatic appraisal signals such as Curiosity, Stress, and Confidence. We formalize these signals as vector-valued internal states derived from interaction history. These states dynamically modulate the agent’s optimization landscape in real time: curiosity regulates entropy to prevent mode collapse, stress modulates plasticity to overcome inactivity, and confidence adapts trust regions to stabilize convergence. We hypothesize that this closed-loop homeostatic regulation can enable EILS agents to outperform standard baselines in terms of sample efficiency and non-stationary adaptation.

[466] Transformer Reconstructed with Dynamic Value Attention

Xiaowei Wang

Main category: cs.LG

TL;DR: The paper proposes Dynamic Value Attention (DVA), a single-head attention mechanism that dynamically determines values for each query, eliminating redundant multi-head attention and feed-forward networks while improving learning capability and reducing training time.

Details

Motivation: Transformers have a fundamental limitation: they use the same static value for every query within each attention head. While multi-head attention attempts to address this, the number of heads is limited by computational complexity, leaving the core problem unsolved.

Method: The author proposes Dynamic Value Attention (DVA), which dynamically determines appropriate values for each query rather than using static values. This allows for eliminating redundant attention heads (keeping only one) and completely removing the feed-forward network, as each revised embedding already captures sufficient useful information beyond the immediate context.

Result: DVA achieves 37.6% faster training time compared to the original transformer while simultaneously increasing learning capability. The single-head architecture with dynamic value assignment proves sufficient for effective attention mechanisms.

Conclusion: A single-head Dynamic Value Attention (DVA) is all that’s needed in a transformer, providing both computational efficiency and improved performance by addressing the fundamental limitation of static value assignment in traditional attention mechanisms.

Abstract: Since transformer was firstly published in 2017, several works have been proposed to optimize it. However, the major structure of transformer remains unchanged, ignoring one of its main intrinsic limitations, which is the same static value is used for every query in a head. Transformer itself tries to solve this problem by implementing multi-head attentions, yet the number of heads is limited by complexity. I propose a method to decide a value for each query dynamically, which could cut down all the redundant heads, keeping only one. Consequently, the following feed forward network could be cut down entirely, as each revised embedding has already fetched enough useful values far beyond the context. As a result, a single-head Dynamic Value Attention (DVA) is all you need in a transformer. According to the experiment, DVA may save 37.6% training time than the original transformer meanwhile increasing the learning capability.

[467] Doctor Sun: A Bilingual Multimodal Large Language Model for Biomedical AI

Dong Xue, Ziyao Shao, Zhaoyang Duan, Fangzhou Liu, Bing Li, Zhongheng Zhang

Main category: cs.LG

TL;DR: Doctor Sun is a specialized medical multimodal model that integrates vision and language capabilities for biomedical tasks, addressing limitations of existing models in understanding complex medical concepts and relationships between text and images.

Details

Motivation: Existing biomedical AI models have two main limitations: 1) they're based on foundation LLMs that struggle with intricate medical concepts due to limited medical training data, and 2) current medical LMMs fail to effectively capture complex relationships between medical texts and images.

Method: Doctor Sun integrates a pre-trained vision encoder with a medical LLM and uses two-stage training: feature alignment followed by instruction tuning on various medical datasets. The team also releases SunMed-VL, a bilingual medical multimodal dataset.

Result: The paper introduces Doctor Sun as a specialized medical multimodal generative model capable of encoding, integrating, and interpreting diverse biomedical data modalities including text and images. All models, code, and resources are made publicly available.

Conclusion: Doctor Sun addresses critical gaps in medical multimodal AI by providing a specialized model that better understands medical concepts and text-image relationships, while also contributing to the research community through open-source resources and datasets.

Abstract: Large multimodal models (LMMs) have demonstrated significant potential in providing innovative solutions for various biomedical tasks, including pathology analysis, radiology report generation, and biomedical assistance. However, the existing multimodal biomedical AI is typically based on foundation LLMs, thus hindering the understanding of intricate medical concepts with limited medical training data. Moreover, recent LLaVA-induced medical LMMs struggle to effectively capture the intricate relationship between the texts and the images. Therefore, we introduce Doctor Sun, a large multimodal generative model specialized in medicine, developed to encode, integrate, and interpret diverse biomedical data modalities such as text and images. In particular, Doctor Sun integrates a pre-trained vision encoder with a medical LLM and conducts two-stage training on various medical datasets, focusing on feature alignment and instruction tuning. Moreover, we release SunMed-VL, a wide-range bilingual medical multimodal dataset, along with all associated models, code, and resources, to freely support the advancement of biomedical multimodal research.

[468] On the Existence and Behaviour of Secondary Attention Sinks

Jeffrey T. H. Wong, Cheng Zhang, Louis Mahon, Wayne Luk, Anton Isopoussu, Yiren Zhao

Main category: cs.LG

TL;DR: The paper identifies “secondary attention sinks” that differ from previously studied primary sinks, appearing in middle layers with variable persistence and drawing moderate attention mass, formed by specific MLP modules that align with primary sink directions.

Details

Motivation: To understand the existence and properties of secondary attention sinks that differ fundamentally from primary sinks (like BOS tokens), which have been the focus of prior research on attention mechanisms in transformers.

Method: Conducted extensive experiments across 11 model families, analyzing where secondary sinks appear, their properties, formation mechanisms, and impact on attention. Investigated MLP modules’ role in creating these sinks and their alignment with primary sink directions.

Result: Found that secondary sinks: (1) are formed by specific middle-layer MLPs that map tokens to vectors aligned with primary sink directions; (2) their L2-norm determines sink score and persistence duration; (3) primary sinks weaken in middle layers coinciding with secondary sink emergence; (4) larger models show more deterministic sink patterns with multiple “sink levels” (3 in QwQ-32B, 6 in Qwen3-14B).

Conclusion: Secondary attention sinks represent a distinct class of attention phenomena that emerge in middle transformer layers through specific MLP mechanisms, with systematic patterns that become more deterministic in larger-scale models, revealing new insights about attention dynamics beyond primary sinks.

Abstract: Attention sinks are tokens, often the beginning-of-sequence (BOS) token, that receive disproportionately high attention despite limited semantic relevance. In this work, we identify a class of attention sinks, which we term secondary sinks, that differ fundamentally from the sinks studied in prior works, which we term primary sinks. While prior works have identified that tokens other than BOS can sometimes become sinks, they were found to exhibit properties analogous to the BOS token. Specifically, they emerge at the same layer, persist throughout the network and draw a large amount of attention mass. Whereas, we find the existence of secondary sinks that arise primarily in middle layers and can persist for a variable number of layers, and draw a smaller, but still significant, amount of attention mass. Through extensive experiments across 11 model families, we analyze where these secondary sinks appear, their properties, how they are formed, and their impact on the attention mechanism. Specifically, we show that: (1) these sinks are formed by specific middle-layer MLP modules; these MLPs map token representations to vectors that align with the direction of the primary sink of that layer. (2) The $\ell_2$-norm of these vectors determines the sink score of the secondary sink, and also the number of layers it lasts for, thereby leading to different impacts on the attention mechanisms accordingly. (3) The primary sink weakens in middle layers, coinciding with the emergence of secondary sinks. We observe that in larger-scale models, the location and lifetime of the sinks, together referred to as sink levels, appear in a more deterministic and frequent manner. Specifically, we identify three sink levels in QwQ-32B and six levels in Qwen3-14B.

[469] Interpretable and Adaptive Node Classification on Heterophilic Graphs via Combinatorial Scoring and Hybrid Learning

Soroush Vahidi

Main category: cs.LG

TL;DR: The paper proposes an interpretable, adaptive framework for semi-supervised node classification that uses combinatorial inference instead of deep message passing, with a hybrid neural refinement option for improved performance on both homophilic and heterophilic graphs.

Details

Motivation: Graph neural networks (GNNs) perform well on homophilic graphs but struggle with heterophily (where adjacent nodes belong to different classes). There's a need for more interpretable and adaptive methods that can handle both regimes without relying on deep message passing architectures.

Method: An interpretable combinatorial inference framework using a confidence-ordered greedy procedure driven by an additive scoring function that integrates class priors, neighborhood statistics, feature similarity, and training-derived label-label compatibility. Includes a validation-gated hybrid strategy where combinatorial predictions can be injected as priors into a lightweight neural model only when it improves validation performance.

Result: The method demonstrates competitive performance with modern GNNs on heterophilic and transitional benchmarks while offering advantages in interpretability, tunability, and computational efficiency.

Conclusion: The proposed framework provides an effective alternative to deep GNNs for node classification, particularly for heterophilic graphs, with the benefits of interpretability, adaptability, and efficiency while maintaining competitive accuracy.

Abstract: Graph neural networks (GNNs) achieve strong performance on homophilic graphs but often struggle under heterophily, where adjacent nodes frequently belong to different classes. We propose an interpretable and adaptive framework for semi-supervised node classification based on explicit combinatorial inference rather than deep message passing. Our method assigns labels using a confidence-ordered greedy procedure driven by an additive scoring function that integrates class priors, neighborhood statistics, feature similarity, and training-derived label-label compatibility. A small set of transparent hyperparameters controls the relative influence of these components, enabling smooth adaptation between homophilic and heterophilic regimes. We further introduce a validation-gated hybrid strategy in which combinatorial predictions are optionally injected as priors into a lightweight neural model. Hybrid refinement is applied only when it improves validation performance, preserving interpretability when neuralization is unnecessary. All adaptation signals are computed strictly from training data, ensuring a leakage-free evaluation protocol. Experiments on heterophilic and transitional benchmarks demonstrate competitive performance with modern GNNs while offering advantages in interpretability, tunability, and computational efficiency.

[470] Müntz-Szász Networks: Neural Architectures with Learnable Power-Law Bases

Gnankan Landry Regis N’guessan

Main category: cs.LG

TL;DR: MSN replaces fixed activation functions with learnable fractional power bases, achieving superior approximation for singular functions common in physics.

Details

Motivation: Standard neural networks with fixed activation functions (ReLU, tanh, sigmoid) are poorly suited for approximating functions with singular or fractional power behavior that arises ubiquitously in physics (boundary layers, fracture mechanics, corner singularities).

Method: Introduces Müntz-Szász Networks (MSN) that replace fixed smooth activations with learnable fractional power bases: φ(x) = Σ a_k |x|^{μ_k} + Σ b_k sign(x)|x|^{λ_k}, where exponents {μ_k, λ_k} are learned alongside coefficients.

Result: MSN achieves 5-8x lower error than MLPs with 10x fewer parameters on singular target functions. On PINN benchmarks (singular ODE and stiff boundary-layer problems), MSN achieves 3-6x improvement while learning interpretable exponents matching known solution structure.

Conclusion: Theory-guided architectural design can yield dramatic improvements for scientifically-motivated function classes, with MSN inheriting universal approximation from Müntz-Szász theorem and establishing superior approximation rates for singular functions.

Abstract: Standard neural network architectures employ fixed activation functions (ReLU, tanh, sigmoid) that are poorly suited for approximating functions with singular or fractional power behavior, a structure that arises ubiquitously in physics, including boundary layers, fracture mechanics, and corner singularities. We introduce Müntz-Szász Networks (MSN), a novel architecture that replaces fixed smooth activations with learnable fractional power bases grounded in classical approximation theory. Each MSN edge computes $φ(x) = \sum_k a_k |x|^{μ_k} + \sum_k b_k \mathrm{sign}(x)|x|^{λ_k}$, where the exponents ${μ_k, λ_k}$ are learned alongside the coefficients. We prove that MSN inherits universal approximation from the Müntz-Szász theorem and establish novel approximation rates: for functions of the form $|x|^α$, MSN achieves error $\mathcal{O}(|μ- α|^2)$ with a single learned exponent, whereas standard MLPs require $\mathcal{O}(ε^{-1/α})$ neurons for comparable accuracy. On supervised regression with singular target functions, MSN achieves 5-8x lower error than MLPs with 10x fewer parameters. Physics-informed neural networks (PINNs) represent a particularly demanding application for singular function approximation; on PINN benchmarks including a singular ODE and stiff boundary-layer problems, MSN achieves 3-6x improvement while learning interpretable exponents that match the known solution structure. Our results demonstrate that theory-guided architectural design can yield dramatic improvements for scientifically-motivated function classes.

[471] ReGAIN: Retrieval-Grounded AI Framework for Network Traffic Analysis

Shaghayegh Shajarian, Kennedy Marsh, James Benson, Sajad Khorsandroo, Mahmoud Abdelsalam

Main category: cs.LG

TL;DR: ReGAIN is a multi-stage framework combining traffic summarization, RAG, and LLM reasoning for transparent network traffic analysis, achieving 95.95-98.82% accuracy on real-world attack traces while providing explainable results.

Details

Motivation: Traditional network traffic analysis systems suffer from high false positives and lack interpretability, limiting analyst trust. There's a need for transparent and accurate analysis of vast, heterogeneous network traffic for security and performance monitoring.

Method: ReGAIN uses a multi-stage framework: 1) Creates natural-language summaries from network traffic, 2) Embeds them into a multi-collection vector database, 3) Uses hierarchical retrieval pipeline with metadata filtering, MMR sampling, two-stage cross-encoder reranking, and abstention mechanism to ground LLM responses with evidence citations.

Result: Achieves robust performance with accuracy between 95.95% and 98.82% across different attack types (ICMP ping flood and TCP SYN flood) on real-world traffic dataset. Validated against dataset ground truth and human expert assessments. Outperforms rule-based, classical ML, and deep learning baselines.

Conclusion: ReGAIN provides transparent and accurate network traffic analysis with unique explainability through trustworthy, verifiable responses, addressing limitations of traditional systems while maintaining high accuracy.

Abstract: Modern networks generate vast, heterogeneous traffic that must be continuously analyzed for security and performance. Traditional network traffic analysis systems, whether rule-based or machine learning-driven, often suffer from high false positives and lack interpretability, limiting analyst trust. In this paper, we present ReGAIN, a multi-stage framework that combines traffic summarization, retrieval-augmented generation (RAG), and Large Language Model (LLM) reasoning for transparent and accurate network traffic analysis. ReGAIN creates natural-language summaries from network traffic, embeds them into a multi-collection vector database, and utilizes a hierarchical retrieval pipeline to ground LLM responses with evidence citations. The pipeline features metadata-based filtering, MMR sampling, a two-stage cross-encoder reranking mechanism, and an abstention mechanism to reduce hallucinations and ensure grounded reasoning. Evaluated on ICMP ping flood and TCP SYN flood traces from the real-world traffic dataset, it demonstrates robust performance, achieving accuracy between 95.95% and 98.82% across different attack types and evaluation benchmarks. These results are validated against two complementary sources: dataset ground truth and human expert assessments. ReGAIN also outperforms rule-based, classical ML, and deep learning baselines while providing unique explainability through trustworthy, verifiable responses.

[472] DiRL: An Efficient Post-Training Framework for Diffusion Language Models

Ying Zhu, Jiaxin Wan, Xiaoran Liu, Siyanag He, Qiqi Wang, Xu Guo, Tianyi Liang, Zengfeng Huang, Ziwei He, Xipeng Qiu

Main category: cs.LG

TL;DR: DiRL is an efficient post-training framework for Diffusion Language Models that integrates accelerated training with optimized inference, enabling effective fine-tuning for complex reasoning tasks like mathematics.

Details

Motivation: Existing post-training methods for Diffusion Language Models suffer from computational inefficiency and objective mismatches between training and inference, limiting performance on complex reasoning tasks.

Method: DiRL integrates FlexAttention-accelerated blockwise training with LMDeploy-optimized inference for efficient online model updates, enabling two-stage post-training (Supervised Fine-Tuning + Reinforcement Learning). DiPO provides unbiased Group Relative Policy Optimization tailored for dLLMs.

Result: DiRL-8B-Instruct achieves state-of-the-art math performance among dLLMs and surpasses comparable Qwen2.5 series models on several benchmarks.

Conclusion: DiRL provides an effective post-training framework for Diffusion Language Models that addresses computational inefficiency and objective mismatch issues, enabling strong performance on complex reasoning tasks.

Abstract: Diffusion Language Models (dLLMs) have emerged as promising alternatives to Auto-Regressive (AR) models. While recent efforts have validated their pre-training potential and accelerated inference speeds, the post-training landscape for dLLMs remains underdeveloped. Existing methods suffer from computational inefficiency and objective mismatches between training and inference, severely limiting performance on complex reasoning tasks such as mathematics. To address this, we introduce DiRL, an efficient post-training framework that tightly integrates FlexAttention-accelerated blockwise training with LMDeploy-optimized inference. This architecture enables a streamlined online model update loop, facilitating efficient two-stage post-training (Supervised Fine-Tuning followed by Reinforcement Learning). Building on this framework, we propose DiPO, the first unbiased Group Relative Policy Optimization (GRPO) implementation tailored for dLLMs. We validate our approach by training DiRL-8B-Instruct on high-quality math data. Our model achieves state-of-the-art math performance among dLLMs and surpasses comparable models in the Qwen2.5 series on several benchmarks.

[473] Masking Teacher and Reinforcing Student for Distilling Vision-Language Models

Byung-Kwan Lee, Yu-Chiang Frank Wang, Ryo Hachiuma

Main category: cs.LG

TL;DR: Masters is a mask-progressive reinforcement learning framework for distilling knowledge from large vision-language teachers to compact student models by masking non-dominant teacher weights and using offline RL with dual rewards.

Details

Motivation: Large VLMs are impractical for mobile/edge deployment due to size, but distilling knowledge from large teachers to small students is challenging due to the size gap causing unstable learning and degraded performance.

Method: Masters uses mask-progressive RL distillation: 1) masks non-dominant teacher weights to reduce complexity, 2) progressively restores teacher capacity during training, 3) offline RL stage with accuracy and distillation rewards using pre-generated responses from masked teachers.

Result: The method enables students to learn richer representations smoothly and achieve strong performance without requiring computationally expensive think-answer processes.

Conclusion: Masters provides an efficient framework for distilling knowledge from large VLMs to compact models, addressing the size gap challenge through progressive masking and offline reinforcement learning.

Abstract: Large-scale vision-language models (VLMs) have recently achieved remarkable multimodal understanding, but their massive size makes them impractical for deployment on mobile or edge devices. This raises the need for compact yet capable VLMs that can efficiently learn from powerful large teachers. However, distilling knowledge from a large teacher to a small student remains challenging due to their large size gap: the student often fails to reproduce the teacher’s complex, high-dimensional representations, leading to unstable learning and degraded performance. To address this, we propose Masters (Masking Teacher and Reinforcing Student), a mask-progressive reinforcement learning (RL) distillation framework. Masters first masks non-dominant weights of the teacher to reduce unnecessary complexity, then progressively restores the teacher by gradually increasing its capacity during training. This strategy allows the student to learn richer representations from the teacher in a smooth and stable manner. To further refine knowledge transfer, Masters integrates an offline RL stage with two complementary rewards: an accuracy reward that measures the correctness of the generated responses, and a distillation reward that quantifies the ease of transferring responses from teacher to student. Unlike online think-answer RL paradigms that are computationally expensive and generate lengthy responses, our offline RL leverages pre-generated responses from masked teachers. These provide rich yet efficient guidance, enabling students to achieve strong performance without requiring the think-answer process.

[474] KernelEvolve: Scaling Agentic Kernel Coding for Heterogeneous AI Accelerators at Meta

Gang Liao, Hongsen Qin, Ying Wang, Alicia Golden, Michael Kuchnik, Yavuz Yetim, Jia Jiunn Ang, Chunli Fu, Yihan He, Samuel Hsia, Zewei Jiang, Dianshi Li, Uladzimir Pashkevich, Varna Puvvada, Feng Shi, Matt Steiner, Ruichao Xiao, Nathan Yan, Xiayu Yu, Zhou Fang, Abdul Zainul-Abedin, Ketan Singh, Hongtao Yu, Wenyuan Chi, Barney Huang, Sean Zhang, Noah Weller, Zach Marine, Wyatt Cook, Carole-Jean Wu, Gaoxiang Liu

Main category: cs.LG

TL;DR: KernelEvolve is an agentic kernel coding framework that automates kernel generation and optimization for DLRM across heterogeneous hardware, reducing development time from weeks to hours while achieving substantial performance improvements.

Details

Motivation: Deep learning recommendation models face three key system challenges: model architecture diversity, kernel primitive diversity, and hardware heterogeneity. These challenges make DLRM training and inference optimization difficult across different hardware platforms.

Method: KernelEvolve uses an agentic framework that takes kernel specifications as input and automates kernel generation/optimization across heterogeneous hardware. It operates at multiple programming abstractions (Triton, CuTe DSL to low-level hardware-agnostic languages) and uses graph-based search with selection policy, universal operator, fitness function, and termination rule, enhanced by retrieval-augmented prompt synthesis.

Result: Achieved 100% pass rate on all 250 KernelBench problems across three difficulty levels, and 160 PyTorch ATen operators across three heterogeneous hardware platforms with 100% correctness. Reduced development time from weeks to hours and achieved substantial performance improvements over PyTorch baselines across production use cases.

Conclusion: KernelEvolve successfully addresses DLRM heterogeneity challenges at scale, significantly improves performance efficiency, and reduces programmability barriers for new AI hardware by enabling automated kernel generation for in-house developed accelerators.

Abstract: Making deep learning recommendation model (DLRM) training and inference fast and efficient is important. However, this presents three key system challenges - model architecture diversity, kernel primitive diversity, and hardware generation and architecture heterogeneity. This paper presents KernelEvolve-an agentic kernel coding framework-to tackle heterogeneity at-scale for DLRM. KernelEvolve is designed to take kernel specifications as input and automate the process of kernel generation and optimization for recommendation model across heterogeneous hardware architectures. KernelEvolve does so by operating at multiple programming abstractions, from Triton and CuTe DSL to low-level hardware agnostic languages, spanning the full hardware-software optimization stack. The kernel optimization process is described as graph-based search with selection policy, universal operator, fitness function, and termination rule, dynamically adapts to runtime execution context through retrieval-augmented prompt synthesis. We designed, implemented, and deployed KernelEvolve to optimize a wide variety of production recommendation models across generations of NVIDIA and AMD GPUs, as well as Meta’s AI accelerators. We validate KernelEvolve on the publicly-available KernelBench suite, achieving 100% pass rate on all 250 problems across three difficulty levels, and 160 PyTorch ATen operators across three heterogeneous hardware platforms, demonstrating 100% correctness. KernelEvolve reduces development time from weeks to hours and achieves substantial performance improvements over PyTorch baselines across diverse production use cases and for heterogeneous AI systems at-scale. Beyond performance efficiency improvements, KernelEvolve significantly mitigates the programmability barrier for new AI hardware by enabling automated kernel generation for in-house developed AI hardware.

[475] Graph Neural Networks with Transformer Fusion of Brain Connectivity Dynamics and Tabular Data for Forecasting Future Tobacco Use

Runzhi Zhou, Xi Luo

Main category: cs.LG

TL;DR: GNN-TF model integrates non-Euclidean brain imaging data with Euclidean tabular data for forecasting future outcomes in longitudinal studies, outperforming existing methods in predicting tobacco usage.

Details

Motivation: Integrating non-Euclidean brain imaging data (like dynamic brain connectivity) with Euclidean tabular data (clinical/demographic information) is challenging for medical imaging analysis, especially for forecasting future outcomes in longitudinal studies.

Method: Time-aware graph neural network model with transformer fusion (GNN-TF) that flexibly integrates both tabular data and dynamic brain connectivity data, leveraging temporal order within a coherent framework.

Result: GNN-TF outperforms various established machine learning and deep learning models, delivering superior predictive accuracy for predicting future tobacco usage using longitudinal resting-state fMRI data from NCANDA.

Conclusion: The end-to-end, time-aware transformer fusion structure successfully integrates multiple data modalities and leverages temporal dynamics, making GNN-TF a valuable analytic tool for functional brain imaging studies focused on clinical outcome prediction.

Abstract: Integrating non-Euclidean brain imaging data with Euclidean tabular data, such as clinical and demographic information, poses a substantial challenge for medical imaging analysis, particularly in forecasting future outcomes. While machine learning and deep learning techniques have been applied successfully to cross-sectional classification and prediction tasks, effectively forecasting outcomes in longitudinal imaging studies remains challenging. To address this challenge, we introduce a time-aware graph neural network model with transformer fusion (GNN-TF). This model flexibly integrates both tabular data and dynamic brain connectivity data, leveraging the temporal order of these variables within a coherent framework. By incorporating non-Euclidean and Euclidean sources of information from a longitudinal resting-state fMRI dataset from the National Consortium on Alcohol and Neurodevelopment in Adolescence (NCANDA), the GNN-TF enables a comprehensive analysis that captures critical aspects of longitudinal imaging data. Comparative analyses against a variety of established machine learning and deep learning models demonstrate that GNN-TF outperforms these state-of-the-art methods, delivering superior predictive accuracy for predicting future tobacco usage. The end-to-end, time-aware transformer fusion structure of the proposed GNN-TF model successfully integrates multiple data modalities and leverages temporal dynamics, making it a valuable analytic tool for functional brain imaging studies focused on clinical outcome prediction.

[476] EvoXplain: When Machine Learning Models Agree on Predictions but Disagree on Why – Measuring Mechanistic Multiplicity Across Training Runs

Chama Bensmail

Main category: cs.LG

TL;DR: EvoXplain reveals that high-accuracy ML models can achieve the same predictions through different internal mechanisms, showing explanatory instability even in supposedly stable models like Logistic Regression.

Details

Motivation: The paper challenges the assumption that high predictive accuracy implies correct and trustworthy explanations, questioning whether models achieving similar accuracy rely on the same internal logic or different competing mechanisms.

Method: EvoXplain treats explanations as samples from stochastic optimization processes across repeated training runs, analyzing whether they form coherent explanations or separate into multiple distinct explanatory modes without aggregating predictions or constructing ensembles.

Result: On Breast Cancer and COMPAS datasets with Logistic Regression and Random Forests, explanations frequently exhibit clear multimodality despite high accuracy. Even Logistic Regression produces multiple well-separated explanatory basins under repeated training on the same data split.

Conclusion: EvoXplain reframes interpretability as a property of model classes under repeated instantiation rather than single trained models, making explanatory instability visible and quantifiable when single-instance or averaged explanations obscure multiple underlying mechanisms.

Abstract: Machine learning models are primarily judged by predictive performance, especially in applied settings. Once a model reaches high accuracy, its explanation is often assumed to be correct and trustworthy. However, this assumption raises an overlooked question: when two models achieve high accuracy, do they rely on the same internal logic, or do they reach the same outcome via different – and potentially competing – mechanisms? We introduce EvoXplain, a diagnostic framework that measures the stability of model explanations across repeated training. Rather than analysing a single trained model, EvoXplain treats explanations as samples drawn from the stochastic optimisation process itself – without aggregating predictions or constructing ensembles – and examines whether these samples form a single coherent explanation or separate into multiple, distinct explanatory modes. We evaluate EvoXplain on the Breast Cancer and COMPAS datasets using two widely deployed model classes: Logistic Regression and Random Forests. Although all models achieve high predictive accuracy, their explanations frequently exhibit clear multimodality. Even models commonly assumed to be stable, such as Logistic Regression, can produce multiple well-separated explanatory basins under repeated training on the same data split. These differences are not explained by hyperparameter variation or simple performance trade-offs. EvoXplain does not attempt to select a ‘correct’ explanation. Instead, it makes explanatory instability visible and quantifiable, revealing when single-instance or averaged explanations obscure the existence of multiple underlying mechanisms. More broadly, EvoXplain reframes interpretability as a property of a model class under repeated instantiation, rather than of any single trained model.

[477] The Law of Multi-Model Collaboration: Scaling Limits of Model Ensembling for Large Language Models

Dakuan Lu, Jiaqi Zhang, Cheng Yuan, Jiawei Shao, Chi Zhang, Xuelong Li

Main category: cs.LG

TL;DR: Multi-model LLM collaboration follows power-law scaling with total parameter count, achieving better performance than single models and benefiting from model diversity.

Details

Motivation: While single LLMs have scaling limits, multi-model collaboration can surpass individual capabilities, but lacks a theoretical framework to understand performance scaling in ensembles.

Method: Proposes Law of Multi-model Collaboration using method-agnostic formulation with idealized integration oracle, where loss is determined by minimum loss from any model in the pool.

Result: Multi-model systems follow power-law scaling with total parameters, show greater improvement and lower loss floor than single models, with heterogeneous ensembles outperforming homogeneous ones.

Conclusion: Model collaboration represents a critical axis for extending LLM intelligence frontier, with model diversity being a primary driver of collaboration gains.

Abstract: Recent advances in large language models (LLMs) have been largely driven by scaling laws for individual models, which predict performance improvements as model parameters and data volume increase. However, the capabilities of any single LLM are inherently bounded. One solution originates from intricate interactions among multiple LLMs, rendering their collective performance surpasses that of any constituent model. Despite the rapid proliferation of multi-model integration techniques such as model routing and post-hoc ensembling, a unifying theoretical framework of performance scaling for multi-model collaboration remains absent. In this work, we propose the Law of Multi-model Collaboration, a scaling law that predicts the performance limits of LLM ensembles based on their aggregated parameter budget. To quantify the intrinsic upper bound of multi-model collaboration, we adopt a method-agnostic formulation and assume an idealized integration oracle where the total cross-entropy loss of each sample is determined by the minimum loss of any model in the model pool. Experimental results reveal that multi-model systems follow a power-law scaling with respect to the total parameter count, exhibiting a more significant improvement trend and a lower theoretical loss floor compared to single model scaling. Moreover, ensembles of heterogeneous model families achieve better performance scaling than those formed within a single model family, indicating that model diversity is a primary driver of collaboration gains. These findings suggest that model collaboration represents a critical axis for extending the intelligence frontier of LLMs.

[478] Enhanced geometry prediction in laser directed energy deposition using meta-learning

Abdul Malik Al Mardhouf Al Saadi, Amrita Basak

Main category: cs.LG

TL;DR: Meta-learning approach (MAML & Reptile) enables accurate bead geometry prediction in laser-directed energy deposition with minimal data by transferring knowledge across heterogeneous experimental datasets.

Details

Motivation: Accurate bead geometry prediction in L-DED is challenging due to limited and heterogeneous experimental data from different materials, machine configurations, and process parameters.

Method: Proposed cross-dataset knowledge transfer model using gradient-based meta-learning algorithms (MAML and Reptile) that enable rapid adaptation to new deposition conditions with limited data, evaluated across powder-fed, wire-fed, and hybrid wire-powder L-DED processes.

Result: Both MAML and Reptile achieve accurate bead height predictions on unseen target tasks using only 3-9 training examples, outperforming conventional neural networks. Achieved R² up to ~0.9 and MAE between 0.03-0.08 mm across multiple target tasks.

Conclusion: Meta-learning enables effective knowledge transfer across heterogeneous L-DED settings, providing accurate predictions with minimal data, overcoming data scarcity challenges in additive manufacturing.

Abstract: Accurate bead geometry prediction in laser-directed energy deposition (L-DED) is often hindered by the scarcity and heterogeneity of experimental datasets collected under different materials, machine configurations, and process parameters. To address this challenge, a cross-dataset knowledge transfer model based on meta-learning for predicting deposited track geometry in L-DED is proposed. Specifically, two gradient-based meta-learning algorithms, i.e., Model-Agnostic Meta-Learning (MAML) and Reptile, are investigated to enable rapid adaptation to new deposition conditions with limited data. The proposed framework is performed using multiple experimental datasets compiled from peer-reviewed literature and in-house experiments and evaluated across powder-fed, wire-fed, and hybrid wire-powder L-DED processes. Results show that both MAML and Reptile achieve accurate bead height predictions on unseen target tasks using as few as three to nine training examples, consistently outperforming conventional feedforward neural networks trained under comparable data constraints. Across multiple target tasks representing different printing conditions, the meta-learning models achieve strong generalization performance, with R-squared values reaching up to approximately 0.9 and mean absolute errors between 0.03-0.08 mm, demonstrating effective knowledge transfer across heterogeneous L-DED settings.

[479] Predicting Mycotoxin Contamination in Irish Oats Using Deep and Transfer Learning

Alan Inglis, Fiona Doohan, Subramani Natarajan, Breige McNulty, Chris Elliott, Anne Nugent, Julie Meneely, Brett Greer, Stephen Kildea, Diana Bucur, Martin Danaher, Melissa Di Rocco, Lisa Black, Adam Gauley, Naoise McKenna, Andrew Parnell

Main category: cs.LG

TL;DR: Neural networks and transfer learning models were used to predict mycotoxin contamination in Irish oat crops, with TabPFN performing best overall, and weather patterns in the 90-day pre-harvest period identified as the most important predictors.

Details

Motivation: Mycotoxin contamination poses significant risks to cereal crop quality, food safety, and agricultural productivity. Accurate prediction can enable early intervention strategies and reduce economic losses.

Method: Used neural networks and transfer learning models (MLP baseline, MLP with pre-training, TabPFN, TabNet, FT-Transformer) to predict mycotoxin contamination as a multi-response task using environmental, agronomic, and geographical data from Irish oat samples.

Result: TabPFN transfer learning model provided the overall best performance, followed by baseline MLP. Weather history patterns in the 90-day pre-harvest period and seed moisture content were identified as the most influential predictors through permutation-based variable importance analysis.

Conclusion: Transfer learning approaches, particularly TabPFN, show promise for mycotoxin prediction in oat crops, with weather patterns during the pre-harvest period being critical predictive factors for contamination risk assessment.

Abstract: Mycotoxin contamination poses a significant risk to cereal crop quality, food safety, and agricultural productivity. Accurate prediction of mycotoxin levels can support early intervention strategies and reduce economic losses. This study investigates the use of neural networks and transfer learning models to predict mycotoxin contamination in Irish oat crops as a multi-response prediction task. Our dataset comprises oat samples collected in Ireland, containing a mix of environmental, agronomic, and geographical predictors. Five modelling approaches were evaluated: a baseline multilayer perceptron (MLP), an MLP with pre-training, and three transfer learning models; TabPFN, TabNet, and FT-Transformer. Model performance was evaluated using regression (RMSE, $R^2$) and classification (AUC, F1) metrics, with results reported per toxin and on average. Additionally, permutation-based variable importance analysis was conducted to identify the most influential predictors across both prediction tasks. The transfer learning approach TabPFN provided the overall best performance, followed by the baseline MLP. Our variable importance analysis revealed that weather history patterns in the 90-day pre-harvest period were the most important predictors, alongside seed moisture content.

[480] Calibrating LLM Judges: Linear Probes for Fast and Reliable Uncertainty Estimation

Bhaktipriya Radharapu, Eshika Saxena, Kenneth Li, Chenxi Whitehouse, Adina Williams, Nicola Cancedda

Main category: cs.LG

TL;DR: Linear probes trained with Brier score loss provide calibrated uncertainty estimates from LLM judges’ hidden states, offering 10x computational savings and better calibration than existing methods.

Details

Motivation: LLM-based judges are increasingly used in industry applications, but existing uncertainty estimation methods (verbalized confidence and multi-generation approaches) are either poorly calibrated or computationally expensive, creating a need for efficient, well-calibrated uncertainty estimates for production deployment.

Method: Introduce linear probes trained with a Brier score-based loss to extract calibrated uncertainty estimates directly from reasoning judges’ hidden states, requiring no additional model training. The approach is evaluated on both objective tasks (reasoning, mathematics, factuality, coding) and subjective human preference judgments.

Result: Probes achieve superior calibration compared to existing methods with approximately 10x computational savings, generalize robustly to unseen evaluation domains, and deliver higher accuracy on high-confidence predictions. However, they produce conservative estimates that underperform on easier datasets but may benefit safety-critical deployments prioritizing low false-positive rates.

Conclusion: Interpretability-based uncertainty estimation using linear probes provides a practical and scalable plug-and-play solution for LLM judges in production, offering efficient, well-calibrated uncertainty estimates without additional model training.

Abstract: As LLM-based judges become integral to industry applications, obtaining well-calibrated uncertainty estimates efficiently has become critical for production deployment. However, existing techniques, such as verbalized confidence and multi-generation methods, are often either poorly calibrated or computationally expensive. We introduce linear probes trained with a Brier score-based loss to provide calibrated uncertainty estimates from reasoning judges’ hidden states, requiring no additional model training. We evaluate our approach on both objective tasks (reasoning, mathematics, factuality, coding) and subjective human preference judgments. Our results demonstrate that probes achieve superior calibration compared to existing methods with $\approx10$x computational savings, generalize robustly to unseen evaluation domains, and deliver higher accuracy on high-confidence predictions. However, probes produce conservative estimates that underperform on easier datasets but may benefit safety-critical deployments prioritizing low false-positive rates. Overall, our work demonstrates that interpretability-based uncertainty estimation provides a practical and scalable plug-and-play solution for LLM judges in production.

[481] The Affine Divergence: Aligning Activation Updates Beyond Normalisation

George Bird

Main category: cs.LG

TL;DR: The paper identifies a mismatch between ideal and actual activation updates during gradient descent, proposes normalization as a solution from first principles, introduces new normalization methods including PatchNorm, and reframes normalizers as activation-function-like maps to prioritize representations.

Details

Motivation: There's a systematic mismatch between mathematically ideal and effective activation updates during gradient descent. Activations are more directly impactful for optimization since they're closer to the loss in computational graphs and carry sample-dependent information, but their updates don't take optimal steepest-descent steps due to non-ideal sample-wise scaling across various layer types.

Method: The paper proposes correcting activation update scaling through normalization derived from first principles. It introduces a new normalization approach functionally distinct from modern normalizations (without scale-invariance) and presents “PatchNorm” for convolutional layers - a compositionally inseparable normalizer. Normalizers are reframed as activation-function-like maps with parameterized scaling.

Result: The proposed alternative normalization method outperforms conventional normalizers across several tests. The analysis provides a fresh conceptual reframe of normalization’s action, supported by auxiliary experiments. The approach yields new functions that are empirically validated and raises questions about the affine + nonlinear approach to model creation.

Conclusion: This work offers a theoretical-principled approach to normalization that yields new empirically validated functions, provides an alternative mechanistic framework for understanding normalization, and suggests normalizers should be decomposed into activation-function-like maps with parameterized scaling to better prioritize representations during optimization.

Abstract: A systematic mismatch exists between mathematically ideal and effective activation updates during gradient descent. As intended, parameters update in their direction of steepest descent. However, activations are argued to constitute a more directly impactful quantity to prioritise in optimisation, as they are closer to the loss in the computational graph and carry sample-dependent information through the network. Yet their propagated updates do not take the optimal steepest-descent step. These quantities exhibit non-ideal sample-wise scaling across affine, convolutional, and attention layers. Solutions to correct for this are trivial and, entirely incidentally, derive normalisation from first principles despite motivational independence. Consequently, such considerations offer a fresh and conceptual reframe of normalisation’s action, with auxiliary experiments bolstering this mechanistically. Moreover, this analysis makes clear a second possibility: a solution that is functionally distinct from modern normalisations, without scale-invariance, yet remains empirically successful, outperforming conventional normalisers across several tests. This is presented as an alternative to the affine map. This generalises to convolution via a new functional form, “PatchNorm”, a compositionally inseparable normaliser. Together, these provide an alternative mechanistic framework that adds to, and counters some of, the discussion of normalisation. Further, it is argued that normalisers are better decomposed into activation-function-like maps with parameterised scaling, thereby aiding the prioritisation of representations during optimisation. Overall, this constitutes a theoretical-principled approach that yields several new functions that are empirically validated and raises questions about the affine + nonlinear approach to model creation.

[482] Amortized Inference for Model Rocket Aerodynamics: Learning to Estimate Physical Parameters from Simulation

Rohit Pandey, Rohan Pandey

Main category: cs.LG

TL;DR: A simulation-based amortized inference approach predicts rocket aerodynamic parameters from synthetic flight data, achieving accurate apogee predictions on real flights without real-world training data.

Details

Motivation: Traditional methods for predicting rocket flight performance require expensive CFD simulations or extensive real flight data collection, which is costly and time-consuming for amateur rocketry.

Method: Train a neural network on 10,000 synthetic flights generated from a physics simulator to learn inverse mapping from apogee measurements to aerodynamic parameters (drag coefficient and thrust correction factor), then apply directly to real flights without fine-tuning.

Result: Achieved 12.3 m mean absolute error in apogee prediction on 8 real flights, outperforming OpenRocket baseline predictions, demonstrating successful sim-to-real transfer with zero real training examples.

Conclusion: The simulation-based amortized inference approach enables accurate aerodynamic parameter estimation from limited real flight data, providing quantitative insights into physics-reality gaps while supporting amateur rocketry community with publicly available implementation.

Abstract: Accurate prediction of model rocket flight performance requires estimating aerodynamic parameters that are difficult to measure directly. Traditional approaches rely on computational fluid dynamics or empirical correlations, while data-driven methods require extensive real flight data that is expensive and time-consuming to collect. We present a simulation-based amortized inference approach that trains a neural network on synthetic flight data generated from a physics simulator, then applies the learned model to real flights without any fine-tuning. Our method learns to invert the forward physics model, directly predicting drag coefficient and thrust correction factor from a single apogee measurement combined with motor and configuration features. In this proof-of-concept study, we train on 10,000 synthetic flights and evaluate on 8 real flights, achieving a mean absolute error of 12.3 m in apogee prediction - demonstrating promising sim-to-real transfer with zero real training examples. Analysis reveals a systematic positive bias in predictions, providing quantitative insight into the gap between idealized physics and real-world flight conditions. We additionally compare against OpenRocket baseline predictions, showing that our learned approach reduces apogee prediction error. Our implementation is publicly available to support reproducibility and adoption in the amateur rocketry community.

[483] Temporal Visual Semantics-Induced Human Motion Understanding with Large Language Models

Zheng Xing, Weibing Zhao

Main category: cs.LG

TL;DR: This paper proposes a novel unsupervised human motion segmentation method that integrates temporal vision semantics from LLM into subspace clustering, outperforming state-of-the-art approaches on four benchmark datasets.

Details

Motivation: Traditional subspace clustering methods for human motion segmentation overlook temporal semantic exploration. The authors aim to leverage temporal vision semantics derived from motion sequences using LLM's image-to-text capabilities to enhance clustering performance.

Method: 1) Extract textual motion information from consecutive frames using LLM to determine if they depict the same motion; 2) Learn temporal neighboring information from LLM responses; 3) Develop TVS-integrated subspace clustering with temporal regularizer; 4) Implement segmentation with temporal constraints; 5) Introduce feedback-enabled framework for continuous optimization.

Result: The proposed method outperforms existing state-of-the-art approaches on four benchmark human motion datasets, demonstrating the effectiveness of incorporating temporal vision semantics from LLM into subspace clustering.

Conclusion: Integrating temporal vision semantics extracted via LLM into subspace clustering significantly improves human motion segmentation performance, showing the value of leveraging semantic temporal information in unsupervised motion analysis.

Abstract: Unsupervised human motion segmentation (HMS) can be effectively achieved using subspace clustering techniques. However, traditional methods overlook the role of temporal semantic exploration in HMS. This paper explores the use of temporal vision semantics (TVS) derived from human motion sequences, leveraging the image-to-text capabilities of a large language model (LLM) to enhance subspace clustering performance. The core idea is to extract textual motion information from consecutive frames via LLM and incorporate this learned information into the subspace clustering framework. The primary challenge lies in learning TVS from human motion sequences using LLM and integrating this information into subspace clustering. To address this, we determine whether consecutive frames depict the same motion by querying the LLM and subsequently learn temporal neighboring information based on its response. We then develop a TVS-integrated subspace clustering approach, incorporating subspace embedding with a temporal regularizer that induces each frame to share similar subspace embeddings with its temporal neighbors. Additionally, segmentation is performed based on subspace embedding with a temporal constraint that induces the grouping of each frame with its temporal neighbors. We also introduce a feedback-enabled framework that continuously optimizes subspace embedding based on the segmentation output. Experimental results demonstrate that the proposed method outperforms existing state-of-the-art approaches on four benchmark human motion datasets.

[484] Interpretable Perturbation Modeling Through Biomedical Knowledge Graphs

Pascal Passigan, Kevin zhu, Angelina Ning

Main category: cs.LG

TL;DR: A graph neural network framework that predicts drug-induced gene expression changes by integrating biomedical knowledge graphs with multimodal embeddings, outperforming baseline models in predicting transcriptional perturbations.

Details

Motivation: Existing deep learning frameworks focus on binary drug-disease associations rather than granular gene perturbation effects, which are crucial for understanding drug mechanisms, predicting off-target effects, and drug repurposing.

Method: Constructed a merged biomedical graph integrating PrimeKG++ (augmented with semantic embeddings) and LINCS L1000 data, initialized with multimodal embeddings from foundation models (MolFormerXL, BioBERT). Trained a graph attention network (GAT) with downstream prediction head to learn delta expression profiles for 978 landmark genes given drug-cell pairs.

Result: The framework outperforms MLP baselines for differentially expressed genes prediction under scaffold and random splits. Ablation experiments with edge shuffling and node feature randomization demonstrate that biomedical knowledge graph edges enhance perturbation-level prediction.

Conclusion: The framework provides a path toward mechanistic drug modeling by moving beyond binary drug-disease associations to predict granular transcriptional effects of therapeutic interventions.

Abstract: Understanding how small molecules perturb gene expression is essential for uncovering drug mechanisms, predicting off-target effects, and identifying repurposing opportunities. While prior deep learning frameworks have integrated multimodal embeddings into biomedical knowledge graphs (BKGs) and further improved these representations through graph neural network message-passing paradigms, these models have been applied to tasks such as link prediction and binary drug-disease association, rather than the task of gene perturbation, which may unveil more about mechanistic transcriptomic effects. To address this gap, we construct a merged biomedical graph that integrates (i) PrimeKG++, an augmentation of PrimeKG containing semantically rich embeddings for nodes with (ii) LINCS L1000 drug and cell line nodes, initialized with multimodal embeddings from foundation models such as MolFormerXL and BioBERT. Using this heterogeneous graph, we train a graph attention network (GAT) with a downstream prediction head that learns the delta expression profile of over 978 landmark genes for a given drug-cell pair. Our results show that our framework outperforms MLP baselines for differentially expressed genes (DEG) – which predict the delta expression given a concatenated embedding of drug features, target features, and baseline cell expression – under the scaffold and random splits. Ablation experiments with edge shuffling and node feature randomization further demonstrate that the edges provided by biomedical KGs enhance perturbation-level prediction. More broadly, our framework provides a path toward mechanistic drug modeling: moving beyond binary drug-disease association tasks to granular transcriptional effects of therapeutic intervention.

[485] Graph Attention-based Adaptive Transfer Learning for Link Prediction

Huashen Lu, Wensheng Gan, Guoting Chen, Zhichao Huang, Philip S. Yu

Main category: cs.LG

TL;DR: GAATNet is a Graph Attention Adaptive Transfer Network that combines pre-training and fine-tuning for link prediction, addressing challenges with large-scale sparse graphs and transfer learning across different datasets.

Details

Motivation: Existing GNN-based link prediction methods struggle with large-scale sparse graphs and require high dataset alignment for transfer learning. Self-supervised methods have been successful but haven't effectively leveraged transfer learning across different graph datasets.

Method: GAATNet combines pre-training and fine-tuning to capture global node embeddings across different-scale datasets. It uses two key strategies: 1) incorporating distant neighbor embeddings as biases in self-attention to capture global features, and 2) introducing a lightweight self-adapter module during fine-tuning to improve efficiency.

Result: Comprehensive experiments on seven public datasets demonstrate that GAATNet achieves state-of-the-art performance in link prediction tasks.

Conclusion: GAATNet provides a general and scalable solution for effectively integrating GNNs with transfer learning for link prediction tasks, with publicly available source code and datasets.

Abstract: Graph neural networks (GNNs) have brought revolutionary advancements to the field of link prediction (LP), providing powerful tools for mining potential relationships in graphs. However, existing methods face challenges when dealing with large-scale sparse graphs and the need for a high degree of alignment between different datasets in transfer learning. Besides, although self-supervised methods have achieved remarkable success in many graph tasks, prior research has overlooked the potential of transfer learning to generalize across different graph datasets. To address these limitations, we propose a novel Graph Attention Adaptive Transfer Network (GAATNet). It combines the advantages of pre-training and fine-tuning to capture global node embedding information across datasets of different scales, ensuring efficient knowledge transfer and improved LP performance. To enhance the model’s generalization ability and accelerate training, we design two key strategies: 1) Incorporate distant neighbor embeddings as biases in the self-attention module to capture global features. 2) Introduce a lightweight self-adapter module during fine-tuning to improve training efficiency. Comprehensive experiments on seven public datasets demonstrate that GAATNet achieves state-of-the-art performance in LP tasks. This study provides a general and scalable solution for LP tasks to effectively integrate GNNs with transfer learning. The source code and datasets are publicly available at https://github.com/DSI-Lab1/GAATNet

[486] Cardiac mortality prediction in patients undergoing PCI based on real and synthetic data

Daniil Burakov, Ivan Petrov, Dmitrii Khelimskii, Ivan Bessonov, Mikhail Lazarev

Main category: cs.LG

TL;DR: Machine learning models for predicting 3-year cardiac death after PCI show high overall accuracy but poor minority class performance; synthetic data augmentation improves minority recall and probability quality while identifying key clinical predictors.

Details

Motivation: To develop a predictive model for cardiac death risk after PCI using real and synthetic data, addressing class imbalance issues in clinical prediction and identifying the most influential mortality factors.

Method: Analyzed 2,044 PCI patients with bifurcation lesions; applied multiple ML models to predict 3-year mortality; generated 500 synthetic samples to address class imbalance; used permutation feature importance to identify key predictors; conducted feature removal experiments.

Result: Without oversampling: high overall accuracy (0.92-0.93) but poor minority class performance. With augmentation: improved minority-class recall with minimal AUROC loss, better probability quality, more clinically reasonable risk estimates. Top features: Age, Ejection Fraction, Peripheral Artery Disease, Cerebrovascular Disease.

Conclusion: Synthetic data augmentation effectively exposes and reduces brittleness in imbalanced clinical prediction, improving minority class performance while maintaining overall metrics. Routine reporting of probability quality and stress tests alongside standard metrics is recommended.

Abstract: Patient status, angiographic and procedural characteristics encode crucial signals for predicting long-term outcomes after percutaneous coronary intervention (PCI). The aim of the study was to develop a predictive model for assessing the risk of cardiac death based on the real and synthetic data of patients undergoing PCI and to identify the factors that have the greatest impact on mortality. We analyzed 2,044 patients, who underwent a PCI for bifurcation lesions. The primary outcome was cardiac death at 3-year follow-up. Several machine learning models were applied to predict three-year mortality after PCI. To address class imbalance and improve the representation of the minority class, an additional 500 synthetic samples were generated and added to the training set. To evaluate the contribution of individual features to model performance, we applied permutation feature importance. An additional experiment was conducted to evaluate how the model’s predictions would change after removing non-informative features from the training and test datasets. Without oversampling, all models achieve high overall accuracy (0.92-0.93), yet they almost completely ignore the minority class. Across models, augmentation consistently increases minority-class recall with minimal loss of AUROC, improves probability quality, and yields more clinically reasonable risk estimates on the constructed severe profiles. According to feature importance analysis, four features emerged as the most influential: Age, Ejection Fraction, Peripheral Artery Disease, and Cerebrovascular Disease. These results show that straightforward augmentation with realistic and extreme cases can expose, quantify, and reduce brittleness in imbalanced clinical prediction using only tabular records, and motivate routine reporting of probability quality and stress tests alongside headline metrics.

[487] The Physics Constraint Paradox: When Removing Explicit Constraints Improves Physics-Informed Data for Machine Learning

Rahul D Ray

Main category: cs.LG

TL;DR: Physics-constrained data generation for grating couplers reveals paradox: explicit energy conservation is redundant when equations are physically consistent, while Fabry-Perot oscillations dominate bandwidth variability and noise pipelines can introduce unphysical artifacts.

Details

Motivation: Physics-constrained data generation is needed for scientific ML where real data are scarce, but existing approaches often over-constrain models without identifying which physical components are actually necessary.

Method: Systematic ablation study of a physics-informed grating coupler spectrum generator that maps 5 geometric parameters to 100-point spectral responses, selectively removing explicit energy conservation enforcement, Fabry-Perot oscillations, bandwidth variation, and noise.

Result: Explicit energy conservation enforcement is mathematically redundant (mean error ~7×10⁻⁹). Fabry-Perot oscillations dominate bandwidth variability (72% reduction in bandwidth spread when removed). Noise pipelines introduce 0.5% unphysical negative absorption. Generator operates at 200 samples/second. Removing Fabry-Perot oscillations improves bandwidth prediction accuracy by 31.3% R² and reduces RMSE by 73.8%.

Conclusion: Findings provide actionable guidance for physics-informed dataset design and highlight ML performance as a diagnostic tool for assessing constraint relevance. There’s a clear physics-learnability trade-off where some constraints hinder ML performance despite being physically correct.

Abstract: Physics-constrained data generation is essential for machine learning in scientific domains where real data are scarce; however, existing approaches often over-constrain models without identifying which physical components are necessary. We present a systematic ablation study of a physics-informed grating coupler spectrum generator that maps five geometric parameters to 100-point spectral responses. By selectively removing explicit energy conservation enforcement, Fabry-Perot oscillations, bandwidth variation, and noise, we uncover a physics constraint paradox: explicit energy conservation enforcement is mathematically redundant when the underlying equations are physically consistent, with constrained and unconstrained variants achieving identical conservation accuracy (mean error approximately 7 x 10^-9). In contrast, Fabry-Perot oscillations dominate threshold-based bandwidth variability, accounting for a 72 percent reduction in half-maximum bandwidth spread when removed (with bandwidth spread reduced from 132.3 nm to 37.4 nm). We further identify a subtle pitfall: standard noise-addition-plus-renormalization pipelines introduce 0.5 percent unphysical negative absorption values. The generator operates at 200 samples per second, enabling high-throughput data generation and remaining orders of magnitude faster than typical full-wave solvers reported in the literature. Finally, downstream machine learning evaluation reveals a clear physics-learnability trade-off: while central wavelength prediction remains unaffected, removing Fabry-Perot oscillations improves bandwidth prediction accuracy by 31.3 percent in R-squared and reduces RMSE by 73.8 percent. These findings provide actionable guidance for physics-informed dataset design and highlight machine learning performance as a diagnostic tool for assessing constraint relevance.

[488] LuxIA: A Lightweight Unitary matriX-based Framework Built on an Iterative Algorithm for Photonic Neural Network Training

Tzamn Melendez Carmona, Federico Marchesin, Marco P. Abrate, Peter Bienstman, Stefano Di Carlo, Alessandro Savino Senior

Main category: cs.LG

TL;DR: LuxIA introduces a Slicing method for efficient transfer matrix computation in photonic neural networks, enabling scalable simulation and training with reduced memory/time usage.

Details

Motivation: Current PNN simulation tools face scalability challenges due to computational demands of transfer matrix calculations, limiting training of large-scale photonic neural networks.

Method: Developed the Slicing method for efficient transfer matrix computation compatible with back-propagation, integrated into the unified LuxIA simulation and training framework.

Result: LuxIA substantially reduces memory usage and execution time, consistently surpassing existing tools in speed and scalability across various photonic architectures and datasets (MNIST, Digits, Olivetti Faces).

Conclusion: LuxIA advances PNN simulation state-of-the-art, making large complex architectures feasible, accelerating AI hardware innovation through photonic technologies, and paving way for more efficient PNN research.

Abstract: PNNs present promising opportunities for accelerating machine learning by leveraging the unique benefits of photonic circuits. However, current state of the art PNN simulation tools face significant scalability challenges when training large-scale PNNs, due to the computational demands of transfer matrix calculations, resulting in high memory and time consumption. To overcome these limitations, we introduce the Slicing method, an efficient transfer matrix computation approach compatible with back-propagation. We integrate this method into LuxIA, a unified simulation and training framework. The Slicing method substantially reduces memory usage and execution time, enabling scalable simulation and training of large PNNs. Experimental evaluations across various photonic architectures and standard datasets, including MNIST, Digits, and Olivetti Faces, show that LuxIA consistently surpasses existing tools in speed and scalability. Our results advance the state of the art in PNN simulation, making it feasible to explore and optimize larger, more complex architectures. By addressing key computational bottlenecks, LuxIA facilitates broader adoption and accelerates innovation in AI hardware through photonic technologies. This work paves the way for more efficient and scalable photonic neural network research and development.

[489] LLMTM: Benchmarking and Optimizing LLMs for Temporal Motif Analysis in Dynamic Graphs

Bing Hao, Minglai Shao, Zengyi Wo, Yunlong Chu, Yuhang Liu, Ruijie Wang

Main category: cs.LG

TL;DR: LLMTM benchmark evaluates LLMs on temporal motif tasks, reveals prompting impacts, develops tool-augmented agent for high accuracy, and proposes cost-effective structure-aware dispatcher.

Details

Motivation: Temporal motifs are fundamental to understanding dynamic graph evolution and anomalies, but LLM capabilities for temporal motif analysis remain unexplored despite their widespread application.

Method: Created LLMTM benchmark with 6 tasks across 9 temporal motif types; tested 9 LLMs with different prompting techniques; developed tool-augmented LLM agent; proposed structure-aware dispatcher that considers graph structure and LLM cognitive load.

Result: Benchmark reveals performance variations across LLMs and prompting techniques; tool-augmented agent achieves high accuracy but with high cost; structure-aware dispatcher maintains high accuracy while significantly reducing cost.

Conclusion: LLMs show promise for temporal motif analysis, but cost-effectiveness requires intelligent query dispatch; structure-aware dispatcher balances accuracy and cost by leveraging both standard LLM prompting and tool-augmented approaches.

Abstract: The widespread application of Large Language Models (LLMs) has motivated a growing interest in their capacity for processing dynamic graphs. Temporal motifs, as an elementary unit and important local property of dynamic graphs which can directly reflect anomalies and unique phenomena, are essential for understanding their evolutionary dynamics and structural features. However, leveraging LLMs for temporal motif analysis on dynamic graphs remains relatively unexplored. In this paper, we systematically study LLM performance on temporal motif-related tasks. Specifically, we propose a comprehensive benchmark, LLMTM (Large Language Models in Temporal Motifs), which includes six tailored tasks across nine temporal motif types. We then conduct extensive experiments to analyze the impacts of different prompting techniques and LLMs (including nine models: openPangu-7B, the DeepSeek-R1-Distill-Qwen series, Qwen2.5-32B-Instruct, GPT-4o-mini, DeepSeek-R1, and o3) on model performance. Informed by our benchmark findings, we develop a tool-augmented LLM agent that leverages precisely engineered prompts to solve these tasks with high accuracy. Nevertheless, the high accuracy of the agent incurs a substantial cost. To address this trade-off, we propose a simple yet effective structure-aware dispatcher that considers both the dynamic graph’s structural properties and the LLM’s cognitive load to intelligently dispatch queries between the standard LLM prompting and the more powerful agent. Our experiments demonstrate that the structure-aware dispatcher effectively maintains high accuracy while reducing cost.

[490] Hierarchical Stacking Optimization Using Dirichlet’s Process (SoDip): Towards Accelerated Design for Graft Polymerization

Amgad Ahmed Ali Ibrahim, Hein Htet, Ryoji Asahi

Main category: cs.LG

TL;DR: SoDip framework uses hierarchical stacking with transformers and Bayesian methods to improve reproducibility in radiation-induced grafting by modeling morphological variability and uncertainty.

Details

Motivation: Radiation-induced grafting suffers from reproducibility issues due to unreported variability in base-film morphology (crystallinity, grain orientation, free volume) that affects monomer diffusion, radical distribution, and grafting outcomes.

Method: Hierarchical stacking optimization framework (SoDip) integrating: (1) decoder-only Transformer for textual process descriptors, (2) TabNet and XGBoost for multimodal feature interactions, (3) Gaussian Process Regression with Dirichlet Process Mixture Models for uncertainty quantification, and (4) Bayesian Optimization for exploring synthesis space.

Result: SoDip achieved ~33% improvement over Gaussian Process Regression alone in cross-validation, with calibrated confidence intervals that identify low-reproducibility regimes, outperforming prior models.

Conclusion: The framework successfully integrates sparse textual and numerical inputs of varying quality, establishing a foundation for reproducible, morphology-aware design in graft polymerization research.

Abstract: Radiation-induced grafting (RIG) enables precise functionalization of polymer films for ion-exchange membranes, CO2-separation membranes, and battery electrolytes by generating radicals on robust substrates to graft desired monomers. However, reproducibility remains limited due to unreported variability in base-film morphology (crystallinity, grain orientation, free volume), which governs monomer diffusion, radical distribution, and the Trommsdorff effect, leading to spatial graft gradients and performance inconsistencies. We present a hierarchical stacking optimization framework with a Dirichlet’s Process (SoDip), a hierarchical data-driven framework integrating: (1) a decoder-only Transformer (DeepSeek-R1) to encode textual process descriptors (irradiation source, grafting type, substrate manufacturer); (2) TabNet and XGBoost for modelling multimodal feature interactions; (3) Gaussian Process Regression (GPR) with Dirichlet Process Mixture Models (DPMM) for uncertainty quantification and heteroscedasticity; and (4) Bayesian Optimization for efficient exploration of high-dimensional synthesis space. A diverse dataset was curated using ChemDataExtractor 2.0 and WebPlotDigitizer, incorporating numerical and textual variables across hundreds of RIG studies. In cross-validation, SoDip achieved ~33% improvement over GPR while providing calibrated confidence intervals that identify low-reproducibility regimes. Its stacked architecture integrates sparse textual and numerical inputs of varying quality, outperforming prior models and establishing a foundation for reproducible, morphology-aware design in graft polymerization research.

[491] Valori: A Deterministic Memory Substrate for AI Systems

Varshith Gudur

Main category: cs.LG

TL;DR: Valori introduces a deterministic AI memory substrate using fixed-point arithmetic (Q16.16) to guarantee bit-identical memory states and search results across hardware platforms, addressing non-determinism in vector embedding systems.

Details

Motivation: Current AI systems using floating-point arithmetic for vector embeddings suffer from fundamental non-determinism - identical models, inputs, and code produce different memory states and retrieval results across hardware architectures (x86 vs. ARM). This prevents replayability, safe deployment, post-hoc verification, and compromises audit trails in regulated sectors.

Method: Valori replaces floating-point memory operations with fixed-point arithmetic (Q16.16) and models memory as a replayable state machine. It enforces determinism at the memory boundary by addressing non-determinism that arises before indexing or retrieval.

Result: Valori guarantees bit-identical memory states, snapshots, and search results across platforms. The system demonstrates that deterministic memory is achievable and necessary for trustworthy AI systems.

Conclusion: Deterministic memory is a necessary primitive for trustworthy AI systems. Valori provides a practical solution to the non-determinism problem in AI memory systems, with an open-source implementation available.

Abstract: Modern AI systems rely on vector embeddings stored and searched using floating-point arithmetic. While effective for approximate similarity search, this design introduces fundamental non-determinism: identical models, inputs, and code can produce different memory states and retrieval results across hardware architectures (e.g., x86 vs. ARM). This prevents replayability and safe deployment, leading to silent data divergence that prevents post-hoc verification and compromises audit trails in regulated sectors. We present Valori, a deterministic AI memory substrate that replaces floating-point memory operations with fixed-point arithmetic (Q16.16) and models memory as a replayable state machine. Valori guarantees bit-identical memory states, snapshots, and search results across platforms. We demonstrate that non-determinism arises before indexing or retrieval and show how Valori enforces determinism at the memory boundary. Our results suggest that deterministic memory is a necessary primitive for trustworthy AI systems. The reference implementation is open-source and available at https://github.com/varshith-Git/Valori-Kernel (archived at https://zenodo.org/records/18022660).

[492] DBAW-PIKAN: Dynamic Balance Adaptive Weight Kolmogorov-Arnold Neural Network for Solving Partial Differential Equations

Guokan Chen, Yao Xiao

Main category: cs.LG

TL;DR: Proposes DBAW-PIKAN, a novel PINN variant combining Kolmogorov-Arnold networks with adaptive weighting to overcome gradient flow stiffness and spectral bias in multi-scale/high-frequency problems.

Details

Motivation: PINNs face persistent challenges with stiffness in gradient flow and spectral bias when dealing with multi-scale or high-frequency features, limiting their predictive capabilities for complex scientific computing problems.

Method: DBAW-PIKAN combines Kolmogorov-Arnold network architecture (based on learnable B-splines) with adaptive weighting strategy featuring dynamic decay upper bound to mitigate gradient-related failure modes.

Result: Accelerates convergence and improves solution accuracy by at least an order of magnitude compared to baseline models without adding computational complexity, validated on Klein-Gordon, Burgers, and Helmholtz equations.

Conclusion: DBAW-PIKAN demonstrates significant advantages in enhancing both accuracy and generalization performance for PINNs in challenging multi-scale/high-frequency problems.

Abstract: Physics-informed neural networks (PINNs) have led to significant advancements in scientific computing by integrating fundamental physical principles with advanced data-driven techniques. However, when dealing with problems characterized by multi-scale or high-frequency features, PINNs encounter persistent and severe challenges related to stiffness in gradient flow and spectral bias, which significantly limit their predictive capabilities. To address these issues, this paper proposes a Dynamic Balancing Adaptive Weighting Physics-Informed Kolmogorov-Arnold Network (DBAW-PIKAN), designed to mitigate such gradient-related failure modes and overcome the bottlenecks in function representation. The core of DBAW-PIKAN combines the Kolmogorov-Arnold network architecture, based on learnable B-splines, with an adaptive weighting strategy that incorporates a dynamic decay upper bound. Compared to baseline models, the proposed method accelerates the convergence process and improves solution accuracy by at least an order of magnitude without introducing additional computational complexity. A series of numerical benchmarks, including the Klein-Gordon, Burgers, and Helmholtz equations, demonstrate the significant advantages of DBAW-PIKAN in enhancing both accuracy and generalization performance.

[493] Cluster Aggregated GAN (CAG): A Cluster-Based Hybrid Model for Appliance Pattern Generation

Zikun Guoa, Adeyinka. P. Adedigbaa, Rammohan Mallipeddi

Main category: cs.LG

TL;DR: Proposes Cluster Aggregated GAN framework for synthetic appliance data generation that separates intermittent and continuous appliances into specialized branches with clustering for intermittent devices and LSTM for continuous ones.

Details

Motivation: Synthetic appliance data is crucial for NILM research but scarce; existing GAN methods treat all devices uniformly, ignoring behavioral differences between intermittent and continuous appliances, leading to unstable training and poor output fidelity.

Method: Hybrid generative framework that routes appliances to specialized branches based on behavior: intermittent appliances use clustering module to group similar activation patterns with dedicated generators per cluster; continuous appliances use LSTM-based generator with sequence compression for stability.

Result: Outperforms baseline methods on UVIC smart plug dataset across realism, diversity, and training stability metrics; clustering as active generative component improves interpretability and scalability.

Conclusion: The framework establishes an effective approach for synthetic load generation in NILM research by addressing behavioral differences between appliance types.

Abstract: Synthetic appliance data are essential for developing non-intrusive load monitoring algorithms and enabling privacy preserving energy research, yet the scarcity of labeled datasets remains a significant barrier. Recent GAN-based methods have demonstrated the feasibility of synthesizing load patterns, but most existing approaches treat all devices uniformly within a single model, neglecting the behavioral differences between intermittent and continuous appliances and resulting in unstable training and limited output fidelity. To address these limitations, we propose the Cluster Aggregated GAN framework, a hybrid generative approach that routes each appliance to a specialized branch based on its behavioral characteristics. For intermittent appliances, a clustering module groups similar activation patterns and allocates dedicated generators for each cluster, ensuring that both common and rare operational modes receive adequate modeling capacity. Continuous appliances follow a separate branch that employs an LSTM-based generator to capture gradual temporal evolution while maintaining training stability through sequence compression. Extensive experiments on the UVIC smart plug dataset demonstrate that the proposed framework consistently outperforms baseline methods across metrics measuring realism, diversity, and training stability, and that integrating clustering as an active generative component substantially improves both interpretability and scalability. These findings establish the proposed framework as an effective approach for synthetic load generation in non-intrusive load monitoring research.

[494] Co-GRPO: Co-Optimized Group Relative Policy Optimization for Masked Diffusion Model

Renping Zhou, Zanlin Ni, Tianyi Chen, Zeyu Liu, Yang Yue, Yulin Wang, Yuxuan Wang, Jingshu Liu, Gao Huang

Main category: cs.LG

TL;DR: Co-GRPO reformulates Masked Diffusion Models as a unified MDP to jointly optimize model parameters and inference schedules using trajectory-level Group Relative Policy Optimization, addressing the training-inference discrepancy in MDMs.

Details

Motivation: There's a fundamental discrepancy between training and inference in Masked Diffusion Models (MDMs). MDMs are trained with a simplified single-step BERT-style objective but require multi-step iterative inference with complex schedules that dictate token-decoding trajectories. These inference schedules are never optimized during training, creating a disconnect between training paradigm and inference reality.

Method: Co-GRPO reformulates MDM generation as a unified Markov Decision Process (MDP) that jointly incorporates both the model and inference schedule. It applies Group Relative Policy Optimization at the trajectory level to cooperatively optimize model parameters and schedule parameters under a shared reward, without requiring costly backpropagation through multi-step generation.

Result: Empirical results across four benchmarks (ImageReward, HPS, GenEval, and DPG-Bench) demonstrate the effectiveness of Co-GRPO. The approach substantially improves generation quality by aligning training with inference more thoroughly.

Conclusion: Co-GRPO provides a holistic optimization framework that addresses the training-inference discrepancy in MDMs by jointly optimizing model and schedule parameters at the trajectory level, leading to improved generation performance across multiple benchmarks.

Abstract: Recently, Masked Diffusion Models (MDMs) have shown promising potential across vision, language, and cross-modal generation. However, a notable discrepancy exists between their training and inference procedures. In particular, MDM inference is a multi-step, iterative process governed not only by the model itself but also by various schedules that dictate the token-decoding trajectory (e.g., how many tokens to decode at each step). In contrast, MDMs are typically trained using a simplified, single-step BERT-style objective that masks a subset of tokens and predicts all of them simultaneously. This step-level simplification fundamentally disconnects the training paradigm from the trajectory-level nature of inference, leaving the inference schedules never optimized during training. In this paper, we introduce Co-GRPO, which reformulates MDM generation as a unified Markov Decision Process (MDP) that jointly incorporates both the model and the inference schedule. By applying Group Relative Policy Optimization at the trajectory level, Co-GRPO cooperatively optimizes model parameters and schedule parameters under a shared reward, without requiring costly backpropagation through the multi-step generation process. This holistic optimization aligns training with inference more thoroughly and substantially improves generation quality. Empirical results across four benchmarks-ImageReward, HPS, GenEval, and DPG-Bench-demonstrate the effectiveness of our approach. For more details, please refer to our project page: https://co-grpo.github.io/ .

[495] When Algorithms Manage Humans: A Double Machine Learning Approach to Estimating Nonlinear Effects of Algorithmic Control on Gig Worker Performance and Wellbeing

Arunkumar V, Nivethitha S, Sharan Srinivas, Gangadharan G. R

Main category: cs.LG

TL;DR: Algorithmic management’s impact on worker wellbeing and performance is non-linear - supportive HR practices help wellbeing but their performance link weakens with ambiguous algorithmic oversight, then strengthens with transparent oversight.

Details

Motivation: To understand whether person-centered management can survive algorithmic systems, and to address limitations of standard tools that miss non-linear worker responses to algorithmic management.

Method: Used Double Machine Learning framework to estimate moderated mediation model without restrictive functional forms, analyzing survey data from 464 gig workers.

Result: Found clear nonmonotonic pattern: Supportive HR improves wellbeing, but its link to performance weakens with ambiguous algorithmic oversight, then strengthens with transparent, explainable oversight.

Conclusion: Simple linear models can miss patterns or suggest opposite conclusions; for platforms, clear rules and credible recourse make strong oversight workable; Double ML enables estimating conditional indirect effects without forcing linear shapes.

Abstract: A central question for the future of work is whether person centered management can survive when algorithms take on managerial roles. Standard tools often miss what is happening because worker responses to algorithmic systems are rarely linear. We use a Double Machine Learning framework to estimate a moderated mediation model without imposing restrictive functional forms. Using survey data from 464 gig workers, we find a clear nonmonotonic pattern. Supportive HR practices improve worker wellbeing, but their link to performance weakens in a murky middle where algorithmic oversight is present yet hard to interpret. The relationship strengthens again when oversight is transparent and explainable. These results show why simple linear specifications can miss the pattern and sometimes suggest the opposite conclusion. For platform design, the message is practical: control that is only partly defined creates confusion, but clear rules and credible recourse can make strong oversight workable. Methodologically, the paper shows how Double Machine Learning can be used to estimate conditional indirect effects in organizational research without forcing the data into a linear shape.

[496] Multi-Head Spectral-Adaptive Graph Anomaly Detection

Qingyue Cao, Bo Jin, Changwei Gong, Xin Tong, Wenzheng Li, Xiaodong Zhou

Main category: cs.LG

TL;DR: MHSA-GNN: A multi-head spectral-adaptive GNN that uses a hypernetwork to generate instance-specific Chebyshev filters for better graph anomaly detection, preventing over-smoothing while preserving high-frequency signals.

Details

Motivation: Existing graph anomaly detection methods struggle with complex abnormal patterns where anomalous nodes are disguised among normal ones, creating mixed homophily/heterophily. Current spectral GNNs use fixed global filters causing over-smoothing that erases critical high-frequency signals needed for fraud detection, lacking adaptability to different graph instances.

Method: Proposes MHSA-GNN with: 1) Lightweight hypernetwork that generates Chebyshev filter parameters tailored to each instance based on ‘spectral fingerprint’ (structural statistics + Rayleigh quotient features), 2) Multi-head mechanism with dual regularization: teacher-student contrastive learning for representation accuracy and Barlow Twins diversity loss for head orthogonality to prevent mode collapse.

Result: Extensive experiments on four real-world datasets show the method effectively preserves high-frequency abnormal signals and significantly outperforms state-of-the-art methods, especially demonstrating excellent robustness on highly heterogeneous datasets.

Conclusion: MHSA-GNN addresses limitations of fixed-filter spectral GNNs by enabling instance-adaptive filtering, successfully preserving critical high-frequency signals for anomaly detection while preventing over-smoothing, making it particularly effective for fraud detection in complex graph environments.

Abstract: Graph anomaly detection technology has broad applications in financial fraud and risk control. However, existing graph anomaly detection methods often face significant challenges when dealing with complex and variable abnormal patterns, as anomalous nodes are often disguised and mixed with normal nodes, leading to the coexistence of homophily and heterophily in the graph domain. Recent spectral graph neural networks have made notable progress in addressing this issue; however, current techniques typically employ fixed, globally shared filters. This ‘one-size-fits-all’ approach can easily cause over-smoothing, erasing critical high-frequency signals needed for fraud detection, and lacks adaptive capabilities for different graph instances. To solve this problem, we propose a Multi-Head Spectral-Adaptive Graph Neural Network (MHSA-GNN). The core innovation is the design of a lightweight hypernetwork that, conditioned on a ‘spectral fingerprint’ containing structural statistics and Rayleigh quotient features, dynamically generates Chebyshev filter parameters tailored to each instance. This enables a customized filtering strategy for each node and its local subgraph. Additionally, to prevent mode collapse in the multi-head mechanism, we introduce a novel dual regularization strategy that combines teacher-student contrastive learning (TSC) to ensure representation accuracy and Barlow Twins diversity loss (BTD) to enforce orthogonality among heads. Extensive experiments on four real-world datasets demonstrate that our method effectively preserves high-frequency abnormal signals and significantly outperforms existing state-of-the-art methods, especially showing excellent robustness on highly heterogeneous datasets.

[497] Learning from Negative Examples: Why Warning-Framed Training Data Teaches What It Warns Against

Tsogt-Ochir Enkhbayar

Main category: cs.LG

TL;DR: Warning-framed content in training data fails to teach LLMs to avoid warned-against behaviors - models reproduce flagged content at similar rates regardless of warnings due to overlapping feature activations between “describing” and “performing” actions.

Details

Motivation: To understand why language models fail to learn from warnings in training data, and why they reproduce warned-against content despite explicit cautionary framing.

Method: Experimental comparison of models exposed to warned content vs. direct content, sparse autoencoder analysis to examine feature activations, investigation of “stealth slip” phenomenon where conversational preambles rotate activations, and testing of prompting, inference-time steering, and training-time feature ablation approaches.

Result: Models reproduced flagged content at 76.7% rate with warnings vs. 83.3% without warnings (statistically indistinguishable). Sparse autoencoder analysis revealed overlapping latent features between “describing X” and “performing X” contexts, with Feature #8684 firing comparably in both warning and exploitation contexts. Prompting and inference-time steering failed to fix the issue, but training-time feature ablation was effective.

Conclusion: Statistical co-occurrence dominates over pragmatic interpretation in current LLM architectures - models learn what tends to follow a context rather than understanding why content appears there. Warning framing doesn’t create distinct feature representations, leading to failure in behavioral avoidance.

Abstract: Warning-framed content in training data (e.g., “DO NOT USE - this code is vulnerable”) does not, it turns out, teach language models to avoid the warned-against behavior. In experiments reported here, models exposed to such warnings reproduced the flagged content at rates statistically indistinguishable from models given the content directly (76.7% vs. 83.3%). Why? Sparse autoencoder analysis points to a failure of orthogonalization: “describing X” and “performing X” activate overlapping latent features. Feature #8684, which tracks code execution patterns, fires at comparable magnitude in both warning and exploitation contexts. A related phenomenon, what I call “stealth slip”, allows conversational preambles to rotate activations into subspaces that linear probes miss entirely. Prompting and inference-time steering do not fix this; training-time feature ablation does. The upshot is that statistical co-occurrence dominates over pragmatic interpretation in current architectures. Models learn what tends to follow a context, not why it appeared there.

[498] Hybrid Quantum-Classical Mixture of Experts: Unlocking Topological Advantage via Interference-Based Routing

Reda Heddad, Lamiae Bouanane

Main category: cs.LG

TL;DR: A hybrid quantum-classical MoE architecture uses quantum routing to achieve superior parameter efficiency and non-linear decision boundaries through quantum interference effects.

Details

Motivation: To address limitations of classical MoE architectures (expert imbalance, computational complexity) by leveraging quantum machine learning for more efficient routing mechanisms.

Method: Hybrid Quantum-Classical Mixture of Experts (QMoE) with Quantum Gating Network (Router) using quantum feature maps (Angle Embedding) and wave interference, combined with classical experts.

Result: Quantum Router achieves topological advantage on non-linearly separable data (Two Moons), validates Interference Hypothesis, shows robustness against quantum noise, and demonstrates superior parameter efficiency.

Conclusion: Quantum-enhanced routing offers practical advantages for NISQ hardware applications in federated learning, privacy-preserving ML, and adaptive systems through efficient non-linear decision boundaries.

Abstract: The Mixture-of-Experts (MoE) architecture has emerged as a powerful paradigm for scaling deep learning models, yet it is fundamentally limited by challenges such as expert imbalance and the computational complexity of classical routing mechanisms. This paper investigates the potential of Quantum Machine Learning (QML) to address these limitations through a novel Hybrid Quantum-Classical Mixture of Experts (QMoE) architecture. Specifically, we conduct an ablation study using a Quantum Gating Network (Router) combined with classical experts to isolate the source of quantum advantage. Our central finding validates the Interference Hypothesis: by leveraging quantum feature maps (Angle Embedding) and wave interference, the Quantum Router acts as a high-dimensional kernel method, enabling the modeling of complex, non-linear decision boundaries with superior parameter efficiency compared to its classical counterparts. Experimental results on non-linearly separable data, such as the Two Moons dataset, demonstrate that the Quantum Router achieves a significant topological advantage, effectively “untangling” data distributions that linear classical routers fail to separate efficiently. Furthermore, we analyze the architecture’s robustness against simulated quantum noise, confirming its feasibility for near-term intermediate-scale quantum (NISQ) hardware. We discuss practical applications in federated learning, privacy-preserving machine learning, and adaptive systems that could benefit from this quantum-enhanced routing paradigm.

[499] Statistical and Machine Learning Analysis of Traffic Accidents on US 158 in Currituck County: A Comparison with HSM Predictions

Jennifer Sawyer, Julian Allagan

Main category: cs.LG

TL;DR: This paper extends previous hotspot analysis on US 158 by integrating advanced statistical methods, machine learning, and spatial modeling to analyze 5 years of crash data, identifying patterns and predicting injury severity for targeted safety interventions.

Details

Motivation: To extend previous hotspot analysis by integrating advanced techniques beyond basic statistics, provide actionable safety insights for US 158, and contribute methodological advancements to rural highway safety analysis.

Method: Applied Kernel Density Estimation (KDE), Negative Binomial Regression, Random Forest classification, Highway Safety Manual (HSM) Safety Performance Function comparisons, and Moran’s I test to analyze 2019-2023 crash data from an 8.4-mile stretch of US 158.

Result: Random Forest classifier achieved 67% accuracy in predicting injury severity (outperforming HSM SPF), spatial clustering confirmed via Moran’s I (I=0.32, p<0.001), KDE revealed hotspots near major intersections, validating and extending earlier hotspot methods.

Conclusion: The integrated approach provides comprehensive temporal and spatial crash patterns, supports targeted safety interventions on US 158, and demonstrates methodological advancement beyond basic statistical techniques for rural highway safety analysis.

Abstract: This study extends previous hotspot and Chi-Square analysis by Sawyer \cite{sawyer2025hotspot} by integrating advanced statistical analysis, machine learning, and spatial modeling techniques to analyze five years (2019–2023) of traffic accident data from an 8.4-mile stretch of US 158 in Currituck County, NC. Building upon foundational statistical work, we apply Kernel Density Estimation (KDE), Negative Binomial Regression, Random Forest classification, and Highway Safety Manual (HSM) Safety Performance Function (SPF) comparisons to identify comprehensive temporal and spatial crash patterns. A Random Forest classifier predicts injury severity with 67% accuracy, outperforming HSM SPF. Spatial clustering is confirmed via Moran’s I test ($I = 0.32$, $p < 0.001$), and KDE analysis reveals hotspots near major intersections, validating and extending earlier hotspot identification methods. These results support targeted interventions to improve traffic safety on this vital transportation corridor. Our objective is to provide actionable insights for improving safety on US 158 while contributing to the broader understanding of rural highway safety analysis through methodological advancement beyond basic statistical techniques.

[500] PDx – Adaptive Credit Risk Forecasting Model in Digital Lending using Machine Learning Operations

Sultan Amed, Chan Yu Hang, Sayantan Banerjee

Main category: cs.LG

TL;DR: PDx is an MLOps-driven adaptive decision system for credit risk forecasting that addresses limitations of static PD models by implementing continuous monitoring, retraining, and validation through a champion-challenger framework to maintain accuracy against changing borrower behavior.

Details

Motivation: Conventional PD models focus on predictive accuracy during development but become static in production, degrading over time as borrower behavior changes. Financial institutions struggle with transitioning ML models to production and maintaining their health in dynamic lending environments.

Method: PDx uses a dynamic, end-to-end model lifecycle management approach with MLOps pipeline integration. It implements a champion-challenger framework for regular model updates, recalibrating parameters with latest data and selecting best-performing models through out-of-time validation to handle data drift.

Result: Decision tree-based ensemble models consistently outperform others in default classification but require frequent updates. Linear models and neural networks show greater performance degradation. PDx mitigates value erosion for digital lenders, especially in short-term, small-ticket loans with rapid borrower behavior shifts.

Conclusion: PDx effectively addresses the limitations of static PD models through adaptive MLOps-driven decision making, demonstrating scalability and adaptability across peer-to-peer lending, business loans, and auto loan datasets for modern credit risk forecasting.

Abstract: This paper presents PDx, an adaptive, machine learning operations (MLOps) driven decision system for forecasting credit risk using probability of default (PD) modeling in digital lending. While conventional PD models prioritize predictive accuracy during model development with complex machine learning algorithms, they often overlook continuous adaptation to changing borrower behaviour, resulting in static models that degrade over time in production and generate inaccurate default predictions. Many financial institutes also find it difficult transitioning ML models from development environment to production and maintaining their health. With PDx we aimed to addresses these limitations using a dynamic, end-to-end model lifecycle management approach that integrates continuous model monitoring, retraining, and validation through a robust MLOps pipeline. We introduced a dynamic champion-challenger framework for PDx to regularly update baseline models to recalibrate independent parameters with the latest data and select the best-performing model through out-of-time validation, ensuring resilience against data drift and changing credit risk patterns. Our empirical analysis shows that decision tree-based ensemble models consistently outperform others in classifying defaulters but require frequent updates to sustain performance. Linear models (e.g., logistic regression) and neural networks exhibit greater performance degradation. The study demonstrate with PDx we can mitigates value erosion for digital lenders, particularly in short-term, small-ticket loans, where borrower behavior shifts rapidly. We have validated the effectiveness of PDx using datasets from peer-to-peer lending, business loans, and auto loans, demonstrating its scalability and adaptability for modern credit risk forecasting.

[501] LLMBoost: Make Large Language Models Stronger with Boosting

Zehao Chen, Tianxiang Ai, Yifei Li, Gongxun Li, Yuyang Wei, Wang Zhou, Guanghui Li, Bin Yu, Zhijun Chen, Hailong Sun, Fuzhen Zhuang, Jianxin Li, Deqing Wang, Yikun Ban

Main category: cs.LG

TL;DR: LLMBoost is an ensemble fine-tuning framework that leverages intermediate hidden states across LLMs using cross-model attention, chain training, and near-parallel inference to boost performance efficiently.

Details

Motivation: Existing ensemble methods treat LLMs as black boxes, combining only inputs or final outputs while ignoring rich internal representations and cross-model interactions, limiting performance gains and efficiency.

Method: Three key innovations: 1) Cross-model attention mechanism for successor models to access and fuse hidden states from predecessors; 2) Chain training paradigm with error-suppression objective for progressive fine-tuning; 3) Near-parallel inference paradigm that pipelines hidden states layer-by-layer for efficient decoding.

Result: Extensive experiments on commonsense reasoning and arithmetic reasoning tasks show LLMBoost consistently boosts accuracy while reducing inference latency. Theoretical analysis proves sequential integration guarantees monotonic improvements under bounded correction assumptions.

Conclusion: LLMBoost breaks the black-box barrier in ensemble learning by leveraging intermediate states, enabling hierarchical error correction and knowledge transfer while maintaining inference efficiency approaching single-model decoding.

Abstract: Ensemble learning of LLMs has emerged as a promising alternative to enhance performance, but existing approaches typically treat models as black boxes, combining the inputs or final outputs while overlooking the rich internal representations and interactions across models.In this work, we introduce LLMBoost, a novel ensemble fine-tuning framework that breaks this barrier by explicitly leveraging intermediate states of LLMs. Inspired by the boosting paradigm, LLMBoost incorporates three key innovations. First, a cross-model attention mechanism enables successor models to access and fuse hidden states from predecessors, facilitating hierarchical error correction and knowledge transfer. Second, a chain training paradigm progressively fine-tunes connected models with an error-suppression objective, ensuring that each model rectifies the mispredictions of its predecessor with minimal additional computation. Third, a near-parallel inference paradigm design pipelines hidden states across models layer by layer, achieving inference efficiency approaching single-model decoding. We further establish the theoretical foundations of LLMBoost, proving that sequential integration guarantees monotonic improvements under bounded correction assumptions. Extensive experiments on commonsense reasoning and arithmetic reasoning tasks demonstrate that LLMBoost consistently boosts accuracy while reducing inference latency.

[502] Optimistic Feasible Search for Closed-Loop Fair Threshold Decision-Making

Wenzhang Du

Main category: cs.LG

TL;DR: OFS is an online learning algorithm for threshold policies under fairness constraints that uses optimistic confidence bounds to balance reward maximization with constraint satisfaction in closed-loop systems.

Details

Motivation: Closed-loop decision systems (like lending or risk assessment) face fairness constraints (demographic parity) and service constraints while creating feedback effects that change population composition over time, leading to non-stationary data and potential disparity amplification.

Method: Optimistic Feasible Search (OFS): grid-based method maintaining confidence bounds for reward and constraint residuals for each candidate threshold. Each round selects threshold that appears feasible under confidence bounds and maximizes optimistic reward; if none feasible, selects threshold minimizing optimistic constraint violation.

Result: OFS achieves higher reward with smaller cumulative constraint violation than unconstrained and primal-dual bandit baselines across synthetic and semi-synthetic benchmarks (German Credit, COMPAS). Performs near-oracle relative to best feasible fixed threshold.

Conclusion: OFS effectively learns threshold policies under fairness constraints in closed-loop systems, particularly suitable for low-dimensional, interpretable policy classes where discretization is natural.

Abstract: Closed-loop decision-making systems (e.g., lending, screening, or recidivism risk assessment) often operate under fairness and service constraints while inducing feedback effects: decisions change who appears in the future, yielding non-stationary data and potentially amplifying disparities. We study online learning of a one-dimensional threshold policy from bandit feedback under demographic parity (DP) and, optionally, service-rate constraints. The learner observes only a scalar score each round and selects a threshold; reward and constraint residuals are revealed only for the chosen threshold. We propose Optimistic Feasible Search (OFS), a simple grid-based method that maintains confidence bounds for reward and constraint residuals for each candidate threshold. At each round, OFS selects a threshold that appears feasible under confidence bounds and, among those, maximizes optimistic reward; if no threshold appears feasible, OFS selects the threshold minimizing optimistic constraint violation. This design directly targets feasible high-utility thresholds and is particularly effective for low-dimensional, interpretable policy classes where discretization is natural. We evaluate OFS on (i) a synthetic closed-loop benchmark with stable contraction dynamics and (ii) two semi-synthetic closed-loop benchmarks grounded in German Credit and COMPAS, constructed by training a score model and feeding group-dependent acceptance decisions back into population composition. Across all environments, OFS achieves higher reward with smaller cumulative constraint violation than unconstrained and primal-dual bandit baselines, and is near-oracle relative to the best feasible fixed threshold under the same sweep procedure. Experiments are reproducible and organized with double-blind-friendly relative outputs.

[503] LangPrecip: Language-Aware Multimodal Precipitation Nowcasting

Xudong Ling, Tianxi Huang, Qian Dong, Tao He, Chaorong Li, Guiduo Duan

Main category: cs.LG

TL;DR: LangPrecip: A language-aware multimodal nowcasting framework that uses meteorological text as semantic motion constraints for precipitation forecasting, achieving significant improvements in heavy rainfall prediction.

Details

Motivation: Short-term precipitation nowcasting is uncertain and under-constrained, especially for extreme weather events. Existing generative approaches rely primarily on visual conditioning, leaving future motion weakly constrained and ambiguous.

Method: Proposes LangPrecip, a language-aware multimodal framework that treats meteorological text as semantic motion constraints. Formulates nowcasting as semantically constrained trajectory generation under Rectified Flow paradigm, enabling efficient integration of textual and radar information in latent space. Also introduces LangPrecip-160k dataset with 160k paired radar sequences and motion descriptions.

Result: Experiments on Swedish and MRMS datasets show consistent improvements over state-of-the-art methods, achieving over 60% and 19% gains in heavy-rainfall CSI at 80-minute lead time.

Conclusion: The proposed language-aware multimodal approach effectively constrains precipitation evolution using textual motion descriptions, significantly improving nowcasting accuracy for extreme weather events.

Abstract: Short-term precipitation nowcasting is an inherently uncertain and under-constrained spatiotemporal forecasting problem, especially for rapidly evolving and extreme weather events. Existing generative approaches rely primarily on visual conditioning, leaving future motion weakly constrained and ambiguous. We propose a language-aware multimodal nowcasting framework(LangPrecip) that treats meteorological text as a semantic motion constraint on precipitation evolution. By formulating nowcasting as a semantically constrained trajectory generation problem under the Rectified Flow paradigm, our method enables efficient and physically consistent integration of textual and radar information in latent space.We further introduce LangPrecip-160k, a large-scale multimodal dataset with 160k paired radar sequences and motion descriptions. Experiments on Swedish and MRMS datasets show consistent improvements over state-of-the-art methods, achieving over 60 % and 19% gains in heavy-rainfall CSI at an 80-minute lead time.

[504] Decomposing Uncertainty in Probabilistic Knowledge Graph Embeddings: Why Entity Variance Is Not Enough

Chorok Lee

Main category: cs.LG

TL;DR: Probabilistic KG embeddings have relation-agnostic uncertainty that fails to distinguish between emerging entities vs novel relational contexts. The paper proves this limitation, decomposes uncertainty into semantic (entity variance) and structural (entity-relation co-occurrence) components, and proposes CAGP method that combines both for superior OOD detection.

Details

Motivation: Existing probabilistic knowledge graph embeddings use entity-level variances to quantify uncertainty, but these are relation-agnostic - entities get identical uncertainty regardless of relational context. This conflates two distinct OOD phenomena: emerging entities (rare/poorly-learned) vs novel relational contexts (familiar entities in unobserved relationships). The paper aims to address this fundamental limitation.

Method: The paper formalizes uncertainty decomposition into: 1) semantic uncertainty from entity embedding variance (detects emerging entities), and 2) structural uncertainty from entity-relation co-occurrence (detects novel contexts). It proves these signals are non-redundant. The proposed method CAGP (Context-Aware Gaussian Process) combines semantic and structural uncertainty via learned weights.

Result: Empirical validation shows 100% of novel-context triples have frequency-matched in-distribution counterparts. CAGP achieves 0.94-0.99 AUROC on temporal OOD detection across benchmarks (60-80% relative improvement over baselines). On selective prediction, reduces errors by 43% at 85% answer rate. Complete frequency overlap confirmed on FB15k-237, WN18RR, YAGO3-10 datasets.

Conclusion: Relation-agnostic uncertainty in probabilistic KG embeddings is fundamentally limited for detecting novel relational contexts. The proposed uncertainty decomposition into semantic and structural components is theoretically sound and empirically effective, with CAGP significantly outperforming existing methods on OOD detection and selective prediction tasks.

Abstract: Probabilistic knowledge graph embeddings represent entities as distributions, using learned variances to quantify epistemic uncertainty. We identify a fundamental limitation: these variances are relation-agnostic, meaning an entity receives identical uncertainty regardless of relational context. This conflates two distinct out-of-distribution phenomena that behave oppositely: emerging entities (rare, poorly-learned) and novel relational contexts (familiar entities in unobserved relationships). We prove an impossibility result: any uncertainty estimator using only entity-level statistics independent of relation context achieves near-random OOD detection on novel contexts. We empirically validate this on three datasets, finding 100 percent of novel-context triples have frequency-matched in-distribution counterparts. This explains why existing probabilistic methods achieve 0.99 AUROC on random corruptions but only 0.52-0.64 on temporal distribution shift. We formalize uncertainty decomposition into complementary components: semantic uncertainty from entity embedding variance (detecting emerging entities) and structural uncertainty from entity-relation co-occurrence (detecting novel contexts). Our main theoretical result proves these signals are non-redundant, and that any convex combination strictly dominates either signal alone. Our method (CAGP) combines semantic and structural uncertainty via learned weights, achieving 0.94-0.99 AUROC on temporal OOD detection across multiple benchmarks, a 60-80 percent relative improvement over relation-agnostic baselines. Empirical validation confirms complete frequency overlap on three datasets (FB15k-237, WN18RR, YAGO3-10). On selective prediction, our method reduces errors by 43 percent at 85 percent answer rate.

[505] Expert System for Bitcoin Forecasting: Integrating Global Liquidity via TimeXer Transformers

Sravan Karthick T

Main category: cs.LG

TL;DR: TimeXer-Exog model with Global M2 Liquidity conditioning outperforms univariate models for Bitcoin price forecasting, reducing MSE by 89% at 70-day horizon.

Details

Motivation: Bitcoin price forecasting suffers from extreme volatility and non-stationarity, making traditional univariate time-series models ineffective for long horizons. There's a critical gap in incorporating macroeconomic factors as leading indicators.

Method: Integrated Global M2 Liquidity (aggregated from 18 major economies) as a leading exogenous variable with 12-week lag structure. Used TimeXer architecture to create liquidity-conditioned forecasting model (TimeXer-Exog) and compared against LSTM, N-BEATS, PatchTST, and univariate TimeXer benchmarks.

Result: At 70-day forecast horizon, TimeXer-Exog achieved MSE of 1.08e8, outperforming univariate TimeXer baseline by over 89%. Explicit macroeconomic conditioning significantly stabilized long-horizon forecasts.

Conclusion: Conditioning deep learning models on global liquidity provides substantial improvements in long-horizon Bitcoin price forecasting, demonstrating the importance of macroeconomic factors in cryptocurrency prediction.

Abstract: Bitcoin price forecasting is characterized by extreme volatility and non-stationarity, often defying traditional univariate time-series models over long horizons. This paper addresses a critical gap by integrating Global M2 Liquidity, aggregated from 18 major economies, as a leading exogenous variable with a 12-week lag structure. Using the TimeXer architecture, we compare a liquidity-conditioned forecasting model (TimeXer-Exog) against state-of-the-art benchmarks including LSTM, N-BEATS, PatchTST, and a standard univariate TimeXer. Experiments conducted on daily Bitcoin price data from January 2020 to August 2025 demonstrate that explicit macroeconomic conditioning significantly stabilizes long-horizon forecasts. At a 70-day forecast horizon, the proposed TimeXer-Exog model achieves a mean squared error (MSE) 1.08e8, outperforming the univariate TimeXer baseline by over 89 percent. These results highlight that conditioning deep learning models on global liquidity provides substantial improvements in long-horizon Bitcoin price forecasting.

[506] The Effectiveness of Approximate Regularized Replay for Efficient Supervised Fine-Tuning of Large Language Models

Matthew Riemer, Erik Miehling, Miao Liu, Djallel Bouneffouf, Murray Campbell

Main category: cs.LG

TL;DR: LoRA-based fine-tuning can catastrophically degrade model capabilities, but simple regularization techniques can virtually eliminate this problem while preserving general knowledge.

Details

Motivation: Parameter-efficient fine-tuning methods like LoRA are widely used but can severely degrade model capabilities even with small datasets and few training steps. The paper aims to address this catastrophic degradation problem.

Method: Proposes a regularized approximate replay approach that: 1) penalizes KL divergence with respect to the initial model, and 2) interleaves next token prediction data from a similar open access corpus to pre-training data. Applied to Qwen instruction-tuned models.

Result: The proposed recipe preserves general knowledge in the model without hindering plasticity to new tasks, with only modest computational overhead. It virtually eliminates the catastrophic degradation problem.

Conclusion: While straightforward LoRA-based fine-tuning can fail spectacularly, small tweaks to training procedure with minimal overhead can effectively preserve model capabilities during parameter-efficient fine-tuning.

Abstract: Although parameter-efficient fine-tuning methods, such as LoRA, only modify a small subset of parameters, they can have a significant impact on the model. Our instruction-tuning experiments show that LoRA-based supervised fine-tuning can catastrophically degrade model capabilities, even when trained on very small datasets for relatively few steps. With that said, we demonstrate that while the most straightforward approach (that is likely the most used in practice) fails spectacularly, small tweaks to the training procedure with very little overhead can virtually eliminate the problem. Particularly, in this paper we consider a regularized approximate replay approach which penalizes KL divergence with respect to the initial model and interleaves in data for next token prediction from a different, yet similar, open access corpus to what was used in pre-training. When applied to Qwen instruction-tuned models, we find that this recipe preserves general knowledge in the model without hindering plasticity to new tasks by adding a modest amount of computational overhead.

[507] Completed Hyperparameter Transfer across Modules, Width, Depth, Batch and Duration

Bruno Mlodozeniec, Pierre Ablin, Louis Béthune, Dan Busbridge, Michal Klein, Jason Ramapuram, Marco Cuturi

Main category: cs.LG

TL;DR: The paper proposes Complete^{(d)} Parameterisation for unified hyperparameter scaling across width, depth, batch size, and training duration, enabling per-module hyperparameter optimization and transfer for large-scale models.

Details

Motivation: Hyperparameter tuning significantly impacts training stability and performance of large models. While existing methods like μP enable hyperparameter transfer across model sizes, they don't handle scaling along multiple axes (width, depth, batch size, training duration) or support per-module hyperparameter optimization.

Method: Proposes Complete^{(d)} Parameterisation that unifies scaling across width and depth (adapting CompleteP) plus batch size and training duration. Investigates per-module hyperparameter optimization, characterizes challenges in high-dimensional hyperparameter landscapes, and provides practical optimization guidelines.

Result: Demonstrates that with proper parameterisation, hyperparameter transfer works even in per-module regime. Shows significant training speed improvements in Large Language Models using transferred per-module hyperparameters across learning rates, AdamW parameters, weight decay, initialization scales, and residual block multipliers.

Conclusion: Complete^{(d)} Parameterisation enables effective hyperparameter transfer across multiple scaling dimensions and supports per-module optimization, providing practical solutions for efficient large-scale model training with demonstrated speed improvements.

Abstract: Hyperparameter tuning can dramatically impact training stability and final performance of large-scale models. Recent works on neural network parameterisations, such as $μ$P, have enabled transfer of optimal global hyperparameters across model sizes. These works propose an empirical practice of search for optimal global base hyperparameters at a small model size, and transfer to a large size. We extend these works in two key ways. To handle scaling along most important scaling axes, we propose the Complete$^{(d)}$ Parameterisation that unifies scaling in width and depth – using an adaptation of CompleteP – as well as in batch-size and training duration. Secondly, with our parameterisation, we investigate per-module hyperparameter optimisation and transfer. We characterise the empirical challenges of navigating the high-dimensional hyperparameter landscape, and propose practical guidelines for tackling this optimisation problem. We demonstrate that, with the right parameterisation, hyperparameter transfer holds even in the per-module hyperparameter regime. Our study covers an extensive range of optimisation hyperparameters of modern models: learning rates, AdamW parameters, weight decay, initialisation scales, and residual block multipliers. Our experiments demonstrate significant training speed improvements in Large Language Models with the transferred per-module hyperparameters.

[508] BLISS: Bandit Layer Importance Sampling Strategy for Efficient Training of Graph Neural Networks

Omar Alsaqa, Linh Thi Hoang, Muhammed Fatih Balin

Main category: cs.LG

TL;DR: BLISS uses multi-armed bandits for dynamic node sampling in GNNs, improving efficiency while maintaining accuracy.

Details

Motivation: GNNs face computational bottlenecks on large graphs due to processing all neighbors for each node, creating memory and performance challenges.

Method: BLISS (Bandit Layer Importance Sampling Strategy) uses multi-armed bandits to dynamically select the most informative nodes at each layer, balancing exploration and exploitation for comprehensive graph coverage.

Result: BLISS maintains or exceeds the accuracy of full-batch training while being computationally efficient, and works with both GCNs and GATs by adapting to their aggregation mechanisms.

Conclusion: BLISS provides an effective adaptive sampling approach that overcomes computational bottlenecks in large-scale graph learning while preserving model performance.

Abstract: Graph Neural Networks (GNNs) are powerful tools for learning from graph-structured data, but their application to large graphs is hindered by computational costs. The need to process every neighbor for each node creates memory and computational bottlenecks. To address this, we introduce BLISS, a Bandit Layer Importance Sampling Strategy. It uses multi-armed bandits to dynamically select the most informative nodes at each layer, balancing exploration and exploitation to ensure comprehensive graph coverage. Unlike existing static sampling methods, BLISS adapts to evolving node importance, leading to more informed node selection and improved performance. It demonstrates versatility by integrating with both Graph Convolutional Networks (GCNs) and Graph Attention Networks (GATs), adapting its selection policy to their specific aggregation mechanisms. Experiments show that BLISS maintains or exceeds the accuracy of full-batch training.

[509] Causality-Inspired Safe Residual Correction for Multivariate Time Series

Jianxiang Xie, Yuncheng Hua

Main category: cs.LG

TL;DR: CRC is a causality-inspired safe residual correction framework that ensures non-degradation in multivariate forecasting by using direction-aware structure decoupling and strict safety mechanisms.

Details

Motivation: Modern multivariate forecasters (Transformers, GNNs) suffer from systematic errors at specific variables/horizons and lack guarantees against performance degradation in deployment. Existing post-hoc correction methods are greedy and can "help in the wrong way" by overcorrecting reliable predictions.

Method: CRC uses a causality-inspired encoder to expose direction-aware structure by decoupling self- and cross-variable dynamics, and a hybrid corrector to model residual errors. Correction is governed by a strict four-fold safety mechanism that prevents harmful updates.

Result: Experiments across multiple datasets and forecasting backbones show CRC consistently improves accuracy while ensuring exceptionally high non-degradation rates (NDR) through its safety mechanisms.

Conclusion: CRC provides a plug-and-play correction framework suited for safe and reliable deployment by addressing the critical “safety gap” in multivariate forecasting systems.

Abstract: While modern multivariate forecasters such as Transformers and GNNs achieve strong benchmark performance, they often suffer from systematic errors at specific variables or horizons and, critically, lack guarantees against performance degradation in deployment. Existing post-hoc residual correction methods attempt to fix these errors, but are inherently greedy: although they may improve average accuracy, they can also “help in the wrong way” by overcorrecting reliable predictions and causing local failures in unseen scenarios. To address this critical “safety gap,” we propose CRC (Causality-inspired Safe Residual Correction), a plug-and-play framework explicitly designed to ensure non-degradation. CRC follows a divide-and-conquer philosophy: it employs a causality-inspired encoder to expose direction-aware structure by decoupling self- and cross-variable dynamics, and a hybrid corrector to model residual errors. Crucially, the correction process is governed by a strict four-fold safety mechanism that prevents harmful updates. Experiments across multiple datasets and forecasting backbones show that CRC consistently improves accuracy, while an in-depth ablation study confirms that its core safety mechanisms ensure exceptionally high non-degradation rates (NDR), making CRC a correction framework suited for safe and reliable deployment.

[510] AFA-LoRA: Enabling Non-Linear Adaptations in LoRA with Activation Function Annealing

Jiacheng Li, Jianchao Tan, Zhidong Yang, Feiye Huo, Yerui Sun, Yuchen Xie, Xunliang Cai

Main category: cs.LG

TL;DR: AFA-LoRA introduces an annealed activation function to add non-linear expressivity to LoRA while maintaining mergeability, bridging the gap between linear and non-linear fine-tuning.

Details

Motivation: LoRA's linear adaptation limits expressive power, creating a performance gap between linear and non-linear training methods. The authors aim to enhance LoRA's capabilities while preserving its practical mergeability feature.

Method: AFA-LoRA uses an annealed activation function that transitions from non-linear to linear transformation during training, allowing initial strong representational capabilities before converging to a mergeable linear form.

Result: AFA-LoRA reduces the performance gap between LoRA and full-parameter training across supervised fine-tuning, reinforcement learning, and speculative decoding tasks.

Conclusion: The work enables a more powerful and practical paradigm of parameter-efficient adaptation by adding non-linear expressivity to LoRA while maintaining seamless mergeability.

Abstract: Low-Rank Adaptation (LoRA) is a widely adopted parameter-efficient fine-tuning (PEFT) method. However, its linear adaptation process limits its expressive power. This means there is a gap between the expressive power of linear training and non-linear training. To bridge this gap, we propose AFA-LoRA, a novel training strategy that brings non-linear expressivity to LoRA while maintaining its seamless mergeability. Our key innovation is an annealed activation function that transitions from a non-linear to a linear transformation during training, allowing the adapter to initially adopt stronger representational capabilities before converging to a mergeable linear form. We implement our method on supervised fine-tuning, reinforcement learning, and speculative decoding. The results show that AFA-LoRA reduces the performance gap between LoRA and full-parameter training. This work enables a more powerful and practical paradigm of parameter-efficient adaptation.

[511] AMBIT: Augmenting Mobility Baselines with Interpretable Trees

Qizhi Wang

Main category: cs.LG

TL;DR: AMBIT is a gray-box framework that combines physical mobility models with interpretable tree models for OD flow prediction, achieving both high accuracy and interpretability.

Details

Motivation: There's a conflict between high accuracy and clear interpretability in practical OD flow prediction deployments. Physical models are fragile at high temporal resolutions, while black-box models lack interpretability needed for urban decision-making.

Method: Developed AMBIT framework that first audits classical spatial interaction models on NYC taxi OD data, identifies PPML gravity as strongest baseline, then builds residual learners using gradient-boosted trees and SHAP analysis on top of physical baselines.

Result: Physics-grounded residuals approach accuracy of strong tree-based predictors while retaining interpretable structure. POI-anchored residuals are consistently competitive and most robust under spatial generalization.

Conclusion: AMBIT provides a reproducible pipeline with rich diagnostics and spatial error analysis that balances accuracy and interpretability for urban decision-making applications.

Abstract: Origin-destination (OD) flow prediction remains a core task in GIS and urban analytics, yet practical deployments face two conflicting needs: high accuracy and clear interpretability. This paper develops AMBIT, a gray-box framework that augments physical mobility baselines with interpretable tree models. We begin with a comprehensive audit of classical spatial interaction models on a year-long, hourly NYC taxi OD dataset. The audit shows that most physical models are fragile at this temporal resolution; PPML gravity is the strongest physical baseline, while constrained variants improve when calibrated on full OD margins but remain notably weaker. We then build residual learners on top of physical baselines using gradient-boosted trees and SHAP analysis, demonstrating that (i) physics-grounded residuals approach the accuracy of a strong tree-based predictor while retaining interpretable structure, and (ii) POI-anchored residuals are consistently competitive and most robust under spatial generalization. We provide a reproducible pipeline, rich diagnostics, and spatial error analysis designed for urban decision-making.

[512] GLUE: Gradient-free Learning to Unify Experts

Jong-Ik Park, Shreyas Chaudhari, Srinivasa Pranav, Carlee Joe-Wong, José M. F. Moura

Main category: cs.LG

TL;DR: GLUE is a gradient-free method that learns optimal convex combinations of pretrained expert models to create a strong initialization for new target domains, outperforming heuristic blending methods and matching backpropagation-based approaches with much lower computational cost.

Details

Motivation: In deployed systems with multiple pretrained specialist models, new target domains require domain expansion. Existing methods use heuristic blending (based on data size or proxy metrics) which often yields poor target-domain accuracy, while learning-based blending requires expensive full backpropagation through the network.

Method: GLUE (Gradient-free Learning To Unify Experts) initializes the target model as a convex combination of fixed experts and learns the mixture coefficients via a gradient-free two-point (SPSA) update that requires only two forward passes per step, avoiding expensive backpropagation.

Result: Across three datasets and three network architectures, GLUE produces a single prior that can be fine-tuned effectively to outperform baselines. It improves test accuracy by up to 8.5% over data-size weighting and up to 9.1% over proxy-metric selection, while matching backpropagation-based full-gradient mixing performance within 1.4%.

Conclusion: GLUE provides an efficient gradient-free approach to unify expert models for domain expansion, achieving strong performance with minimal computational overhead compared to both heuristic methods and expensive gradient-based approaches.

Abstract: In many deployed systems (multilingual ASR, cross-hospital imaging, region-specific perception), multiple pretrained specialist models coexist. Yet, new target domains often require domain expansion: a generalized model that performs well beyond any single specialist’s domain. Given such a new target domain, prior works seek a single strong initialization prior for the model parameters by first blending expert models to initialize a target model. However, heuristic blending – using coefficients based on data size or proxy metrics – often yields lower target-domain test accuracy, and learning the coefficients on the target loss typically requires computationally-expensive full backpropagation through the network. We propose GLUE, Gradient-free Learning To Unify Experts, which initializes the target model as a convex combination of fixed experts, learning the mixture coefficients of this combination via a gradient-free two-point (SPSA) update that requires only two forward passes per step. Across experiments on three datasets and three network architectures, GLUE produces a single prior that can be fine-tuned effectively to outperform baselines. GLUE improves test accuracy by up to 8.5% over data-size weighting and by up to 9.1% over proxy-metric selection. GLUE either outperforms backpropagation-based full-gradient mixing or matches its performance within 1.4%.

[513] The Bayesian Geometry of Transformer Attention

Naman Aggarwal, Siddhartha R. Dalal, Vishal Misra

Main category: cs.LG

TL;DR: Transformers implement Bayesian inference through geometric mechanisms: residual streams store beliefs, feed-forward networks update posteriors, and attention provides routing. This is verified using controlled ‘Bayesian wind tunnels’ where true posteriors are known and memorization is impossible.

Details

Motivation: To rigorously verify whether transformers perform Bayesian reasoning in context, which has been impossible with natural data due to lack of analytic posteriors and conflation with memorization in large models.

Method: Construct ‘Bayesian wind tunnels’ - controlled environments where true posterior is known in closed form and memorization is provably impossible. Test small transformers on two tasks: bijection elimination and Hidden Markov Model state tracking, using geometric diagnostics to analyze mechanisms.

Result: Small transformers reproduce Bayesian posteriors with 10^-3-10^-4 bit accuracy, while capacity-matched MLPs fail by orders of magnitude. Transformers implement Bayesian inference through specific geometric mechanisms: residual streams serve as belief substrate, feed-forward networks perform posterior updates, and attention provides content-addressable routing.

Conclusion: Hierarchical attention realizes Bayesian inference by geometric design, explaining both the necessity of attention and failure of flat architectures. Bayesian wind tunnels provide foundation for mechanistically connecting small, verifiable systems to reasoning in large language models.

Abstract: Transformers often appear to perform Bayesian reasoning in context, but verifying this rigorously has been impossible: natural data lack analytic posteriors, and large models conflate reasoning with memorization. We address this by constructing \emph{Bayesian wind tunnels} – controlled environments where the true posterior is known in closed form and memorization is provably impossible. In these settings, small transformers reproduce Bayesian posteriors with $10^{-3}$-$10^{-4}$ bit accuracy, while capacity-matched MLPs fail by orders of magnitude, establishing a clear architectural separation. Across two tasks – bijection elimination and Hidden Markov Model (HMM) state tracking – we find that transformers implement Bayesian inference through a consistent geometric mechanism: residual streams serve as the belief substrate, feed-forward networks perform the posterior update, and attention provides content-addressable routing. Geometric diagnostics reveal orthogonal key bases, progressive query-key alignment, and a low-dimensional value manifold parameterized by posterior entropy. During training this manifold unfurls while attention patterns remain stable, a \emph{frame-precision dissociation} predicted by recent gradient analyses. Taken together, these results demonstrate that hierarchical attention realizes Bayesian inference by geometric design, explaining both the necessity of attention and the failure of flat architectures. Bayesian wind tunnels provide a foundation for mechanistically connecting small, verifiable systems to reasoning phenomena observed in large language models.

[514] Collaborative Optimization of Multiclass Imbalanced Learning: Density-Aware and Region-Guided Boosting

Chuantao Li, Zhi Li, Jiahao Xu, Jie Li, Sheng Li

Main category: cs.LG

TL;DR: A collaborative optimization boosting model for multiclass imbalanced learning that integrates density and confidence factors to jointly optimize imbalanced learning and model training through noise-resistant weight updates and dynamic sampling.

Details

Motivation: Existing studies haven't explored collaborative optimization between imbalanced learning and model training, which limits further performance improvements in handling class imbalance classification bias.

Method: Proposes a collaborative optimization boosting model that integrates density factor and confidence factor to design a noise-resistant weight update mechanism and dynamic sampling strategy. These modules work together to orchestrate weight updates, sample region partitioning, and region-guided sampling.

Result: Extensive experiments on 20 public imbalanced datasets show the proposed model significantly outperforms eight state-of-the-art baselines.

Conclusion: The study successfully achieves collaborative optimization of imbalanced learning and model training through a simple but effective boosting model that integrates density and confidence factors with noise-resistant mechanisms.

Abstract: Numerous studies attempt to mitigate classification bias caused by class imbalance. However, existing studies have yet to explore the collaborative optimization of imbalanced learning and model training. This constraint hinders further performance improvements. To bridge this gap, this study proposes a collaborative optimization Boosting model of multiclass imbalanced learning. This model is simple but effective by integrating the density factor and the confidence factor, this study designs a noise-resistant weight update mechanism and a dynamic sampling strategy. Rather than functioning as independent components, these modules are tightly integrated to orchestrate weight updates, sample region partitioning, and region-guided sampling. Thus, this study achieves the collaborative optimization of imbalanced learning and model training. Extensive experiments on 20 public imbalanced datasets demonstrate that the proposed model significantly outperforms eight state-of-the-art baselines. The code for the proposed model is available at: https://github.com/ChuantaoLi/DARG.

[515] Toward Real-World IoT Security: Concept Drift-Resilient IoT Botnet Detection via Latent Space Representation Learning and Alignment

Hassan Wasswa, Timothy Lynar

Main category: cs.LG

TL;DR: A framework for adaptive IoT threat detection that avoids continuous retraining by using latent-space alignment and graph neural networks to handle concept drift.

Details

Motivation: Current AI-based IoT threat detection models rely on static datasets and require frequent retraining, which is computationally expensive and suffers from catastrophic forgetting when dealing with dynamic real-world IoT NetFlow traffic affected by concept drift.

Method: Train a classifier once on historical traffic in latent space, then use an alignment model to map incoming traffic to this learned space. Transform latent representations into graph structures to capture inter-instance relationships, and classify using graph neural networks.

Result: Experimental evaluations on real-world heterogeneous IoT traffic datasets show the framework maintains robust detection performance under concept drift without continuous retraining.

Conclusion: The proposed scalable framework enables practical deployment in dynamic, large-scale IoT environments by eliminating the need for continuous classifier retraining while preserving knowledge of previously observed attacks.

Abstract: Although AI-based models have achieved high accuracy in IoT threat detection, their deployment in enterprise environments is constrained by reliance on stationary datasets that fail to reflect the dynamic nature of real-world IoT NetFlow traffic, which is frequently affected by concept drift. Existing solutions typically rely on periodic classifier retraining, resulting in high computational overhead and the risk of catastrophic forgetting. To address these challenges, this paper proposes a scalable framework for adaptive IoT threat detection that eliminates the need for continuous classifier retraining. The proposed approach trains a classifier once on latent-space representations of historical traffic, while an alignment model maps incoming traffic to the learned historical latent space prior to classification, thereby preserving knowledge of previously observed attacks. To capture inter-instance relationships among attack samples, the low-dimensional latent representations are further transformed into a graph-structured format and classified using a graph neural network. Experimental evaluations on real-world heterogeneous IoT traffic datasets demonstrate that the proposed framework maintains robust detection performance under concept drift. These results highlight the framework’s potential for practical deployment in dynamic and large-scale IoT environments.

[516] The Quest for Winning Tickets in Low-Rank Adapters

Hamed Damirchi, Cristian Rodriguez-Opazo, Ehsan Abbasnejad, Zhen Zhang, Javen Shi

Main category: cs.LG

TL;DR: Partial-LoRA identifies sparse subnetworks within LoRA adapters that match dense adapter performance, reducing trainable parameters by up to 87% while maintaining or improving accuracy across vision and language tasks.

Details

Motivation: The paper investigates whether the Lottery Ticket Hypothesis extends to parameter-efficient fine-tuning (PEFT) methods like LoRA, aiming to understand if sparse subnetworks exist within adapters and to develop more efficient adaptation strategies for large pretrained models.

Method: The authors propose Partial-LoRA, which systematically identifies sparse subnetworks within LoRA adapters by focusing on how sparsity is applied across layers rather than specific weights. The method trains sparse low-rank adapters aligned with task-relevant subspaces of the pretrained model.

Result: Experiments across 8 vision and 12 language tasks in single-task and multi-task settings show that Partial-LoRA reduces trainable parameters by up to 87% while maintaining or improving accuracy compared to dense adapters. The effectiveness depends more on layer-wise sparsity distribution than specific weight selection.

Conclusion: The Lottery Ticket Hypothesis holds within LoRA adapters, revealing sparse subnetworks that match dense adapter performance. Partial-LoRA provides an efficient adaptation strategy that deepens theoretical understanding of transfer learning and opens new avenues for parameter-efficient fine-tuning.

Abstract: The Lottery Ticket Hypothesis (LTH) suggests that over-parameterized neural networks contain sparse subnetworks (“winning tickets”) capable of matching full model performance when trained from scratch. With the growing reliance on fine-tuning large pretrained models, we investigate whether LTH extends to parameter-efficient fine-tuning (PEFT), specifically focusing on Low-Rank Adaptation (LoRA) methods. Our key finding is that LTH holds within LoRAs, revealing sparse subnetworks that can match the performance of dense adapters. In particular, we find that the effectiveness of sparse subnetworks depends more on how much sparsity is applied in each layer than on the exact weights included in the subnetwork. Building on this insight, we propose Partial-LoRA, a method that systematically identifies said subnetworks and trains sparse low-rank adapters aligned with task-relevant subspaces of the pre-trained model. Experiments across 8 vision and 12 language tasks in both single-task and multi-task settings show that Partial-LoRA reduces the number of trainable parameters by up to 87%, while maintaining or improving accuracy. Our results not only deepen our theoretical understanding of transfer learning and the interplay between pretraining and fine-tuning but also open new avenues for developing more efficient adaptation strategies.

[517] Predicting LLM Correctness in Prosthodontics Using Metadata and Hallucination Signals

Lucky Susanto, Anasta Pranawijayana, Cortino Sukotjo, Soni Prasad, Derry Wijaya

Main category: cs.LG

TL;DR: Researchers developed a metadata-based approach to predict correctness of LLM responses on medical exams, achieving up to 7.14% accuracy improvement and 83.12% precision over baseline, but found methods not yet robust enough for high-stakes deployment.

Details

Motivation: LLMs are increasingly used in high-stakes domains like healthcare where hallucination risks are critical, but predicting whether an LLM's response is correct remains an underexplored problem despite efforts to detect and mitigate hallucinations.

Method: Analyzed GPT-4o and OSS-120B on multiple-choice prosthodontics exam, using metadata and hallucination signals across three prompting strategies to build correctness predictors for each (model, prompting) pair.

Result: Metadata-based approach improved accuracy by up to +7.14% and achieved 83.12% precision over baseline; actual hallucination strongly indicates incorrectness but metadata alone doesn’t reliably predict hallucination; prompting strategies significantly alter models’ internal behaviors despite not affecting overall accuracy.

Conclusion: The approach shows promise for developing reliability signals in LLMs but is not yet robust enough for critical high-stakes deployment, highlighting the complex relationship between metadata, hallucination, and prompting strategies.

Abstract: Large language models (LLMs) are increasingly adopted in high-stakes domains such as healthcare and medical education, where the risk of generating factually incorrect (i.e., hallucinated) information is a major concern. While significant efforts have been made to detect and mitigate such hallucinations, predicting whether an LLM’s response is correct remains a critical yet underexplored problem. This study investigates the feasibility of predicting correctness by analyzing a general-purpose model (GPT-4o) and a reasoning-centric model (OSS-120B) on a multiple-choice prosthodontics exam. We utilize metadata and hallucination signals across three distinct prompting strategies to build a correctness predictor for each (model, prompting) pair. Our findings demonstrate that this metadata-based approach can improve accuracy by up to +7.14% and achieve a precision of 83.12% over a baseline that assumes all answers are correct. We further show that while actual hallucination is a strong indicator of incorrectness, metadata signals alone are not reliable predictors of hallucination. Finally, we reveal that prompting strategies, despite not affecting overall accuracy, significantly alter the models’ internal behaviors and the predictive utility of their metadata. These results present a promising direction for developing reliability signals in LLMs but also highlight that the methods explored in this paper are not yet robust enough for critical, high-stakes deployment.

[518] Decomposing Task Vectors for Refined Model Editing

Hamed Damirchi, Ehsan Abbasnejad, Zhen Zhang, Javen Shi

Main category: cs.LG

TL;DR: A method to decompose task vectors into shared and unique components for more precise control over concept manipulation in pre-trained models, improving multi-task merging, style mixing, and toxicity reduction.

Details

Motivation: Task vectors allow combining behaviors without large datasets, but overlapping concepts cause interference during arithmetic operations, leading to unpredictable outcomes. Need better control over concept manipulation.

Method: Proposes principled decomposition method separating each task vector into two components: shared knowledge across multiple tasks and unique information specific to each task. Uses invariant subspaces across projections.

Result: Improves multi-task merging in image classification by 5% using shared components; enables clean style mixing in diffusion models without degradation; achieves 47% toxicity reduction in language models while preserving general knowledge performance.

Conclusion: Provides new framework for understanding and controlling task vector arithmetic, addressing fundamental limitations in model editing operations for more precise concept manipulation.

Abstract: Large pre-trained models have transformed machine learning, yet adapting these models effectively to exhibit precise, concept-specific behaviors remains a significant challenge. Task vectors, defined as the difference between fine-tuned and pre-trained model parameters, provide a mechanism for steering neural networks toward desired behaviors. This has given rise to large repositories dedicated to task vectors tailored for specific behaviors. The arithmetic operation of these task vectors allows for the seamless combination of desired behaviors without the need for large datasets. However, these vectors often contain overlapping concepts that can interfere with each other during arithmetic operations, leading to unpredictable outcomes. We propose a principled decomposition method that separates each task vector into two components: one capturing shared knowledge across multiple task vectors, and another isolating information unique to each specific task. By identifying invariant subspaces across projections, our approach enables more precise control over concept manipulation without unintended amplification or diminution of other behaviors. We demonstrate the effectiveness of our decomposition method across three domains: improving multi-task merging in image classification by 5% using shared components as additional task vectors, enabling clean style mixing in diffusion models without generation degradation by mixing only the unique components, and achieving 47% toxicity reduction in language models while preserving performance on general knowledge tasks by negating the toxic information isolated to the unique component. Our approach provides a new framework for understanding and controlling task vector arithmetic, addressing fundamental limitations in model editing operations.

[519] Towards Reliable Evaluation of Adversarial Robustness for Spiking Neural Networks

Jihang Wang, Dongcheng Zhao, Ruolin Chen, Qian Zhang, Yi Zeng

Main category: cs.LG

TL;DR: The paper proposes a more reliable framework for evaluating adversarial robustness in Spiking Neural Networks (SNNs) by addressing gradient vanishing issues with adaptive surrogate gradients and stable attack methods.

Details

Motivation: SNNs suffer from vanishing gradients due to binary spike activations, making gradient-based adversarial robustness evaluation unreliable. Existing surrogate gradient methods have unclear effectiveness under strong attacks, leading to potentially overestimated robustness claims.

Method: 1) Theoretical analysis of gradient vanishing in surrogate gradients; 2) Adaptive Sharpness Surrogate Gradient (ASSG) that evolves surrogate function shape based on input distribution during attacks; 3) Stable Adaptive Projected Gradient Descent (SA-PGD) attack with adaptive step size under L∞ constraint for faster, more stable convergence with imprecise gradients.

Result: The approach substantially increases attack success rates across diverse adversarial training schemes, SNN architectures, and neuron models. Experiments reveal current SNN robustness has been significantly overestimated, demonstrating the need for more dependable adversarial training methods.

Conclusion: The proposed framework provides a more generalized and reliable evaluation of SNN adversarial robustness, exposing vulnerabilities in current SNNs and highlighting the need for improved adversarial training approaches.

Abstract: Spiking Neural Networks (SNNs) utilize spike-based activations to mimic the brain’s energy-efficient information processing. However, the binary and discontinuous nature of spike activations causes vanishing gradients, making adversarial robustness evaluation via gradient descent unreliable. While improved surrogate gradient methods have been proposed, their effectiveness under strong adversarial attacks remains unclear. We propose a more reliable framework for evaluating SNN adversarial robustness. We theoretically analyze the degree of gradient vanishing in surrogate gradients and introduce the Adaptive Sharpness Surrogate Gradient (ASSG), which adaptively evolves the shape of the surrogate function according to the input distribution during attack iterations, thereby enhancing gradient accuracy while mitigating gradient vanishing. In addition, we design an adversarial attack with adaptive step size under the $L_\infty$ constraint-Stable Adaptive Projected Gradient Descent (SA-PGD), achieving faster and more stable convergence under imprecise gradients. Extensive experiments show that our approach substantially increases attack success rates across diverse adversarial training schemes, SNN architectures and neuron models, providing a more generalized and reliable evaluation of SNN adversarial robustness. The experimental results further reveal that the robustness of current SNNs has been significantly overestimated and highlighting the need for more dependable adversarial training methods.

[520] TimePerceiver: An Encoder-Decoder Framework for Generalized Time-Series Forecasting

Jaebin Lee, Hankook Lee

Main category: cs.LG

TL;DR: TimePerceiver is a unified encoder-decoder forecasting framework with aligned training strategy that generalizes time-series tasks to include extrapolation, interpolation, and imputation, outperforming SOTA baselines.

Details

Motivation: Current time-series forecasting research focuses too much on encoder design while treating prediction and training as separate concerns. There's a need for a holistic framework that integrates encoding, decoding, and training strategies.

Method: Generalizes forecasting to diverse temporal prediction objectives. Uses novel encoder-decoder architecture with latent bottleneck representations for encoding (capturing temporal and cross-channel dependencies) and learnable queries for decoding (retrieving relevant information for target timestamps).

Result: Extensive experiments show the framework consistently and significantly outperforms prior state-of-the-art baselines across a wide range of benchmark datasets.

Conclusion: TimePerceiver provides a unified approach that effectively integrates encoding, decoding, and training strategies for time-series forecasting, demonstrating superior performance across diverse temporal prediction tasks.

Abstract: In machine learning, effective modeling requires a holistic consideration of how to encode inputs, make predictions (i.e., decoding), and train the model. However, in time-series forecasting, prior work has predominantly focused on encoder design, often treating prediction and training as separate or secondary concerns. In this paper, we propose TimePerceiver, a unified encoder-decoder forecasting framework that is tightly aligned with an effective training strategy. To be specific, we first generalize the forecasting task to include diverse temporal prediction objectives such as extrapolation, interpolation, and imputation. Since this generalization requires handling input and target segments that are arbitrarily positioned along the temporal axis, we design a novel encoder-decoder architecture that can flexibly perceive and adapt to these varying positions. For encoding, we introduce a set of latent bottleneck representations that can interact with all input segments to jointly capture temporal and cross-channel dependencies. For decoding, we leverage learnable queries corresponding to target timestamps to effectively retrieve relevant information. Extensive experiments demonstrate that our framework consistently and significantly outperforms prior state-of-the-art baselines across a wide range of benchmark datasets. The code is available at https://github.com/efficient-learning-lab/TimePerceiver.

[521] On Admissible Rank-based Input Normalization Operators

Taeyun Kim

Main category: cs.LG

TL;DR: The paper shows that current differentiable sorting/ranking operators fail stability requirements for rank-based normalization and proposes axioms for proper design.

Details

Motivation: Rank-based normalization is widely used for its robustness to scale and transformations, but existing differentiable operators lack formal stability guarantees under monotone transformations and batch variations.

Method: Proposes three axioms for rank-based normalization invariance/stability, proves any valid operator must factor into feature-wise rank representation plus monotone Lipschitz scalarization, and constructs a minimal operator meeting these criteria.

Result: Shows existing differentiable sorting/ranking operators fundamentally fail stability criteria due to structural design, not implementation. The proposed axioms delineate valid design space and separate from continuous-relaxation methods.

Conclusion: Formally establishes structural requirements for stable rank-based normalization operators, providing theoretical foundation for future designs and showing current differentiable approaches are fundamentally flawed.

Abstract: Rank-based input normalization is a workhorse of modern machine learning, prized for its robustness to scale, monotone transformations, and batch-to-batch variation. In many real systems, the ordering of feature values matters far more than their raw magnitudes - yet the structural conditions that a rank-based normalization operator must satisfy to remain stable under these invariances have never been formally pinned down. We show that widely used differentiable sorting and ranking operators fundamentally fail these criteria. Because they rely on value gaps and batch-level pairwise interactions, they are intrinsically unstable under strictly monotone transformations, shifts in mini-batch composition, and even tiny input perturbations. Crucially, these failures stem from the operators’ structural design, not from incidental implementation choices. To address this, we propose three axioms that formalize the minimal invariance and stability properties required of rank-based input normalization. We prove that any operator satisfying these axioms must factor into (i) a feature-wise rank representation and (ii) a scalarization map that is both monotone and Lipschitz-continuous. We then construct a minimal operator that meets these criteria and empirically show that the resulting constraints are non-trivial in realistic setups. Together, our results sharply delineate the design space of valid rank-based normalization operators and formally separate them from existing continuous-relaxation-based sorting methods.

[522] Data-Driven Analysis of Crash Patterns in SAE Level 2 and Level 4 Automated Vehicles Using K-means Clustering and Association Rule Mining

Jewel Rana Palit, Vijayalakshmi K Kumarasamy, Osama A. Osman

Main category: cs.LG

TL;DR: This study analyzes over 2,500 AV crash records from NHTSA using a two-stage data mining framework to uncover crash dynamics across SAE Levels 2 and 4, identifying behavioral clusters and multivariate relationships between crash patterns and contributors.

Details

Motivation: Recent crash data shows AV behavior can deviate from expected safety outcomes, raising concerns about AV safety in mixed traffic environments. Most existing studies rely on small, California-centered datasets with limited focus on understanding crash trends across different SAE automation levels.

Method: A two-stage data mining framework: 1) K-means clustering to segment 2,500+ AV crash records into 4 distinct behavioral clusters based on temporal, spatial, and environmental factors; 2) Association Rule Mining (ARM) to extract interpretable multivariate relationships between crash patterns and contributors (lighting, surface conditions, vehicle dynamics, environmental conditions) within each cluster.

Result: The analysis uncovers underlying crash dynamics across SAE Levels 2 and 4, identifying distinct behavioral clusters and revealing multivariate relationships between crash patterns and various contributing factors.

Conclusion: The insights provide actionable guidance for AV developers, safety regulators, and policymakers in formulating AV deployment strategies and minimizing crash risks, addressing the need for broader understanding of AV safety beyond limited regional datasets.

Abstract: Automated Vehicles (AV) hold potential to reduce or eliminate human driving errors, enhance traffic safety, and support sustainable mobility. Recently, crash data has increasingly revealed that AV behavior can deviate from expected safety outcomes, raising concerns about the technology’s safety and operational reliability in mixed traffic environments. While past research has investigated AV crash, most studies rely on small-size California-centered datasets, with a limited focus on understanding crash trends across various SAE Levels of automation. This study analyzes over 2,500 AV crash records from the United States National Highway Traffic Safety Administration (NHTSA), covering SAE Levels 2 and 4, to uncover underlying crash dynamics. A two-stage data mining framework is developed. K-means clustering is first applied to segment crash records into 4 distinct behavioral clusters based on temporal, spatial, and environmental factors. Then, Association Rule Mining (ARM) is used to extract interpretable multivariate relationships between crash patterns and crash contributors including lighting conditions, surface condition, vehicle dynamics, and environmental conditions within each cluster. These insights provide actionable guidance for AV developers, safety regulators, and policymakers in formulating AV deployment strategies and minimizing crash risks.

[523] Energy-Guided Flow Matching Enables Few-Step Conformer Generation and Ground-State Identification

Guikun Xu, Xiaohan Yi, Peilin Zhao, Yatao Bian

Main category: cs.LG

TL;DR: EnFlow is a unified framework combining flow matching with an energy model to generate low-energy molecular conformer ensembles and identify ground-state conformations efficiently.

Details

Motivation: Current approaches are fragmented: generative models capture diversity but lack reliable energy calibration, while deterministic predictors target single structures and fail to represent ensemble variability. Physics-based pipelines remain computationally demanding.

Method: EnFlow couples flow matching with an explicitly learned energy model through energy-guided sampling along a non-Gaussian flow matching path. It incorporates energy-gradient guidance during sampling to steer trajectories toward lower-energy regions.

Result: Extensive experiments on GEOM-QM9 and GEOM-Drugs show EnFlow improves generation metrics with 1-2 ODE-steps and reduces ground-state prediction errors compared to state-of-the-art methods.

Conclusion: EnFlow provides a unified framework that simultaneously addresses both conformational diversity and energy accuracy, enabling efficient generation of low-energy conformer ensembles and accurate ground-state identification.

Abstract: Generating low-energy conformer ensembles and identifying ground-state conformations from molecular graphs remain computationally demanding with physics-based pipelines. Current learning-based approaches often suffer from a fragmented paradigm: generative models capture diversity but lack reliable energy calibration, whereas deterministic predictors target a single structure and fail to represent ensemble variability. Here we present EnFlow, a unified framework that couples flow matching (FM) with an explicitly learned energy model through an energy-guided sampling scheme defined along a non-Gaussian FM path. By incorporating energy-gradient guidance during sampling, our method steers trajectories toward lower-energy regions, substantially improving conformational fidelity, particularly in the few-step regime. The learned energy function further enables efficient energy-based ranking of generated ensembles for accurate ground-state identification. Extensive experiments on GEOM-QM9 and GEOM-Drugs demonstrate that EnFlow simultaneously improves generation metrics with 1–2 ODE-steps and reduces ground-state prediction errors compared with state-of-the-art methods.

[524] Scaling Unverifiable Rewards: A Case Study on Visual Insights

Shuyu Gan, James Mooney, Pan Hao, Renxiang Wang, Mingyi Hong, Qianwen Wang, Dongyeop Kang

Main category: cs.LG

TL;DR: Selective TTS: A process-based refinement framework that distributes compute across stages in multi-agent pipelines instead of repeated temporal refinement, using process-specific judges to prune low-quality branches early and mitigate judge drift.

Details

Motivation: Real-world multi-stage pipeline tasks often lack verifiable final rewards or sufficient data to train robust reward models, making judge-based refinement prone to accumulating errors over stages. Traditional Test-Time Scaling (TTS) with iterative refinement struggles with these open-ended tasks.

Method: Selective TTS distributes compute across different stages in multi-agent pipelines rather than repeated refinement over time. It uses process-specific judges to prune low-quality branches early, mitigating judge drift. The framework is grounded in a data science pipeline with an end-to-end multi-agent system for generating charts and reports, including a reliable LLM-based judge model aligned with human experts.

Result: Selective TTS improves insight quality under fixed compute budget, increasing mean scores from 61.64 to 65.86 while reducing variance. The LLM-based judge model achieves alignment with human experts (Kendall’s τ=0.55).

Conclusion: Selective TTS represents a first step toward scaling complex, open-ended tasks with unverifiable rewards, such as scientific discovery and story generation, by stabilizing refinement through process-based compute distribution and early pruning.

Abstract: Large Language Model (LLM) agents can increasingly automate complex reasoning through Test-Time Scaling (TTS), iterative refinement guided by reward signals. However, many real-world tasks involve multi-stage pipeline whose final outcomes lack verifiable rewards or sufficient data to train robust reward models, making judge-based refinement prone to accumulate error over stages. We propose Selective TTS, a process-based refinement framework that scales inference across different stages in multi-agent pipeline, instead of repeated refinement over time by prior work. By distributing compute across stages and pruning low-quality branches early using process-specific judges, Selective TTS mitigates the judge drift and stabilizes refinement. Grounded in the data science pipeline, we build an end-to-end multi-agent pipeline for generating visually insightful charts and report of given dataset, and design a reliable LLM-based judge model, aligned with human experts (Kendall’s τ=0.55). Our proposed selective TTS then improves insight quality under a fixed compute budget, increasing mean scores from 61.64 to 65.86 while reducing variance. We hope our findings serve as the first step toward to scaling complex, open-ended tasks with unverifiable rewards, such as scientific discovery and story generation.

[525] Cryptocurrency Price Prediction Using Parallel Gated Recurrent Units

Milad Asadpour, Alireza Rezaee, Farshid Hajati

Main category: cs.LG

TL;DR: Proposes PGRU, a parallel gated recurrent units model for cryptocurrency price prediction using multiple independent RNNs with different price features, achieving MAPE of 2.64-3.24%.

Details

Motivation: Cryptocurrency markets attract substantial investment but face price volatility challenges. Accurate price prediction is crucial for investors, requiring efficient methods that handle cryptocurrency's unique characteristics and price fluctuations.

Method: Parallel Gated Recurrent Units (PGRU) model uses multiple independent recurrent neural networks operating in parallel, each processing different price-related features. The outputs are combined by a neural network for final price prediction.

Result: Achieves MAPE of 3.243% for window length 20 and 2.641% for window length 15, demonstrating higher accuracy with fewer input data and lower computational cost compared to existing methods.

Conclusion: PGRU provides an effective deep learning approach for cryptocurrency price forecasting, offering improved accuracy and efficiency through parallel processing of diverse price features.

Abstract: According to the advent of cryptocurrencies and Bitcoin, many investments and businesses are now conducted online through cryptocurrencies. Among them, Bitcoin uses blockchain technology to make transactions secure, transparent, traceable, and immutable. It also exhibits significant price fluctuations and performance, which has attracted substantial attention, especially in financial sectors. Consequently, a wide range of investors and individuals have turned to investing in the cryptocurrency market. One of the most important challenges in economics is price forecasting for future trades. Cryptocurrencies are no exception, and investors are looking for methods to predict prices; various theories and methods have been proposed in this field. This paper presents a new deep model, called \emph{Parallel Gated Recurrent Units} (PGRU), for cryptocurrency price prediction. In this model, recurrent neural networks forecast prices in a parallel and independent way. The parallel networks utilize different inputs, each representing distinct price-related features. Finally, the outputs of the parallel networks are combined by a neural network to forecast the future price of cryptocurrencies. The experimental results indicate that the proposed model achieves mean absolute percentage errors (MAPE) of 3.243% and 2.641% for window lengths 20 and 15, respectively. Our method therefore attains higher accuracy and efficiency with fewer input data and lower computational cost compared to existing methods.

[526] Debugging Tabular Log as Dynamic Graphs

Chumeng Liang, Zhanyang Jin, Zahaib Akhtar, Mona Pereira, Haofei Yu, Jiaxuan You

Main category: cs.LG

TL;DR: GraphLogDebugger: A dynamic graph-based framework for debugging tabular logs that outperforms LLMs using simple GNNs.

Details

Motivation: Current tabular log debugging methods overly rely on LLMs and heavy models, suffering from limited flexibility and scalability. Tabular logs capture real-world system updates, but existing approaches don't effectively model the underlying dynamic relationships.

Method: Proposes GraphLogDebugger framework that constructs heterogeneous nodes for objects and events, connects them with edges to create an evolving dynamic graph representation of the system. Uses a simple dynamic Graph Neural Network (GNN) for debugging.

Result: The dynamic graph modeling enables a simple GNN to outperform LLMs in debugging tabular logs. Experimental validation on real-world log datasets of computer systems and academic papers demonstrates effectiveness.

Conclusion: Dynamic graph modeling provides a more flexible and scalable approach to tabular log debugging than LLM-dependent methods, with better performance using simpler models.

Abstract: Tabular log abstracts objects and events in the real-world system and reports their updates to reflect the change of the system, where one can detect real-world inconsistencies efficiently by debugging corresponding log entries. However, recent advances in processing text-enriched tabular log data overly depend on large language models (LLMs) and other heavy-load models, thus suffering from limited flexibility and scalability. This paper proposes a new framework, GraphLogDebugger, to debug tabular log based on dynamic graphs. By constructing heterogeneous nodes for objects and events and connecting node-wise edges, the framework recovers the system behind the tabular log as an evolving dynamic graph. With the help of our dynamic graph modeling, a simple dynamic Graph Neural Network (GNN) is representative enough to outperform LLMs in debugging tabular log, which is validated by experimental results on real-world log datasets of computer systems and academic papers.

[527] Gold Price Prediction Using Long Short-Term Memory and Multi-Layer Perceptron with Gray Wolf Optimizer

Hesam Taghipour, Alireza Rezaee, Farshid Hajati

Main category: cs.LG

TL;DR: Hybrid LSTM-MLP model optimized with Gray Wolf algorithm for gold price forecasting achieves 171% return in 3 months with MAE of $0.21 for daily and $22.23 for monthly predictions.

Details

Motivation: Gold market forecasting is challenging due to complex economic and political relationships, but accurate prediction models would provide significant benefits to financial institutions and investors.

Method: Two LSTM networks handle daily and monthly forecasting, integrated via MLP network. Gray Wolf optimization tunes neuron counts. Uses comprehensive dataset (2010-2021) covering macroeconomic, energy, stocks, and currency data.

Result: Model achieved 171% return in 3-month live trading. Daily closing price MAE: $0.21, monthly price MAE: $22.23. Predicts high, low, and closing prices for both timeframes.

Conclusion: The AI-based LSTM-MLP hybrid model with GWO optimization effectively forecasts gold prices and generates profitable trading strategies, demonstrating practical financial application value.

Abstract: The global gold market, by its fundamentals, has long been home to many financial institutions, banks, governments, funds, and micro-investors. Due to the inherent complexity and relationship between important economic and political components, accurate forecasting of financial markets has always been challenging. Therefore, providing a model that can accurately predict the future of the markets is very important and will be of great benefit to their developers. In this paper, an artificial intelligence-based algorithm for daily and monthly gold forecasting is presented. Two Long short-term memory (LSTM) networks are responsible for daily and monthly forecasting, the results of which are integrated into a Multilayer perceptrons (MLP) network and provide the final forecast of the next day prices. The algorithm forecasts the highest, lowest, and closing prices on the daily and monthly time frame. Based on these forecasts, a trading strategy for live market trading was developed, according to which the proposed model had a return of 171% in three months. Also, the number of internal neurons in each network is optimized by the Gray Wolf optimization (GWO) algorithm based on the least RMSE error. The dataset was collected between 2010 and 2021 and includes data on macroeconomic, energy markets, stocks, and currency status of developed countries. Our proposed LSTM-MLP model predicted the daily closing price of gold with the Mean absolute error (MAE) of $ 0.21 and the next month’s price with $ 22.23.

[528] A Note on Hybrid Online Reinforcement and Imitation Learning for LLMs: Formulations and Algorithms

Yingru Li, Ziniu Li, Jiacai Liu

Main category: cs.LG

TL;DR: A unified LLM fine-tuning framework combining Imitation Learning and Reinforcement Learning through gradient decomposition into dense and sparse components.

Details

Motivation: To create a more efficient and unified approach for LLM fine-tuning that combines the benefits of both imitation learning (for token-level guidance) and reinforcement learning (for long-horizon reward optimization) in a single framework.

Method: The method analyzes the gradient of a composite objective function that combines trajectory-level KL divergence with task rewards. This gradient is decomposed into two components: 1) Dense Gradient for token-level imitation (analytically computable with closed-form logit-level formula), and 2) Sparse Gradient for long-horizon reward optimization (estimated via Monte Carlo methods).

Result: The framework enables efficient GPU implementation through the closed-form logit-level formula for the Dense Gradient component, while maintaining the ability to optimize for long-term rewards through the Sparse Gradient component.

Conclusion: The proposed unified framework successfully integrates imitation and reinforcement learning for LLM fine-tuning, providing both efficient computation and effective long-horizon optimization capabilities.

Abstract: We present a unified framework for Large Language Model (LLM) fine-tuning that integrates Imitation Learning and Reinforcement Learning. By analyzing the gradient of a composite objective combining trajectory-level KL divergence with task rewards, we derive a natural decomposition into two components: (1) an analytically computable Dense Gradient for token-level imitation, and (2) a Monte Carlo estimated Sparse Gradient for long-horizon reward optimization. The Dense Gradient admits a closed-form logit-level formula, enabling efficient GPU implementation.

[529] Communication Compression for Distributed Learning with Aggregate and Server-Guided Feedback

Tomas Ortega, Chun-Yin Huang, Xiaoxiao Li, Hamid Jafarkhani

Main category: cs.LG

TL;DR: Novel compression frameworks CAFe and CAFe-S enable biased compression in federated learning without client-side state, using aggregated updates as shared control variates to reduce communication costs while maintaining privacy.

Details

Motivation: Federated learning faces communication bottlenecks, especially in uplink transmission. Existing biased compression techniques require error feedback with client-specific control variates, which violates privacy and is incompatible with stateless clients in large-scale FL.

Method: Two frameworks: 1) CAFe uses globally aggregated update from previous round as shared control variate for all clients. 2) CAFe-S extends this for servers with private datasets, generating server-guided candidate updates as more accurate predictors. Both enable biased compression without client-side state.

Result: Analytically proved CAFe’s superiority over DCGD with biased compression in non-convex regime with bounded gradient dissimilarity. Proved CAFe-S converges to stationary point with rate improving as server’s data become more representative. Experimental results validate superiority over existing compression schemes.

Conclusion: The proposed frameworks enable efficient biased compression in FL without client-side state, addressing privacy concerns and compatibility with stateless clients while reducing communication costs and maintaining convergence guarantees.

Abstract: Distributed learning, particularly Federated Learning (FL), faces a significant bottleneck in the communication cost, particularly the uplink transmission of client-to-server updates, which is often constrained by asymmetric bandwidth limits at the edge. Biased compression techniques are effective in practice, but require error feedback mechanisms to provide theoretical guarantees and to ensure convergence when compression is aggressive. Standard error feedback, however, relies on client-specific control variates, which violates user privacy and is incompatible with stateless clients common in large-scale FL. This paper proposes two novel frameworks that enable biased compression without client-side state or control variates. The first, Compressed Aggregate Feedback (CAFe), uses the globally aggregated update from the previous round as a shared control variate for all clients. The second, Server-Guided Compressed Aggregate Feedback (CAFe-S), extends this idea to scenarios where the server possesses a small private dataset; it generates a server-guided candidate update to be used as a more accurate predictor. We consider Distributed Gradient Descent (DGD) as a representative algorithm and analytically prove CAFe’s superiority to Distributed Compressed Gradient Descent (DCGD) with biased compression in the non-convex regime with bounded gradient dissimilarity. We further prove that CAFe-S converges to a stationary point, with a rate that improves as the server’s data become more representative. Experimental results in FL scenarios validate the superiority of our approaches over existing compression schemes.

[530] Theoretical Foundations of Scaling Law in Familial Models

Huan Song, Qingfei Zhao, Ting Long, Shuyu Tian, Hongjun An, Jiawei Shao, Chi Zhang, Xuelong Li

Main category: cs.LG

TL;DR: The paper extends neural scaling laws to familial models (early-exit architectures), introducing granularity (G) as a third scaling variable alongside model size (N) and tokens (D), showing minimal performance penalty for deployment flexibility.

Details

Motivation: Traditional neural scaling laws assume single dense model outputs, overlooking familial models that enable ubiquitous intelligence across heterogeneous device-edge-cloud hierarchies through early exits and relay-style inference.

Method: Proposes unified scaling law L(N, D, G) with granularity as third variable, uses IsoFLOP experimental design to isolate architectural impact, systematically sweeps model sizes and granularities while adjusting tokens to decouple marginal costs.

Result: Granularity penalty follows multiplicative power law with extremely small exponent, validating that deployment flexibility can be achieved without compromising compute-optimality of dense baselines.

Conclusion: Theoretical extension bridges fixed-compute training with dynamic architectures, practically validates “train once, deploy many” paradigm for familial models across heterogeneous computing environments.

Abstract: Neural scaling laws have become foundational for optimizing large language model (LLM) training, yet they typically assume a single dense model output. This limitation effectively overlooks “Familial models, a transformative paradigm essential for realizing ubiquitous intelligence across heterogeneous device-edge-cloud hierarchies. Transcending static architectures, familial models integrate early exits with relay-style inference to spawn G deployable sub-models from a single shared backbone. In this work, we theoretically and empirically extend the scaling law to capture this “one-run, many-models” paradigm by introducing Granularity (G) as a fundamental scaling variable alongside model size (N) and training tokens (D). To rigorously quantify this relationship, we propose a unified functional form L(N, D, G) and parameterize it using large-scale empirical runs. Specifically, we employ a rigorous IsoFLOP experimental design to strictly isolate architectural impact from computational scale. Across fixed budgets, we systematically sweep model sizes (N) and granularities (G) while dynamically adjusting tokens (D). This approach effectively decouples the marginal cost of granularity from the benefits of scale, ensuring high-fidelity parameterization of our unified scaling law. Our results reveal that the granularity penalty follows a multiplicative power law with an extremely small exponent. Theoretically, this bridges fixed-compute training with dynamic architectures. Practically, it validates the “train once, deploy many” paradigm, demonstrating that deployment flexibility is achievable without compromising the compute-optimality of dense baselines.

[531] Quantum Generative Models for Computational Fluid Dynamics: A First Exploration of Latent Space Learning in Lattice Boltzmann Simulations

Achraf Hsain, Fouad Mohammed Abbou

Main category: cs.LG

TL;DR: Quantum generative models (QCBM & QGAN) outperform classical LSTM in generating compressed CFD latent representations, with QCBM achieving best results.

Details

Motivation: To explore the application of quantum generative models to learned latent space representations of computational fluid dynamics data, which remains unexplored despite recent quantum approaches for fluid systems.

Method: 1. Generate fluid vorticity fields using GPU-accelerated Lattice Boltzmann Method simulator. 2. Compress data into discrete 7D latent space using Vector Quantized Variational Autoencoder. 3. Compare quantum (QCBM & QGAN) vs classical (LSTM) generative models for modeling the physics-derived latent distribution.

Result: Both quantum models produced samples with lower average minimum distances to the true distribution compared to LSTM baseline, with Quantum Circuit Born Machine achieving the most favorable metrics.

Conclusion: This work establishes: 1) complete open-source pipeline bridging CFD simulation and quantum ML, 2) first empirical study of quantum generative modeling on compressed physics simulation representations, and 3) foundation for future rigorous investigation at this intersection.

Abstract: This paper presents the first application of quantum generative models to learned latent space representations of computational fluid dynamics (CFD) data. While recent work has explored quantum models for learning statistical properties of fluid systems, the combination of discrete latent space compression with quantum generative sampling for CFD remains unexplored. We develop a GPU-accelerated Lattice Boltzmann Method (LBM) simulator to generate fluid vorticity fields, which are compressed into a discrete 7-dimensional latent space using a Vector Quantized Variational Autoencoder (VQ-VAE). The central contribution is a comparative analysis of quantum and classical generative approaches for modeling this physics-derived latent distribution: we evaluate a Quantum Circuit Born Machine (QCBM) and Quantum Generative Adversarial Network (QGAN) against a classical Long Short-Term Memory (LSTM) baseline. Under our experimental conditions, both quantum models produced samples with lower average minimum distances to the true distribution compared to the LSTM, with the QCBM achieving the most favorable metrics. This work provides: (1)~a complete open-source pipeline bridging CFD simulation and quantum machine learning, (2)~the first empirical study of quantum generative modeling on compressed latent representations of physics simulations, and (3)~a foundation for future rigorous investigation at this intersection.

[532] VL-RouterBench: A Benchmark for Vision-Language Model Routing

Zhehao Huang, Baijiong Lin, Jingyuan Zhang, Jingying Wang, Yuhang Liu, Ning Lu, Tao Li, Xiaolin Huang

Main category: cs.LG

TL;DR: VL-RouterBench: A systematic benchmark for evaluating vision-language model routing systems across 14 datasets, 17 models, and 30K+ samples, measuring accuracy, cost, and throughput to assess routing methods.

Details

Motivation: Existing work lacks a systematic, reproducible benchmark for evaluating vision-language model (VLM) routing systems, despite multi-model routing evolving from engineering technique to essential infrastructure.

Method: Constructs quality and cost matrices from raw VLM inference/scoring logs over sample-model pairs. Covers 14 datasets across 3 task groups (30,540 samples), 15 open-source models + 2 API models (519,180 sample-model pairs). Evaluation protocol measures average accuracy, cost, throughput, and uses harmonic mean of normalized cost/accuracy for ranking.

Result: Evaluated 10 routing methods and baselines, showing significant routability gain. However, best current routers still show clear gap to ideal Oracle, indicating room for improvement through finer visual cues and textual structure modeling.

Conclusion: VL-RouterBench provides systematic evaluation framework for VLM routing. Will open-source data construction and evaluation toolchain to promote comparability, reproducibility, and practical deployment in multimodal routing research.

Abstract: Multi-model routing has evolved from an engineering technique into essential infrastructure, yet existing work lacks a systematic, reproducible benchmark for evaluating vision-language models (VLMs). We present VL-RouterBench to assess the overall capability of VLM routing systems systematically. The benchmark is grounded in raw inference and scoring logs from VLMs and constructs quality and cost matrices over sample-model pairs. In scale, VL-RouterBench covers 14 datasets across 3 task groups, totaling 30,540 samples, and includes 15 open-source models and 2 API models, yielding 519,180 sample-model pairs and a total input-output token volume of 34,494,977. The evaluation protocol jointly measures average accuracy, average cost, and throughput, and builds a ranking score from the harmonic mean of normalized cost and accuracy to enable comparison across router configurations and cost budgets. On this benchmark, we evaluate 10 routing methods and baselines and observe a significant routability gain, while the best current routers still show a clear gap to the ideal Oracle, indicating considerable room for improvement in router architecture through finer visual cues and modeling of textual structure. We will open-source the complete data construction and evaluation toolchain to promote comparability, reproducibility, and practical deployment in multimodal routing research.

[533] Beyond Centralization: Provable Communication Efficient Decentralized Multi-Task Learning

Donghwa Kang, Shana Moothedath

Main category: cs.LG

TL;DR: Decentralized multi-task representation learning with low-rank feature structure, where communication cost is independent of target accuracy.

Details

Motivation: While centralized representation learning is well-studied, decentralized methods remain underexplored. There's a need for efficient decentralized approaches that can handle data-scarce environments where tasks share common low-rank features, with data distributed across nodes in communication-constrained networks.

Method: Proposed a new alternating projected gradient and minimization algorithm for decentralized multi-task representation learning. The method assumes features share a low-rank structure and tasks follow linear models with task-specific parameters. The algorithm operates in a decentralized setting where task data is distributed across nodes with constrained communication networks.

Result: The algorithm provides provable accuracy guarantees with comprehensive characterizations of time, communication, and sample complexities. Key result: communication complexity is independent of target accuracy, significantly reducing communication cost compared to prior methods. Numerical simulations validate theoretical analysis across different dimensions and network topologies.

Conclusion: Decentralized learning with the proposed algorithm can outperform centralized federated approaches in certain regimes, offering communication-efficient solutions for multi-task representation learning in data-scarce distributed environments with low-rank feature structures.

Abstract: Representation learning is a widely adopted framework for learning in data-scarce environments, aiming to extract common features from related tasks. While centralized approaches have been extensively studied, decentralized methods remain largely underexplored. We study decentralized multi-task representation learning in which the features share a low-rank structure. We consider multiple tasks, each with a finite number of data samples, where the observations follow a linear model with task-specific parameters. In the decentralized setting, task data are distributed across multiple nodes, and information exchange between nodes is constrained by a communication network. The goal is to recover the underlying feature matrix whose rank is much smaller than both the parameter dimension and the number of tasks. We propose a new alternating projected gradient and minimization algorithm with provable accuracy guarantees. We provide comprehensive characterizations of the time, communication, and sample complexities. Importantly, the communication complexity is independent of the target accuracy, which significantly reduces communication cost compared to prior methods. Numerical simulations validate the theoretical analysis across different dimensions and network topologies, and demonstrate regimes in which decentralized learning outperforms centralized federated approaches.

[534] Learning with the $p$-adics

André F. T. Martins

Main category: cs.LG

TL;DR: The paper proposes using p-adic numbers (ℚₚ) instead of real numbers (ℝ) as an alternative mathematical foundation for machine learning frameworks, exploring their potential for hierarchical representation learning and code theory.

Details

Motivation: Current ML frameworks operate over real numbers with Euclidean/Hilbert spaces, but the authors question whether this is the only viable choice. They explore p-adic numbers as an alternative due to their ultrametric, non-archimedean properties and hierarchical structure that could be beneficial for code theory and hierarchical representation learning.

Method: Theoretical exploration establishing building blocks for classification, regression, and representation learning with p-adic numbers. Development of learning models and algorithms using p-adic spaces. Demonstration of representing Quillian semantic networks as compact p-adic linear networks.

Result: Shows that p-adic spaces enable representation of Quillian semantic networks as compact linear networks, a construction not possible with real numbers. Provides foundational theoretical framework for p-adic ML.

Conclusion: P-adic numbers offer a promising alternative to real numbers for ML frameworks, particularly for hierarchical representation learning and code theory. The work opens new research directions and identifies open problems for future exploration in this novel framework.

Abstract: Existing machine learning frameworks operate over the field of real numbers ($\mathbb{R}$) and learn representations in real (Euclidean or Hilbert) vector spaces (e.g., $\mathbb{R}^d$). Their underlying geometric properties align well with intuitive concepts such as linear separability, minimum enclosing balls, and subspace projection; and basic calculus provides a toolbox for learning through gradient-based optimization. But is this the only possible choice? In this paper, we study the suitability of a radically different field as an alternative to $\mathbb{R}$ – the ultrametric and non-archimedean space of $p$-adic numbers, $\mathbb{Q}_p$. The hierarchical structure of the $p$-adics and their interpretation as infinite strings make them an appealing tool for code theory and hierarchical representation learning. Our exploratory theoretical work establishes the building blocks for classification, regression, and representation learning with the $p$-adics, providing learning models and algorithms. We illustrate how simple Quillian semantic networks can be represented as a compact $p$-adic linear network, a construction which is not possible with the field of reals. We finish by discussing open problems and opportunities for future research enabled by this new framework.

[535] Training AI Co-Scientists Using Rubric Rewards

Shashwat Goel, Rishi Hazra, Dulhan Jayalath, Timon Willi, Parag Jain, William F. Shen, Ilias Leontiadis, Francesco Barbieri, Yoram Bachrach, Jonas Geiping, Chenxi Whitehouse

Main category: cs.LG

TL;DR: Researchers develop AI co-scientists that generate better research plans using reinforcement learning with self-grading, trained on automatically extracted goals and rubrics from research papers across domains.

Details

Motivation: Current language models struggle to generate research plans that follow all constraints and implicit requirements, limiting their effectiveness as AI co-scientists for human researchers.

Method: Build scalable training corpus by automatically extracting research goals and goal-specific grading rubrics from papers across domains. Train models via reinforcement learning with self-grading, using a frozen copy of initial policy as grader to create generator-verifier gap.

Result: Human experts prefer plans from finetuned Qwen3-30B-A3B model over initial model for 70% of research goals, approve 84% of automatically extracted rubrics. Finetuning yields 12-22% relative improvements with significant cross-domain generalization to medical research and arXiv preprints.

Conclusion: Demonstrates potential of scalable, automated training recipe for improving general AI co-scientists, effective even in domains like medical research where execution feedback is infeasible.

Abstract: AI co-scientists are emerging as a tool to assist human researchers in achieving their research goals. A crucial feature of these AI co-scientists is the ability to generate a research plan given a set of aims and constraints. The plan may be used by researchers for brainstorming, or may even be implemented after further refinement. However, language models currently struggle to generate research plans that follow all constraints and implicit requirements. In this work, we study how to leverage the vast corpus of existing research papers to train language models that generate better research plans. We build a scalable, diverse training corpus by automatically extracting research goals and goal-specific grading rubrics from papers across several domains. We then train models for research plan generation via reinforcement learning with self-grading. A frozen copy of the initial policy acts as the grader during training, with the rubrics creating a generator-verifier gap that enables improvements without external human supervision. To validate this approach, we conduct a study with human experts for machine learning research goals, spanning 225 hours. The experts prefer plans generated by our finetuned Qwen3-30B-A3B model over the initial model for 70% of research goals, and approve 84% of the automatically extracted goal-specific grading rubrics. To assess generality, we also extend our approach to research goals from medical papers, and new arXiv preprints, evaluating with a jury of frontier models. Our finetuning yields 12-22% relative improvements and significant cross-domain generalization, proving effective even in problem settings like medical research where execution feedback is infeasible. Together, these findings demonstrate the potential of a scalable, automated training recipe as a step towards improving general AI co-scientists.

[536] Predictive Modeling of Power Outages during Extreme Events: Integrating Weather and Socio-Economic Factors

Antar Kumar Biswas, Masoud H. Nazari

Main category: cs.LG

TL;DR: A machine learning framework predicts power outages from extreme events using EAGLE-I outage data (2014-2024) combined with weather, socio-economic, infrastructure, and seasonal features, with LSTM achieving best performance.

Details

Motivation: To predict low-probability, high-consequence power outages caused by extreme events by understanding community vulnerability patterns and outage risk factors.

Method: Integrates EAGLE-I outage records with weather, socio-economic, infrastructure, and seasonal data. Evaluates four ML models: Random Forest, SVM, AdaBoost, and LSTM on Michigan county data.

Result: LSTM achieves lowest prediction error. Stronger economic conditions and more developed infrastructure correlate with lower outage occurrence.

Conclusion: The learning-based framework effectively predicts extreme event outages, with LSTM performing best and socio-economic/infrastructure factors being important predictors.

Abstract: This paper presents a novel learning-based framework for predicting power outages caused by extreme events. The proposed approach specifically targets low-probability, high-consequence outage scenarios and leverages a comprehensive set of features derived from publicly available data sources. We integrate EAGLE-I outage records (2014-2024) with weather, socio-economic, infrastructure, and seasonal event data. Incorporating social and demographic indicators reveals underlying patterns of community vulnerability and provides a clearer understanding of outage risk during extreme conditions. Four machine learning models (Random Forest (RF), Support Vector Machine (SVM), Adaptive Boosting (AdaBoost), and Long Short-Term Memory (LSTM)) are evaluated. Experimental validation is performed on a large-scale dataset covering counties in the lower peninsula of Michigan. Among all models tested, the LSTM network achieves the lowest prediction error. Additionally, the results demonstrate that stronger economic conditions and more developed infrastructure are associated with lower outage occurrence.

[537] What Matters in Deep Learning for Time Series Forecasting?

Valentina Moretti, Andrea Cini, Ivan Marisca, Cesare Alippi

Main category: cs.LG

TL;DR: The paper critiques current deep learning time series forecasting practices, arguing that foundational design principles like locality/globality are more important than specific architectures, and proposes a model card to standardize benchmarking.

Details

Motivation: To address the confusion in the deep learning time series forecasting field where many new architectures with contradictory results make it difficult to understand what components actually contribute to performance. The authors want to bring clarity to the design space and improve benchmarking practices.

Method: The paper analyzes current forecasting architectures through the lens of design principles like locality and globality, examines implementation details that affect results, and proposes a systematic approach to characterize architectures using an auxiliary forecasting model card with key design choice fields.

Result: The authors show that accounting for locality/globality aspects is more important than specific sequence modeling layers, and that simple, well-designed architectures can match state-of-the-art performance. They demonstrate how overlooked implementation details fundamentally change forecasting methods and affect empirical results.

Conclusion: Current benchmarking practices in time series forecasting are flawed and need rethinking. Researchers should focus on foundational aspects of forecasting problems when designing architectures rather than chasing specific architectural innovations. The proposed model card provides a step toward more systematic evaluation and characterization of forecasting methods.

Abstract: Deep learning models have grown increasingly popular in time series applications. However, the large quantity of newly proposed architectures, together with often contradictory empirical results, makes it difficult to assess which components contribute significantly to final performance. We aim to make sense of the current design space of deep learning architectures for time series forecasting by discussing the design dimensions and trade-offs that can explain, often unexpected, observed results. This paper discusses the necessity of grounding model design on principles for forecasting groups of time series and how such principles can be applied to current models. In particular, we assess how concepts such as locality and globality apply to recent forecasting architectures. We show that accounting for these aspects can be more relevant for achieving accurate results than adopting specific sequence modeling layers and that simple, well-designed forecasting architectures can often match the state of the art. We discuss how overlooked implementation details in existing architectures (1) fundamentally change the class of the resulting forecasting method and (2) drastically affect the observed empirical results. Our results call for rethinking current faulty benchmarking practices and the need to focus on the foundational aspects of the forecasting problem when designing architectures. As a step in this direction, we propose an auxiliary forecasting model card, whose fields serve to characterize existing and new forecasting architectures based on key design choices.

[538] FoldAct: Efficient and Stable Context Folding for Long-Horizon Search Agents

Jiaqi Shao, Yufeng Miao, Wei Zhang, Bing Luo

Main category: cs.LG

TL;DR: FoldAct addresses RL challenges in long-horizon language model agents with context folding by separating gradient signals, maintaining context consistency, and using selective training for stable, efficient training.

Details

Motivation: Existing context folding methods for long-horizon RL in LLMs treat summary actions as standard actions, creating policy-dependent non-stationary observation distributions that violate RL assumptions and cause gradient dilution, self-conditioning collapse, and high computational costs.

Method: FoldAct introduces three innovations: 1) separated loss computation for independent gradient signals on summary and action tokens, 2) full context consistency loss to reduce distribution shift, and 3) selective segment training to reduce computational cost.

Result: The method enables stable training of long-horizon search agents with context folding, addressing the non-stationary observation problem while achieving 5.19× training speedup.

Conclusion: FoldAct provides a principled framework for addressing fundamental challenges in RL with context folding, enabling scalable long-horizon language model agents through explicit handling of non-stationary observation distributions and computational efficiency improvements.

Abstract: Long-horizon reinforcement learning (RL) for large language models faces critical scalability challenges from unbounded context growth, leading to context folding methods that compress interaction history during task execution. However, existing approaches treat summary actions as standard actions, overlooking that summaries fundamentally modify the agent’s future observation space, creating a policy-dependent, non-stationary observation distribution that violates core RL assumptions. This introduces three fundamental challenges: (1) gradient dilution where summary tokens receive insufficient training signal, (2) self-conditioning where policy updates change summary distributions, creating a vicious cycle of training collapse, and (3) computational cost from processing unique contexts at each turn. We introduce \textbf{FoldAct}\footnote{https://github.com/SHAO-Jiaqi757/FoldAct}, a framework that explicitly addresses these challenges through three key innovations: separated loss computation for independent gradient signals on summary and action tokens, full context consistency loss to reduce distribution shift, and selective segment training to reduce computational cost. Our method enables stable training of long-horizon search agents with context folding, addressing the non-stationary observation problem while improving training efficiency with 5.19$\times$ speedup.

[539] When Does Multi-Task Learning Fail? Quantifying Data Imbalance and Task Independence in Metal Alloy Property Prediction

Sungwoo Kang

Main category: cs.LG

TL;DR: MTL degrades regression but improves classification for alloy properties; near-zero inter-task weights suggest property independence; recommend separate models for regression, MTL for classification.

Details

Motivation: Test whether multi-task learning (MTL) can leverage shared underlying physics between related material properties (electrical resistivity, Vickers hardness, amorphous-forming ability) for better predictions in alloy materials.

Method: Used 54,028 alloy samples to simultaneously predict three properties. Compared single-task models against standard and structured MTL approaches. Analyzed inter-task weights and data characteristics.

Result: Striking dichotomy: MTL significantly degraded regression performance (resistivity R²: 0.897→0.844; hardness R²: 0.832→0.694) but improved classification (amorphous F1: 0.703→0.744; recall +17%). Near-zero inter-task weights indicated property independence. Regression failure attributed to negative transfer from severe data imbalance (52k vs. 800 samples).

Conclusion: Recommend independent models for precise regression tasks, while reserving MTL for classification tasks where recall improvement is critical. Properties appear independent despite assumed shared physics.

Abstract: Multi-task learning (MTL) assumes related material properties share underlying physics that can be leveraged for better predictions. We test this by simultaneously predicting electrical resistivity, Vickers hardness, and amorphous-forming ability using 54,028 alloy samples. We compare single-task models against standard and structured MTL. Results reveal a striking dichotomy: MTL significantly degrades regression performance (resistivity $R^2$: 0.897 $\to$ 0.844; hardness $R^2$: 0.832 $\to$ 0.694, $p < 0.01$) but improves classification (amorphous F1: 0.703 $\to$ 0.744, $p < 0.05$; recall +17%). Analysis shows near-zero inter-task weights, indicating property independence. Regression failure is attributed to negative transfer caused by severe data imbalance (52k vs. 800 samples). We recommend independent models for precise regression, while reserving MTL for classification tasks where recall is critical.

[540] Bridging Global Intent with Local Details: A Hierarchical Representation Approach for Semantic Validation in Text-to-SQL

Rihong Qiu, Zhibang Yang, Xinke Jiang, Weibin Liao, Xin Gao, Xu Chu, Junfeng Zhao, Yasha Wang

Main category: cs.LG

TL;DR: HEROSQL introduces a hierarchical SQL representation combining global intent (Logical Plans) and local details (ASTs) with Nested Message Passing Neural Networks for semantic validation in Text-to-SQL systems, achieving significant improvements in detecting semantic inconsistencies.

Details

Motivation: Existing Text-to-SQL validation approaches focus mainly on syntactic correctness, with few addressing semantic validation (detecting misalignments between questions and SQL). Effective semantic validation faces challenges in capturing both global user intent and SQL structural details, and constructing high-quality fine-grained sub-SQL annotations.

Method: HEROSQL uses hierarchical SQL representation integrating global intent via Logical Plans (LPs) and local details via Abstract Syntax Trees (ASTs). It employs Nested Message Passing Neural Network (NMPNN) to capture relational information in SQL and aggregate schema-guided semantics across LPs and ASTs. Also proposes AST-driven sub-SQL augmentation strategy for generating high-quality negative samples.

Result: Outperforms existing state-of-the-art methods on Text-to-SQL validation benchmarks (both in-domain and out-of-domain), achieving average 9.40% improvement of AUPRC and 12.35% of AUROC in identifying semantic inconsistencies. Excels at detecting fine-grained semantic errors and provides more granular feedback to large language models.

Conclusion: HEROSQL enhances reliability and interpretability of data querying platforms by effectively addressing semantic validation challenges in Text-to-SQL systems through hierarchical representation and neural message passing, ultimately improving system reliability and providing better feedback mechanisms.

Abstract: Text-to-SQL translates natural language questions into SQL statements grounded in a target database schema. Ensuring the reliability and executability of such systems requires validating generated SQL, but most existing approaches focus only on syntactic correctness, with few addressing semantic validation (detecting misalignments between questions and SQL). As a consequence, effective semantic validation still faces two key challenges: capturing both global user intent and SQL structural details, and constructing high-quality fine-grained sub-SQL annotations. To tackle these, we introduce HEROSQL, a hierarchical SQL representation approach that integrates global intent (via Logical Plans, LPs) and local details (via Abstract Syntax Trees, ASTs). To enable better information propagation, we employ a Nested Message Passing Neural Network (NMPNN) to capture inherent relational information in SQL and aggregate schema-guided semantics across LPs and ASTs. Additionally, to generate high-quality negative samples, we propose an AST-driven sub-SQL augmentation strategy, supporting robust optimization of fine-grained semantic inconsistencies. Extensive experiments conducted on Text-to-SQL validation benchmarks (both in-domain and out-of-domain settings) demonstrate that our approach outperforms existing state-of-the-art methods, achieving an average 9.40% improvement of AUPRC and 12.35% of AUROC in identifying semantic inconsistencies. It excels at detecting fine-grained semantic errors, provides large language models with more granular feedback, and ultimately enhances the reliability and interpretability of data querying platforms.

[541] From Confounding to Learning: Dynamic Service Fee Pricing on Third-Party Platforms

Rui Ai, David Simchi-Levi, Feng Zhu

Main category: cs.LG

TL;DR: Platforms face strategic agents and must learn demand from equilibrium prices/quantities, developing optimal regret algorithm with phase transition based on supply noise.

Details

Motivation: Third-party platforms need to set optimal prices but face strategic agents and can only observe equilibrium outcomes, creating a demand learning problem under confounding.

Method: Develop algorithm using non-i.i.d. actions as instrumental variables, novel homeomorphic construction for estimation bounds without star-shapedness, and deep neural networks for demand learning.

Result: Achieve optimal regret of $\Tilde{\cO}(\sqrt{T}\wedgeσ_S^{-2})$, show supply-side noise causes phase transition in regret, and demonstrate practical applicability with simulations and real-world data.

Conclusion: Strategic agents and supply noise fundamentally affect demand learnability, but non-i.i.d. actions can serve as instrumental variables, enabling efficient demand learning with deep neural networks.

Abstract: We study the pricing behavior of third-party platforms facing strategic agents. Assuming the platform is a revenue maximizer, it observes market features that generally affect demand. Since only the equilibrium price and quantity are observable, this presents a general demand learning problem under confounding. Mathematically, we develop an algorithm with optimal regret of $\Tilde{\cO}(\sqrt{T}\wedgeσ_S^{-2})$. Our results reveal that supply-side noise fundamentally affects the learnability of demand, leading to a phase transition in regret. Technically, we show that non-i.i.d. actions can serve as instrumental variables for learning demand. We also propose a novel homeomorphic construction that allows us to establish estimation bounds without assuming star-shapedness, providing the first efficiency guarantee for learning demand with deep neural networks. Finally, we demonstrate the practical applicability of our approach through simulations and real-world data from Zomato and Lyft.

[542] A Micro-Macro Machine Learning Framework for Predicting Childhood Obesity Risk Using NHANES and Environmental Determinants

Eswarasanthosh Kumar Mamillapalli, Nishtha Sharma

Main category: cs.LG

TL;DR: A micro-macro ML framework integrates individual health data with environmental factors to predict childhood obesity, showing strong geographic alignment between environmental vulnerability and obesity risk.

Details

Motivation: Childhood obesity is influenced by multiple levels of risk factors, but traditional studies analyze these levels independently, limiting understanding of how structural environmental conditions interact with individual characteristics to influence health outcomes.

Method: Developed a micro-macro machine learning framework integrating: (1) individual-level anthropometric/socioeconomic data from NHANES, (2) macro-level structural environment features from USDA/EPA datasets. Trained four ML models (Logistic Regression, Random Forest, XGBoost, LightGBM) to predict obesity. Constructed a composite environmental vulnerability index (EnvScore) using normalized indicators at state level.

Result: XGBoost achieved strongest performance for obesity prediction. Multi-level comparison revealed strong geographic similarity between states with high environmental burden and nationally predicted micro-level obesity risk distribution, demonstrating feasibility of integrating multi-scale datasets.

Conclusion: The work contributes a scalable, data-driven, multi-level modeling pipeline for public health informatics with strong potential for expansion into causal modeling, intervention planning, and real-time analytics to identify environment-driven disparities in obesity risk.

Abstract: Childhood obesity remains a major public health challenge in the United States, strongly influenced by a combination of individual-level, household-level, and environmental-level risk factors. Traditional epidemiological studies typically analyze these levels independently, limiting insights into how structural environmental conditions interact with individual-level characteristics to influence health outcomes. In this study, we introduce a micro-macro machine learning framework that integrates (1) individual-level anthropometric and socioeconomic data from NHANES and (2) macro-level structural environment features, including food access, air quality, and socioeconomic vulnerability extracted from USDA and EPA datasets. Four machine learning models Logistic Regression, Random Forest, XGBoost, and LightGBM were trained to predict obesity using NHANES microdata. XGBoost achieved the strongest performance. A composite environmental vulnerability index (EnvScore) was constructed using normalized indicators from USDA and EPA at the state level. Multi-level comparison revealed strong geographic similarity between states with high environmental burden and the nationally predicted micro-level obesity risk distribution. This demonstrates the feasibility of integrating multi-scale datasets to identify environment-driven disparities in obesity risk. This work contributes a scalable, data-driven, multi-level modeling pipeline suitable for public health informatics, demonstrating strong potential for expansion into causal modeling, intervention planning, and real-time analytics.

[543] Understanding the Mechanisms of Fast Hyperparameter Transfer

Nikhil Ghosh, Denny Wu, Alberto Bietti

Main category: cs.LG

TL;DR: The paper develops a framework for analyzing hyperparameter transfer across model scales, showing that fast transfer (where transfer-induced suboptimality vanishes faster than finite-scale performance gap) is equivalent to useful transfer for compute-optimal grid search, and explains μP’s fast transfer through a decomposition of optimization trajectories.

Details

Motivation: Standard hyperparameter optimization becomes prohibitively expensive for large deep learning models. The paper aims to understand the principles behind scale-aware hyperparameter transfer strategies that allow transferring optimal HPs from small-scale grid searches to large models with minimal performance loss.

Method: Develops a conceptual framework for reasoning about HP transfer across scale, defines “fast transfer” (transfer-induced suboptimality vanishes asymptotically faster than finite-scale performance gap), proves equivalence between fast transfer and useful transfer for compute-optimal grid search, analyzes μP’s transfer properties, and proposes a decomposition hypothesis of optimization trajectories into width-stable and width-sensitive components.

Result: Shows that fast transfer is equivalent to useful transfer for compute-optimal grid search, meaning transfer is asymptotically more compute-efficient than direct tuning. Demonstrates that μP’s fast transfer depends on problem structure, with synthetic examples showing both successful and failed transfer. Provides empirical evidence supporting the decomposition hypothesis across various settings including large language model pretraining.

Conclusion: The paper provides a theoretical framework for understanding hyperparameter transfer across model scales, showing when and why transfer strategies work, and offers an explanation for μP’s observed fast transfer through decomposition of optimization trajectories into stable and sensitive components.

Abstract: The growing scale of deep learning models has rendered standard hyperparameter (HP) optimization prohibitively expensive. A promising solution is the use of scale-aware hyperparameters, which can enable direct transfer of optimal HPs from small-scale grid searches to large models with minimal performance loss. To understand the principles governing such transfer strategy, we develop a general conceptual framework for reasoning about HP transfer across scale, characterizing transfer as fast when the suboptimality it induces vanishes asymptotically faster than the finite-scale performance gap. We show formally that fast transfer is equivalent to useful transfer for compute-optimal grid search, meaning that transfer is asymptotically more compute-efficient than direct tuning. While empirical work has found that the Maximal Update Parameterization ($μ$P) exhibits fast transfer when scaling model width, the mechanisms remain poorly understood. We show that this property depends critically on problem structure by presenting synthetic settings where transfer either offers provable computational advantage or fails to outperform direct tuning even under $μ$P. To explain the fast transfer observed in practice, we conjecture that decomposing the optimization trajectory reveals two contributions to loss reduction: (1) a width-stable component that determines the optimal HPs, and (2) a width-sensitive component that improves with width but weakly perturbs the HP optimum. We present empirical evidence for this hypothesis across various settings, including large language model pretraining.

[544] GRExplainer: A Universal Explanation Method for Temporal Graph Neural Networks

Xuyan Li, Jie Wang, Zheng Yan

Main category: cs.LG

TL;DR: GRExplainer is a universal, efficient, and user-friendly explanation method for Temporal Graph Neural Networks that addresses limitations of existing TGNN explainability approaches.

Details

Motivation: Current TGNN explanation methods have three key issues: they are tailored to specific TGNN types (lacking generality), have high computational costs (unsuitable for large-scale networks), and overlook structural connectivity while requiring prior knowledge (reducing user-friendliness).

Method: GRExplainer extracts node sequences as a unified feature representation, making it applicable to both snapshot-based and event-based TGNNs. It uses breadth-first search and temporal information to construct input node sequences for efficiency, and employs a generative model based on RNNs for automated, continuous explanation generation.

Result: Experiments on six real-world datasets with three target TGNNs show that GRExplainer outperforms existing baseline methods in generality, efficiency, and user-friendliness.

Conclusion: GRExplainer successfully addresses the key limitations of current TGNN explanation methods by providing a universal, efficient, and user-friendly solution that works across different TGNN types and scales effectively to real-world networks.

Abstract: Dynamic graphs are widely used to represent evolving real-world networks. Temporal Graph Neural Networks (TGNNs) have emerged as a powerful tool for processing such graphs, but the lack of transparency and explainability limits their practical adoption. Research on TGNN explainability is still in its early stages and faces several key issues: (i) Current methods are tailored to specific TGNN types, restricting generality. (ii) They suffer from high computational costs, making them unsuitable for large-scale networks. (iii) They often overlook the structural connectivity of explanations and require prior knowledge, reducing user-friendliness. To address these issues, we propose GRExplainer, the first universal, efficient, and user-friendly explanation method for TGNNs. GRExplainer extracts node sequences as a unified feature representation, making it independent of specific input formats and thus applicable to both snapshot-based and event-based TGNNs (the major types of TGNNs). By utilizing breadth-first search and temporal information to construct input node sequences, GRExplainer reduces redundant computation and improves efficiency. To enhance user-friendliness, we design a generative model based on Recurrent Neural Networks (RNNs), enabling automated and continuous explanation generation. Experiments on six real-world datasets with three target TGNNs show that GRExplainer outperforms existing baseline methods in generality, efficiency, and user-friendliness.

[545] Adapting, Fast and Slow: Transportable Circuits for Few-Shot Learning

Kasra Jalaldoust, Elias Bareinboim

Main category: cs.LG

TL;DR: Circuit-TR enables zero-shot compositional generalization using causal graphs and discrepancy oracles to transport modular predictors across domains.

Details

Motivation: Generalization across domains requires structured constraints on unseen target domains relative to source domains, addressing the challenge of zero-shot compositional generalization.

Method: Uses causal transportability theory with qualitative domain knowledge (causal graphs for intra-domain structure, discrepancy oracles for inter-domain mechanism sharing). Learns modules from source data and transports/composes them into circuits for target domain prediction when causal structure permits. Also develops supervised domain adaptation without explicit causal structure using limited target data.

Result: Theoretical characterization of few-shot learnable tasks via graphical circuit transportability criteria, connecting few-shot generalizability with circuit size complexity. Controlled simulations validate theoretical results.

Conclusion: Circuit transportability provides a principled framework for zero-shot and few-shot generalization across domains by leveraging causal structure and modular composition.

Abstract: Generalization across the domains is not possible without asserting a structure that constrains the unseen target domain w.r.t. the source domain. Building on causal transportability theory, we design an algorithm for zero-shot compositional generalization which relies on access to qualitative domain knowledge in form of a causal graph for intra-domain structure and discrepancies oracle for inter-domain mechanism sharing. \textit{Circuit-TR} learns a collection of modules (i.e., local predictors) from the source data, and transport/compose them to obtain a circuit for prediction in the target domain if the causal structure licenses. Furthermore, circuit transportability enables us to design a supervised domain adaptation scheme that operates without access to an explicit causal structure, and instead uses limited target data. Our theoretical results characterize classes of few-shot learnable tasks in terms of graphical circuit transportability criteria, and connects few-shot generalizability with the established notion of circuit size complexity; controlled simulations corroborate our theoretical results.

[546] Schrodinger AI: A Unified Spectral-Dynamical Framework for Classification, Reasoning, and Operator-Based Generalization

Truong Son Nguyen

Main category: cs.LG

TL;DR: Schrödinger AI is a quantum mechanics-inspired ML framework with three components: wave-energy solver for perception, dynamical solver for time evolution, and operator calculus for symbolic transformations, offering physics-driven alternative to conventional ML.

Details

Motivation: To create a unified machine learning framework inspired by quantum mechanics that provides robust generalization, interpretable semantics, and emergent topology as an alternative to conventional cross-entropy training and transformer attention.

Method: Three tightly coupled components: 1) Time-independent wave-energy solver treating perception/classification as spectral decomposition under learned Hamiltonian; 2) Time-dependent dynamical solver governing semantic wavefunction evolution for context-aware decision making; 3) Low-rank operator calculus learning symbolic transformations through quantum-like transition operators.

Result: Demonstrates: a) Emergent semantic manifolds reflecting human-conceived class relations without supervision; b) Dynamic reasoning adapting to changing environments (maze navigation with real-time perturbations); c) Exact operator generalization on modular arithmetic tasks, learning group actions and composing them beyond training length.

Conclusion: Suggests a new foundational direction for ML where learning is cast as discovering and navigating an underlying semantic energy landscape, providing physics-driven alternative with robust generalization and interpretable semantics.

Abstract: We introduce \textbf{Schrödinger AI}, a unified machine learning framework inspired by quantum mechanics. The system is defined by three tightly coupled components: (1) a {time-independent wave-energy solver} that treats perception and classification as spectral decomposition under a learned Hamiltonian; (2) a {time-dependent dynamical solver} governing the evolution of semantic wavefunctions over time, enabling context-aware decision revision, re-routing, and reasoning under environmental changes; and (3) a {low-rank operator calculus} that learns symbolic transformations such as modular arithmetic through learned quantum-like transition operators. Together, these components form a coherent physics-driven alternative to conventional cross-entropy training and transformer attention, providing robust generalization, interpretable semantics, and emergent topology. Empirically, Schrödinger AI demonstrates: (a) emergent semantic manifolds that reflect human-conceived class relations without explicit supervision; (b) dynamic reasoning that adapts to changing environments, including maze navigation with real-time potential-field perturbations; and (c) exact operator generalization on modular arithmetic tasks, where the system learns group actions and composes them across sequences far beyond training length. These results suggest a new foundational direction for machine learning, where learning is cast as discovering and navigating an underlying semantic energy landscape.

[547] SNM-Net: A Universal Framework for Robust Open-Set Gas Recognition via Spherical Normalization and Mahalanobis Distance

Shuai Chen, Chen Wang, Ziran Wang

Main category: cs.LG

TL;DR: SNM-Net is a universal deep learning framework for open-set gas recognition that addresses signal drift and unknown interference through geometric decoupling and Mahalanobis distance scoring, achieving state-of-the-art performance on E-nose systems.

Details

Motivation: Electronic nose systems face dual challenges: feature distribution shifts from signal drift and decision failures from unknown interference. Existing methods using Euclidean distance fail to account for anisotropic gas feature distributions and dynamic signal intensity variations.

Method: Proposes SNM-Net with geometric decoupling via cascaded batch normalization and L2 normalization to project features onto a unit hypersphere, eliminating intensity fluctuations. Uses Mahalanobis distance as scoring mechanism with class-wise statistics to construct adaptive ellipsoidal decision boundaries. Architecture-agnostic and works with CNN, RNN, and Transformer backbones.

Result: Transformer+SNM achieves near-theoretical performance: AUROC of 0.9977 and unknown gas detection rate of 99.57% (TPR at 5% FPR). Outperforms state-of-the-art with 3.0% AUROC improvement and 91.0% reduction in standard deviation compared to Class Anchor Clustering. Exceptional robustness across sensor positions with standard deviations below 0.0028.

Conclusion: SNM-Net effectively resolves the trade-off between accuracy and stability in open-set gas recognition, providing a solid technical foundation for industrial E-nose deployment by addressing signal drift and unknown interference through geometric decoupling and adaptive decision boundaries.

Abstract: Electronic nose (E-nose) systems face dual challenges in open-set gas recognition: feature distribution shifts caused by signal drift and decision failures induced by unknown interference. Existing methods predominantly rely on Euclidean distance, failing to adequately account for anisotropic gas feature distributions and dynamic signal intensity variations. To address these issues, this study proposes SNM-Net, a universal deep learning framework for open-set gas recognition. The core innovation lies in a geometric decoupling mechanism achieved through cascaded batch normalization and L2 normalization, which projects high-dimensional features onto a unit hypersphere to eliminate signal intensity fluctuations. Additionally, Mahalanobis distance is introduced as the scoring mechanism, utilizing class-wise statistics to construct adaptive ellipsoidal decision boundaries. SNM-Net is architecture-agnostic and seamlessly integrates with CNN, RNN, and Transformer backbones. Systematic experiments on the Vergara dataset demonstrate that the Transformer+SNM configuration attains near-theoretical performance, achieving an AUROC of 0.9977 and an unknown gas detection rate of 99.57% (TPR at 5% FPR). This performance significantly outperforms state-of-the-art methods, showing a 3.0% improvement in AUROC and a 91.0% reduction in standard deviation compared to Class Anchor Clustering. The framework exhibits exceptional robustness across sensor positions with standard deviations below 0.0028. This work effectively resolves the trade-off between accuracy and stability, providing a solid technical foundation for industrial E-nose deployment.

[548] Discovering Transmission Dynamics of COVID-19 in China

Zhou Yang, Edward Dougherty, Chen Zhang, Zhenhe Pan, Fang Jin

Main category: cs.LG

TL;DR: Analysis of China’s COVID-19 transmission patterns using public tracking data reveals regional differences, rapid hospitalization of symptomatic cases, and shifting infection sources from travel-related to social activities over time.

Details

Motivation: To identify effective public health interventions by comprehensively analyzing SARS-CoV-2 transmission patterns in China through retrospective study of tracking data, which can inform future pandemic response strategies.

Method: Collected case reports from local health commissions, Chinese CDC, and official government social media; applied NLP and manual curation to construct transmission/tracking chains; analyzed tracking data with Wuhan population mobility data to quantify temporal and spatial spread dynamics.

Result: Substantial regional differences with larger cities showing more infections driven by social activities; 79% of symptomatic individuals hospitalized within 5 days of symptom onset; confirmed-case contacts sought admission in under 5 days; infection sources shifted from travel-related (Hubei Province) early on to social activities later.

Conclusion: The study demonstrates how comprehensive analysis of transmission patterns can identify effective public health interventions, with findings showing rapid hospitalization response and evolving transmission dynamics that should inform future pandemic preparedness strategies.

Abstract: A comprehensive retrospective analysis of public health interventions, such as large scale testing, quarantining, and contact tracing, can help identify mechanisms most effective in mitigating COVID-19. We investigate China based SARS-CoV-2 transmission patterns (e.g., infection type and likely transmission source) using publicly released tracking data. We collect case reports from local health commissions, the Chinese CDC, and official local government social media, then apply NLP and manual curation to construct transmission/tracking chains. We further analyze tracking data together with Wuhan population mobility data to quantify and visualize temporal and spatial spread dynamics. Results indicate substantial regional differences, with larger cities showing more infections, likely driven by social activities. Most symptomatic individuals (79%) were hospitalized within 5 days of symptom onset, and those with confirmed-case contact sought admission in under 5 days. Infection sources also shifted over time: early cases were largely linked to travel to (or contact with travelers from) Hubei Province, while later transmission was increasingly associated with social activities.

[549] ReDiF: Reinforced Distillation for Few Step Diffusion

Amirhossein Tighkhorshid, Zahra Dehghanian, Gholamali Aminian, Chengchun Shi, Hamid R. Rabiee

Main category: cs.LG

TL;DR: RL-based distillation framework for diffusion models that treats distillation as policy optimization, using reward signals from teacher alignment to guide students to take longer, optimized steps toward high-probability data regions.

Details

Motivation: To address the slow sampling problem in diffusion models by creating more efficient models through distillation, but moving beyond fixed reconstruction/consistency losses to a more flexible, dynamic optimization approach.

Method: Treats distillation as a reinforcement learning policy optimization problem where the student is trained using reward signals derived from alignment with teacher outputs, allowing dynamic exploration of multiple denoising paths and longer optimized steps.

Result: Achieves superior performance with significantly fewer inference steps and computational resources compared to existing distillation techniques, while being model-agnostic and applicable to any diffusion model with suitable reward functions.

Conclusion: The RL-driven distillation framework provides a general optimization paradigm for efficient diffusion learning that dynamically guides students to explore optimal denoising paths rather than relying on incremental refinements.

Abstract: Distillation addresses the slow sampling problem in diffusion models by creating models with smaller size or fewer steps that approximate the behavior of high-step teachers. In this work, we propose a reinforcement learning based distillation framework for diffusion models. Instead of relying on fixed reconstruction or consistency losses, we treat the distillation process as a policy optimization problem, where the student is trained using a reward signal derived from alignment with the teacher’s outputs. This RL driven approach dynamically guides the student to explore multiple denoising paths, allowing it to take longer, optimized steps toward high-probability regions of the data distribution, rather than relying on incremental refinements. Our framework utilizes the inherent ability of diffusion models to handle larger steps and effectively manage the generative process. Experimental results show that our method achieves superior performance with significantly fewer inference steps and computational resources compared to existing distillation techniques. Additionally, the framework is model agnostic, applicable to any type of diffusion models with suitable reward functions, providing a general optimization paradigm for efficient diffusion learning.

[550] MoR: Mixture Of Representations For Mixed-Precision Training

Bor-Yiing Su, Peter Dykas, Mike Chrzanowski, Jatin Chhugani

Main category: cs.LG

TL;DR: MoR is a dynamic per-tensor/sub-tensor quantization framework that analyzes tensor properties to select between FP8 and BF16 representations, achieving 98.38% FP8 quantization while preserving model quality.

Details

Motivation: Mixed-precision training is essential for scaling deep learning models, but requires careful selection of training methods. Current approaches lack dynamic, property-aware quantization that can adapt to tensor characteristics while maintaining model quality.

Method: Mixture-of-Representations (MoR) framework dynamically analyzes tensor numerical properties to select between different representations (FP8/BF16) at per-tensor and sub-tensor granularities. Universal approach preserves quality across quantization strategies and datasets.

Result: Achieves state-of-the-art results with 98.38% of tensors quantized to FP8 format. Maintains FP8 accuracies comparable to existing approaches without fine-grain partitioning, and enables potential use with even lower precision formats like NVFP4.

Conclusion: Dynamic, property-aware quantization shows strong potential for improving low-precision training robustness. MoR framework can enhance mixed-precision training by intelligently selecting representations based on tensor properties while preserving model quality.

Abstract: Mixed-precision training is a crucial technique for scaling deep learning models, but successful mixedprecision training requires identifying and applying the right combination of training methods. This paper presents our preliminary study on Mixture-of-Representations (MoR), a novel, per-tensor and sub-tensor level quantization framework that dynamically analyzes a tensor’s numerical properties to select between a variety of different representations. Based on the framework, we have proposed and experimented concrete algorithms that choose dynamically between FP8 and BF16 representations for both per-tensor and sub-tensor level granularities. Our universal approach is designed to preserve model quality across various quantization partition strategies and datasets. Our initial findings show that this approach can achieve state-of-the-art results with 98.38% of tensors quantized to the FP8 format. This work highlights the potential of dynamic, property-aware quantization while preserving model quality. We believe this approach can generally improve the robustness of low precision training, as demonstrated by achieving FP8 accuracies that are on par with existing approaches without the need for fine-grain partitioning, or can be used in combination with other training methods to improve the leverage of even lower precision number formats such as NVFP4.

[551] Long-Range Distillation: Distilling 10,000 Years of Simulated Climate into Long Timestep AI Weather Models

Scott A. Martin, Noah Brenowitz, Dale Durran, Michael Pritchard

Main category: cs.LG

TL;DR: Long-range distillation trains probabilistic student models using synthetic climate data from autoregressive teacher models to enable accurate long-range weather forecasting in a single step.

Details

Motivation: Autoregressive AI weather models suffer from error accumulation and instability at long lead times, while long-timestep probabilistic models overfit on limited reanalysis data (only 40 years).

Method: Use long-range distillation: generate massive synthetic climate data (10,000+ years) from an autoregressive teacher model (DLESyM), then train probabilistic student models to forecast directly at long-range in a single timestep.

Result: Distilled models outperform climatology, approach teacher model skill while replacing hundreds of autoregressive steps with one step, achieve S2S skill comparable to ECMWF ensemble after ERA5 fine-tuning, and scale with more synthetic data.

Conclusion: AI-generated synthetic training data can scale long-range forecast skill, enabling accurate probabilistic forecasting at subseasonal-to-seasonal timescales with single-step models.

Abstract: Accurate long-range weather forecasting remains a major challenge for AI models, both because errors accumulate over autoregressive rollouts and because reanalysis datasets used for training offer a limited sample of the slow modes of climate variability underpinning predictability. Most AI weather models are autoregressive, producing short lead forecasts that must be repeatedly applied to reach subseasonal-to-seasonal (S2S) or seasonal lead times, often resulting in instability and calibration issues. Long-timestep probabilistic models that generate long-range forecasts in a single step offer an attractive alternative, but training on the 40-year reanalysis record leads to overfitting, suggesting orders of magnitude more training data are required. We introduce long-range distillation, a method that trains a long-timestep probabilistic “student” model to forecast directly at long-range using a huge synthetic training dataset generated by a short-timestep autoregressive “teacher” model. Using the Deep Learning Earth System Model (DLESyM) as the teacher, we generate over 10,000 years of simulated climate to train distilled student models for forecasting across a range of timescales. In perfect-model experiments, the distilled models outperform climatology and approach the skill of their autoregressive teacher while replacing hundreds of autoregressive steps with a single timestep. In the real world, they achieve S2S forecast skill comparable to the ECMWF ensemble forecast after ERA5 fine-tuning. The skill of our distilled models scales with increasing synthetic training data, even when that data is orders of magnitude larger than ERA5. This represents the first demonstration that AI-generated synthetic training data can be used to scale long-range forecast skill.

[552] TEACH: Temporal Variance-Driven Curriculum for Reinforcement Learning

Gaurav Chaudhary, Laxmidhar Behera

Main category: cs.LG

TL;DR: Proposes a Student-Teacher learning paradigm with Temporal Variance-Driven Curriculum to accelerate Goal-Conditioned RL by dynamically prioritizing high-uncertainty goals.

Details

Motivation: Standard uniform goal selection in multi-goal RL is sample inefficient. Biological systems show adaptive, structured learning that could inspire more efficient goal-conditioned RL.

Method: Student-Teacher framework where teacher module dynamically selects goals with highest temporal variance in policy’s Q-values. This targets high-uncertainty goals to provide adaptive learning signals. Method is algorithm-agnostic and integrates with existing RL frameworks.

Result: Evaluated on 11 diverse robotic manipulation and maze navigation tasks. Shows consistent and notable improvements over state-of-the-art curriculum learning and goal-selection methods.

Conclusion: Temporal variance-driven curriculum learning effectively accelerates goal-conditioned RL by adaptively focusing on high-uncertainty goals, inspired by biological learning principles.

Abstract: Reinforcement Learning (RL) has achieved significant success in solving single-goal tasks. However, uniform goal selection often results in sample inefficiency in multi-goal settings where agents must learn a universal goal-conditioned policy. Inspired by the adaptive and structured learning processes observed in biological systems, we propose a novel Student-Teacher learning paradigm with a Temporal Variance-Driven Curriculum to accelerate Goal-Conditioned RL. In this framework, the teacher module dynamically prioritizes goals with the highest temporal variance in the policy’s confidence score, parameterized by the state-action value (Q) function. The teacher provides an adaptive and focused learning signal by targeting these high-uncertainty goals, fostering continual and efficient progress. We establish a theoretical connection between the temporal variance of Q-values and the evolution of the policy, providing insights into the method’s underlying principles. Our approach is algorithm-agnostic and integrates seamlessly with existing RL frameworks. We demonstrate this through evaluation across 11 diverse robotic manipulation and maze navigation tasks. The results show consistent and notable improvements over state-of-the-art curriculum learning and goal-selection methods.

[553] Fundamental Novel Consistency Theory: $H$-Consistency Bounds

Yutao Zhong

Main category: cs.LG

TL;DR: The paper introduces H-consistency bounds, which provide stronger guarantees than Bayes-consistency or H-calibration for surrogate losses in machine learning, analyzing these bounds across binary and multi-class classification with various surrogates including convex losses, adversarial settings, and comp-sum losses.

Details

Motivation: In machine learning, surrogate losses are often optimized instead of target losses due to computational intractability or lack of differentiability, creating a gap between what's optimized and what truly matters for task performance. The paper aims to provide stronger theoretical guarantees about this gap through H-consistency bounds.

Method: The authors develop a comprehensive framework for deriving H-consistency bounds, starting with binary classification (distribution-dependent and -independent bounds), then extending to multi-class classification. They analyze various surrogate families including convex surrogates, adversarial settings, max/sum/constrained losses, and comp-sum losses (cross-entropy, MAE). They also examine growth rates and minimizability gaps.

Result: The paper establishes tight H-consistency bounds for multiple loss families, shows that non-trivial bounds are sometimes unattainable, introduces smooth adversarial variants of comp-sum losses for robust learning, proves universal square-root growth rates for smooth surrogates, and provides analysis of minimizability gaps to guide surrogate selection.

Conclusion: H-consistency bounds offer stronger theoretical guarantees than existing frameworks for surrogate losses, providing a comprehensive analysis across binary and multi-class classification with practical implications for surrogate selection and robust algorithm design.

Abstract: In machine learning, the loss functions optimized during training often differ from the target loss that defines task performance due to computational intractability or lack of differentiability. We present an in-depth study of the target loss estimation error relative to the surrogate loss estimation error. Our analysis leads to $H$-consistency bounds, which are guarantees accounting for the hypothesis set $H$. These bounds offer stronger guarantees than Bayes-consistency or $H$-calibration and are more informative than excess error bounds. We begin with binary classification, establishing tight distribution-dependent and -independent bounds. We provide explicit bounds for convex surrogates (including linear models and neural networks) and analyze the adversarial setting for surrogates like $ρ$-margin and sigmoid loss. Extending to multi-class classification, we present the first $H$-consistency bounds for max, sum, and constrained losses, covering both non-adversarial and adversarial scenarios. We demonstrate that in some cases, non-trivial $H$-consistency bounds are unattainable. We also investigate comp-sum losses (e.g., cross-entropy, MAE), deriving their first $H$-consistency bounds and introducing smooth adversarial variants that yield robust learning algorithms. We develop a comprehensive framework for deriving these bounds across various surrogates, introducing new characterizations for constrained and comp-sum losses. Finally, we examine the growth rates of $H$-consistency bounds, establishing a universal square-root growth rate for smooth surrogates in binary and multi-class tasks, and analyze minimizability gaps to guide surrogate selection.

[554] Theory and Algorithms for Learning with Multi-Class Abstention and Multi-Expert Deferral

Anqi Mao

Main category: cs.LG

TL;DR: This thesis studies learning with multiple-expert deferral systems to address LLM hallucinations and high inference costs, providing comprehensive theoretical guarantees and practical algorithms for classification and regression tasks.

Details

Motivation: Large language models face critical challenges of hallucinations and high inference costs. Using multiple experts offers a solution: deferring uncertain inputs to more capable experts improves reliability, while routing simpler queries to smaller distilled models enhances efficiency.

Method: 1) For learning with abstention (special case of deferral): analyze score-based and predictor-rejector formulations, introduce new surrogate losses with consistency guarantees. 2) For multi-expert deferral: design new surrogate losses for single-stage and two-stage scenarios with H-consistency bounds. 3) For regression with deferral: introduce novel framework for continuous label spaces with multiple experts and various cost structures.

Result: Resolves two open questions in abstention learning with strong non-asymptotic consistency guarantees. Demonstrates superior performance on CIFAR-10, CIFAR-100, and SVHN datasets. Provides effective new algorithms with proven H-consistency for both classification and regression deferral problems.

Conclusion: The thesis presents a comprehensive study of learning with multiple-expert deferral, offering strong theoretical guarantees and practical algorithms that address LLM challenges through intelligent routing between experts, with applications spanning both classification and regression tasks.

Abstract: Large language models (LLMs) have achieved remarkable performance but face critical challenges: hallucinations and high inference costs. Leveraging multiple experts offers a solution: deferring uncertain inputs to more capable experts improves reliability, while routing simpler queries to smaller, distilled models enhances efficiency. This motivates the problem of learning with multiple-expert deferral. This thesis presents a comprehensive study of this problem and the related problem of learning with abstention, supported by strong consistency guarantees. First, for learning with abstention (a special case of deferral), we analyze score-based and predictor-rejector formulations in multi-class classification. We introduce new families of surrogate losses and prove strong non-asymptotic, hypothesis set-specific consistency guarantees, resolving two existing open questions. We analyze both single-stage and practical two-stage settings, with experiments on CIFAR-10, CIFAR-100, and SVHN demonstrating the superior performance of our algorithms. Second, we address general multi-expert deferral in classification. We design new surrogate losses for both single-stage and two-stage scenarios and prove they benefit from strong $H$-consistency bounds. For the two-stage scenario, we show that our surrogate losses are realizable $H$-consistent for constant cost functions, leading to effective new algorithms. Finally, we introduce a novel framework for regression with deferral to address continuous label spaces. Our versatile framework accommodates multiple experts and various cost structures, supporting both single-stage and two-stage methods. It subsumes recent work on regression with abstention. We propose new surrogate losses with proven $H$-consistency and demonstrate the empirical effectiveness of the resulting algorithms.

[555] MetaCD: A Meta Learning Framework for Cognitive Diagnosis based on Continual Learning

Jin Wu, Chanjin Zheng

Main category: cs.LG

TL;DR: MetaCD: A meta-learning framework for cognitive diagnosis that addresses long-tailed data distribution and dynamic changes using continual learning techniques.

Details

Motivation: Existing deep learning models for cognitive diagnosis are limited by long-tailed data distributions and dynamic changes in student-skill-question interactions, requiring better adaptation to new tasks with limited data.

Method: Proposes MetaCD framework combining meta-learning for optimal initialization (to handle long-tailed data) with continual learning using parameter protection mechanism (to adapt to new skills/tasks dynamically).

Result: MetaCD outperforms other baselines on five real-world datasets in both accuracy and generalization, demonstrating improved plasticity, stability, and adaptation capabilities.

Conclusion: MetaCD effectively addresses key challenges in cognitive diagnosis by combining meta-learning and continual learning, providing a robust framework for adaptive educational assessment systems.

Abstract: Cognitive diagnosis is an essential research topic in intelligent education, aimed at assessing the level of mastery of different skills by students. So far, many research works have used deep learning models to explore the complex interactions between students, questions, and skills. However, the performance of existing method is frequently limited by the long-tailed distribution and dynamic changes in the data. To address these challenges, we propose a meta-learning framework for cognitive diagnosis based on continual learning (MetaCD). This framework can alleviate the long-tailed problem by utilizing meta-learning to learn the optimal initialization state, enabling the model to achieve good accuracy on new tasks with only a small amount of data. In addition, we utilize a continual learning method named parameter protection mechanism to give MetaCD the ability to adapt to new skills or new tasks, in order to adapt to dynamic changes in data. MetaCD can not only improve the plasticity of our model on a single task, but also ensure the stability and generalization of the model on sequential tasks. Comprehensive experiments on five real-world datasets show that MetaCD outperforms other baselines in both accuracy and generalization.

[556] Sat-EnQ: Satisficing Ensembles of Weak Q-Learners for Reliable and Compute-Efficient Reinforcement Learning

Ünver Çiftçi

Main category: cs.LG

TL;DR: Sat-EnQ: A two-phase deep Q-learning framework that first learns to be “good enough” (satisficing) with an ensemble of lightweight networks, then distills and fine-tunes with standard Double DQN, achieving significant variance reduction and eliminating catastrophic failures.

Details

Motivation: Deep Q-learning algorithms are notoriously unstable, especially during early training when the maximization operator amplifies estimation errors, leading to catastrophic overestimation and failures.

Method: Two-phase framework: Phase 1 trains an ensemble of lightweight Q-networks under a satisficing objective that limits early value growth using a dynamic baseline, producing diverse, low-variance estimates. Phase 2 distills the ensemble into a larger network and fine-tunes with standard Double DQN.

Result: Achieves 3.8x variance reduction, eliminates catastrophic failures (0% vs 50% for DQN), maintains 79% performance under environmental noise, and requires 2.5x less compute than bootstrapped ensembles.

Conclusion: Sat-EnQ provides a principled path toward robust reinforcement learning by embracing satisficing before optimization, inspired by bounded rationality theory and developmental learning.

Abstract: Deep Q-learning algorithms remain notoriously unstable, especially during early training when the maximization operator amplifies estimation errors. Inspired by bounded rationality theory and developmental learning, we introduce Sat-EnQ, a two-phase framework that first learns to be ``good enough’’ before optimizing aggressively. In Phase 1, we train an ensemble of lightweight Q-networks under a satisficing objective that limits early value growth using a dynamic baseline, producing diverse, low-variance estimates while avoiding catastrophic overestimation. In Phase 2, the ensemble is distilled into a larger network and fine-tuned with standard Double DQN. We prove theoretically that satisficing induces bounded updates and cannot increase target variance, with a corollary quantifying conditions for substantial reduction. Empirically, Sat-EnQ achieves 3.8x variance reduction, eliminates catastrophic failures (0% vs 50% for DQN), maintains 79% performance under environmental noise}, and requires 2.5x less compute than bootstrapped ensembles. Our results highlight a principled path toward robust reinforcement learning by embracing satisficing before optimization.

[557] Multiple Token Divergence: Measuring and Steering In-Context Computation Density

Vincent Herrmann, Eric Alcaide, Michael Wand, Jürgen Schmidhuber

Main category: cs.LG

TL;DR: MTD is a simple KL divergence measure between full model output and shallow auxiliary head output that quantifies computational effort without training, enabling analysis and steering of LM computational dynamics.

Details

Motivation: Existing metrics like next-token loss fail to capture reasoning complexity, and prior methods based on latent state compressibility are invasive and unstable. There's a need for a practical, lightweight tool to measure and control computational effort in language models.

Method: Proposes Multiple Token Divergence (MTD) - KL divergence between model’s full output distribution and shallow auxiliary prediction head output. Also introduces Divergence Steering, a decoding method to control computational character of generated text using MTD.

Result: MTD outperforms prior methods at distinguishing complex vs simple tasks. On math reasoning benchmarks, MTD positively correlates with problem difficulty. Lower MTD is associated with more accurate reasoning.

Conclusion: MTD provides a practical, lightweight tool for analyzing and steering computational dynamics of language models, requiring no additional training and working directly with pre-trained models with multiple prediction heads.

Abstract: Measuring the in-context computational effort of language models is a key challenge, as metrics like next-token loss fail to capture reasoning complexity. Prior methods based on latent state compressibility can be invasive and unstable. We propose Multiple Token Divergence (MTD), a simple measure of computational effort defined as the KL divergence between a model’s full output distribution and that of a shallow, auxiliary prediction head. MTD can be computed directly from pre-trained models with multiple prediction heads, requiring no additional training. Building on this, we introduce Divergence Steering, a novel decoding method to control the computational character of generated text. We empirically show that MTD is more effective than prior methods at distinguishing complex tasks from simple ones. On mathematical reasoning benchmarks, MTD correlates positively with problem difficulty. Lower MTD is associated with more accurate reasoning. MTD provides a practical, lightweight tool for analyzing and steering the computational dynamics of language models.

[558] APO: Alpha-Divergence Preference Optimization

Wang Zixian

Main category: cs.LG

TL;DR: APO is an anchored preference optimization framework that uses alpha-divergence to interpolate between forward and reverse KL behaviors, enabling smooth transition from mode-covering to mode-seeking optimization with stability.

Details

Motivation: Current alignment methods are limited by committing to either forward KL (mode-covering but under-exploiting) or reverse KL (mode-seeking but risking collapse). There's a need for a unified framework that can smoothly transition between these regimes while maintaining stability.

Method: APO uses Csiszar alpha-divergence within anchored coordinates to continuously interpolate between forward and reverse KL behaviors. It includes a reward-and-confidence-guarded alpha schedule that transitions from coverage to exploitation only when policy is improving and confidently calibrated.

Result: Experiments on Qwen3-1.7B with math-level3 show APO achieves competitive performance with GRPO and GSPO baselines while maintaining training stability.

Conclusion: APO provides a principled framework for smoothly transitioning between divergence regimes in preference optimization, offering both competitive performance and training stability through its adaptive alpha scheduling mechanism.

Abstract: Two divergence regimes dominate modern alignment practice. Supervised fine-tuning and many distillation-style objectives implicitly minimize the forward KL divergence KL(q || pi_theta), yielding stable mode-covering updates but often under-exploiting high-reward modes. In contrast, PPO-style online reinforcement learning from human feedback behaves closer to reverse KL divergence KL(pi_theta || q), enabling mode-seeking improvements but risking mode collapse. Recent anchored methods, such as ADPO, show that performing the projection in anchored coordinates can substantially improve stability, yet they typically commit to a single divergence. We introduce Alpha-Divergence Preference Optimization (APO), an anchored framework that uses Csiszar alpha-divergence to continuously interpolate between forward and reverse KL behavior within the same anchored geometry. We derive unified gradient dynamics parameterized by alpha, analyze gradient variance properties, and propose a practical reward-and-confidence-guarded alpha schedule that transitions from coverage to exploitation only when the policy is both improving and confidently calibrated. Experiments on Qwen3-1.7B with math-level3 demonstrate that APO achieves competitive performance with GRPO and GSPO baselines while maintaining training stability.

[559] FLOW: A Feedback-Driven Synthetic Longitudinal Dataset of Work and Wellbeing

Wafaa El Husseini

Main category: cs.LG

TL;DR: FLOW is a synthetic longitudinal dataset simulating daily interactions between workload, lifestyle behaviors, and wellbeing for 1,000 individuals over two years, designed to address privacy and access limitations in real-world data.

Details

Motivation: Privacy, ethical, and logistical constraints limit access to longitudinal, individual-level data on work-life balance and wellbeing, creating challenges for reproducible research, methodological benchmarking, and education in stress modeling and behavioral analysis.

Method: FLOW is generated using a rule-based, feedback-driven simulation that produces coherent temporal dynamics across variables like stress, sleep, mood, physical activity, and body weight. The dataset includes 1,000 individuals over two years with daily resolution, plus a configurable data generation tool for reproducible experimentation.

Result: A publicly available synthetic longitudinal dataset (FLOW) that simulates daily interactions between workload, lifestyle, and wellbeing variables, providing a controlled experimental environment for research when real-world data are inaccessible.

Conclusion: FLOW serves as a valuable resource for exploratory analysis, methodological development, and benchmarking in domains like stress modeling and behavioral analysis, offering a controlled alternative to real-world data while addressing privacy and access limitations.

Abstract: Access to longitudinal, individual-level data on work-life balance and wellbeing is limited by privacy, ethical, and logistical constraints. This poses challenges for reproducible research, methodological benchmarking, and education in domains such as stress modeling, behavioral analysis, and machine learning. We introduce FLOW, a synthetic longitudinal dataset designed to model daily interactions between workload, lifestyle behaviors, and wellbeing. FLOW is generated using a rule-based, feedback-driven simulation that produces coherent temporal dynamics across variables such as stress, sleep, mood, physical activity, and body weight. The dataset simulates 1{,}000 individuals over a two-year period with daily resolution and is released as a publicly available resource. In addition to the static dataset, we describe a configurable data generation tool that enables reproducible experimentation under adjustable behavioral and contextual assumptions. FLOW is intended as a controlled experimental environment rather than a proxy for observed human populations, supporting exploratory analysis, methodological development, and benchmarking where real-world data are inaccessible.

[560] A Context-Aware Temporal Modeling through Unified Multi-Scale Temporal Encoding and Hierarchical Sequence Learning for Single-Channel EEG Sleep Staging

Amirali Vakili, Salar Jahanshiri, Armin Salimi-Badr

Main category: cs.LG

TL;DR: A context-aware and interpretable framework for single-channel EEG sleep staging that improves N1 stage detection using multi-scale feature extraction, temporal modeling, and techniques to address class imbalance.

Details

Motivation: Automatic sleep staging is crucial for healthcare due to widespread sleep disorders. Single-channel EEG is practical but existing methods suffer from class imbalance (especially for N1 stage), limited receptive-field modeling, and lack of interpretability in black-box models.

Method: Combines compact multi-scale feature extraction with temporal modeling to capture local and long-range dependencies. Uses class-weighted loss functions and data augmentation to address data imbalance. Segments EEG signals into sub-epoch chunks and averages softmax probabilities across chunks for contextual representation and robustness.

Result: Achieves 89.72% overall accuracy and 85.46% macro-average F1-score. Notably achieves 61.7% F1-score for the challenging N1 stage, showing substantial improvement over previous methods on SleepEDF datasets.

Conclusion: The proposed approach effectively improves sleep staging performance while maintaining interpretability and suitability for real-world clinical applications, particularly demonstrating strong N1 stage detection capabilities.

Abstract: Automatic sleep staging is a critical task in healthcare due to the global prevalence of sleep disorders. This study focuses on single-channel electroencephalography (EEG), a practical and widely available signal for automatic sleep staging. Existing approaches face challenges such as class imbalance, limited receptive-field modeling, and insufficient interpretability. This work proposes a context-aware and interpretable framework for single-channel EEG sleep staging, with particular emphasis on improving detection of the N1 stage. Many prior models operate as black boxes with stacked layers, lacking clearly defined and interpretable feature extraction roles.The proposed model combines compact multi-scale feature extraction with temporal modeling to capture both local and long-range dependencies. To address data imbalance, especially in the N1 stage, classweighted loss functions and data augmentation are applied. EEG signals are segmented into sub-epoch chunks, and final predictions are obtained by averaging softmax probabilities across chunks, enhancing contextual representation and robustness.The proposed framework achieves an overall accuracy of 89.72% and a macro-average F1-score of 85.46%. Notably, it attains an F1- score of 61.7% for the challenging N1 stage, demonstrating a substantial improvement over previous methods on the SleepEDF datasets. These results indicate that the proposed approach effectively improves sleep staging performance while maintaining interpretability and suitability for real-world clinical applications.

[561] Fusion or Confusion? Multimodal Complexity Is Not All You Need

Tillmann Rheude, Roland Eils, Benjamin Wild

Main category: cs.LG

TL;DR: Complex multimodal learning methods don’t reliably outperform simple baselines under standardized conditions; methodological rigor matters more than architectural novelty.

Details

Motivation: To challenge the assumption that complex multimodal-specific architectures inherently outperform simpler approaches, and to address methodological shortcomings in multimodal learning research.

Method: Large-scale empirical study reimplementing 19 high-impact methods under standardized conditions, evaluating across 9 diverse datasets with up to 23 modalities, testing generalizability to new tasks and missing modalities, and proposing SimBaMM (Simple Baseline for Multimodal Learning) - a late-fusion Transformer architecture.

Result: Under standardized conditions with rigorous hyperparameter tuning, complex architectures do not reliably outperform SimBaMM. More complex methods perform comparably to SimBaMM and frequently don’t outperform well-tuned unimodal baselines, especially in small-data regimes.

Conclusion: Research should shift focus from architectural novelty to methodological rigor; the paper provides a reliability checklist for more comparable, robust, and trustworthy evaluations in multimodal learning.

Abstract: Deep learning architectures for multimodal learning have increased in complexity, driven by the assumption that multimodal-specific methods improve performance. We challenge this assumption through a large-scale empirical study reimplementing 19 high-impact methods under standardized conditions, evaluating them across nine diverse datasets with up to 23 modalities, and testing their generalizability to new tasks beyond their original scope, including settings with missing modalities. We propose a Simple Baseline for Multimodal Learning (SimBaMM), a straightforward late-fusion Transformer architecture, and demonstrate that under standardized experimental conditions with rigorous hyperparameter tuning of all methods, more complex architectures do not reliably outperform SimBaMM. Statistical analysis indicates that more complex methods perform comparably to SimBaMM and frequently do not reliably outperform well-tuned unimodal baselines, especially in the small-data regime considered in many original studies. To support our findings, we include a case study of a recent multimodal learning method highlighting the methodological shortcomings in the literature. In addition, we provide a pragmatic reliability checklist to promote comparable, robust, and trustworthy future evaluations. In summary, we argue for a shift in focus: away from the pursuit of architectural novelty and toward methodological rigor.

[562] Merge before Forget: A Single LoRA Continual Learning via Continual Merging

Fuli Qiao, Mehrdad Mahdavi

Main category: cs.LG

TL;DR: Proposes a novel continual learning method for LLMs that orthogonally initializes and sequentially merges LoRA updates into a single unified LoRA, maintaining constant memory complexity and minimizing task interference.

Details

Motivation: Current LoRA continual learning methods have limitations: they retain frozen LoRAs or generate data representations, leading to growing computational memory, limited storage, and potential task interference due to lack of effective merging mechanisms.

Method: Orthogonal basis extraction from previous LoRA to initialize new task learning, sequential merging of LoRA updates into a single unified LoRA, and time-aware scaling mechanism to balance new/old knowledge using intrinsic asymmetry property of LoRA components.

Result: Maintains constant memory complexity with task count, minimizes interference via orthogonal initialization, improves performance through adaptive asymmetric LoRA merging, validated across diverse benchmarks with Llama models.

Conclusion: The proposed method effectively addresses memory growth and task interference issues in continual learning, demonstrating superior efficiency and performance through theoretical analysis and extensive experiments.

Abstract: Parameter-efficient continual learning has emerged as a promising approach for large language models (LLMs) to mitigate catastrophic forgetting while enabling adaptation to new tasks. Current Low-Rank Adaptation (LoRA) continual learning techniques often retain and freeze previously learned LoRAs or generate data representations to overcome forgetting, typically utilizing these to support new LoRAs learn new tasks. However, these methods not only ignore growing computational memory with tasks and limited storage space but also suffer from potential task interference due to the lack of effective LoRA merging mechanisms. In this paper, we propose a novel continual learning method that orthogonally initializes and sequentially merges LoRAs updates into a single unified LoRA. Our method leverages orthogonal basis extraction from previously learned LoRA to initialize the learning of new tasks, further exploits the intrinsic asymmetry property of LoRA components by using a time-aware scaling mechanism to balance new and old knowledge during continual merging. Our approach maintains constant memory complexity with respect to the number of tasks, minimizes interference between past and new tasks via orthogonal basis initialization, and improves performance over asymmetric LoRA merging via adaptive scaling. We provide theoretical analysis to justify our design and conduct extensive experiments across diverse continual learning benchmarks using various Llama models, demonstrating the effectiveness and efficiency of our method.

[563] Mechanistic Analysis of Circuit Preservation in Federated Learning

Muhammad Haseeb, Salaar Masood, Muhammad Abdullah Sohail

Main category: cs.LG

TL;DR: The paper uses mechanistic interpretability to show that Non-IID data in federated learning causes conflicting client updates to destroy specialized neural circuits, leading to accuracy degradation.

Details

Motivation: While FL performance degradation under Non-IID data is well-known, the internal mechanistic causes remain a black box. The paper aims to diagnose this failure mode at a circuit level rather than just observing statistical effects.

Method: Train inherently interpretable, weight-sparse neural networks within FL framework, track circuits across clients and rounds using Intersection-over-Union (IoU) to quantify circuit preservation, and analyze FedAvg through mechanistic interpretability lens.

Result: Provides first mechanistic evidence that Non-IID data distributions cause structurally distinct local circuits to diverge and degrade in the global model, showing circuit collapse through destructive interference of functional sub-networks.

Conclusion: Reframes statistical drift in FL as concrete, observable failure of mechanistic preservation, paving way for more targeted solutions by understanding circuit-level degradation rather than just statistical effects.

Abstract: Federated Learning (FL) enables collaborative training of models on decentralized data, but its performance degrades significantly under Non-IID (non-independent and identically distributed) data conditions. While this accuracy loss is well-documented, the internal mechanistic causes remain a black box. This paper investigates the canonical FedAvg algorithm through the lens of Mechanistic Interpretability (MI) to diagnose this failure mode. We hypothesize that the aggregation of conflicting client updates leads to circuit collapse, the destructive interference of functional, sparse sub-networks responsible for specific class predictions. By training inherently interpretable, weight-sparse neural networks within an FL framework, we identify and track these circuits across clients and communication rounds. Using Intersection-over-Union (IoU) to quantify circuit preservation, we provide the first mechanistic evidence that Non-IID data distributions cause structurally distinct local circuits to diverge, leading to their degradation in the global model. Our findings reframe the problem of statistical drift in FL as a concrete, observable failure of mechanistic preservation, paving the way for more targeted solutions.

[564] Trust Region Masking for Long-Horizon LLM Reinforcement Learning

Yingru Li, Jiacai Liu, Jiawei Xu, Yuxuan Tong, Ziniu Li, Baoxiang Wang

Main category: cs.LG

TL;DR: The paper addresses the approximation error in policy gradient methods for large language models when using off-policy rollouts, deriving tighter bounds that scale better with sequence length and proposing Trust Region Masking to ensure monotonic improvement guarantees.

Details

Motivation: Off-policy mismatch between rollout policy and target policy in LLM-RL is unavoidable due to implementation divergence, mixture-of-experts routing discontinuities, and distributed training staleness. Classical trust region bounds scale poorly (O(T²)) with sequence length, making them vacuous for long-horizon tasks.

Method: Derived two tighter bounds: Pinsker-Marginal bound (O(T³/²)) and Mixed bound (O(T)), both depending on maximum token-level KL divergence across all positions. Proposed Trust Region Masking (TRM) which excludes entire sequences from gradient computation if any token violates the trust region.

Result: The new bounds provide non-vacuous guarantees for long-horizon tasks, with TRM offering the first non-vacuous monotonic improvement guarantees for long-horizon LLM-RL by controlling sequence-level KL divergence.

Conclusion: Token-independent methods like PPO clipping cannot control sequence-level KL divergence required for tight bounds. TRM addresses this limitation by excluding problematic sequences, enabling reliable policy improvement in long-horizon LLM reinforcement learning.

Abstract: Policy gradient methods for large language models optimize a surrogate objective computed from samples of a rollout policy $π_{\text{roll}}$. When $π_{\text{roll}} \ne π_θ$, there is approximation error between the surrogate and the true objective. Prior work has shown that this off-policy mismatch is unavoidable in modern LLM-RL due to implementation divergence, mixture-of-experts routing discontinuities, and distributed training staleness. Classical trust region bounds on the resulting error scale as $O(T^2)$ with sequence length $T$, rendering them vacuous for long-horizon tasks. We derive two tighter bounds: a Pinsker-Marginal bound scaling as $O(T^{3/2})$ and a Mixed bound scaling as $O(T)$. Crucially, both bounds depend on $D_{kl}^{tok,max}$ – the maximum token-level KL divergence across all positions in a sequence. This is inherently a sequence-level quantity: it requires examining the entire trajectory to compute, and therefore cannot be controlled by token-independent methods like PPO clipping. We propose Trust Region Masking (TRM), which excludes entire sequences from gradient computation if any token violates the trust region, providing the first non-vacuous monotonic improvement guarantees for long-horizon LLM-RL.

[565] PI-MFM: Physics-informed multimodal foundation model for solving partial differential equations

Min Zhu, Jingmin Sun, Zecheng Zhang, Hayden Schaeffer, Lu Lu

Main category: cs.LG

TL;DR: PI-MFM is a physics-informed multimodal foundation model that enforces PDE governing equations during training, enabling data-efficient learning of PDE solution operators across diverse equation families with improved performance over purely data-driven approaches.

Details

Motivation: Existing multi-operator learning approaches for PDEs are data-hungry and neglect physics during training, limiting their practical application and data efficiency.

Method: PI-MFM takes symbolic PDE representations as input, automatically assembles PDE residual losses via vectorized derivative computation, and enables unified physics-informed training across equation families through physics losses during pretraining and adaptation.

Result: PI-MFM outperforms data-driven counterparts on 13 parametric 1D time-dependent PDE families, especially with sparse data, partial observations, or few labeled pairs. Physics losses improve noise robustness, and zero-shot fine-tuning to unseen PDE families reduces test errors to ~1% without labeled solution data.

Conclusion: PI-MFM provides a practical, scalable path toward data-efficient, transferable PDE solvers by integrating physics directly into multimodal foundation model training, enabling better performance with less data and improved generalization to unseen PDE families.

Abstract: Partial differential equations (PDEs) govern a wide range of physical systems, and recent multimodal foundation models have shown promise for learning PDE solution operators across diverse equation families. However, existing multi-operator learning approaches are data-hungry and neglect physics during training. Here, we propose a physics-informed multimodal foundation model (PI-MFM) framework that directly enforces governing equations during pretraining and adaptation. PI-MFM takes symbolic representations of PDEs as the input, and automatically assembles PDE residual losses from the input expression via a vectorized derivative computation. These designs enable any PDE-encoding multimodal foundation model to be trained or adapted with unified physics-informed objectives across equation families. On a benchmark of 13 parametric one-dimensional time-dependent PDE families, PI-MFM consistently outperforms purely data-driven counterparts, especially with sparse labeled spatiotemporal points, partially observed time domains, or few labeled function pairs. Physics losses further improve robustness against noise, and simple strategies such as resampling collocation points substantially improve accuracy. We also analyze the accuracy, precision, and computational cost of automatic differentiation and finite differences for derivative computation within PI-MFM. Finally, we demonstrate zero-shot physics-informed fine-tuning to unseen PDE families: starting from a physics-informed pretrained model, adapting using only PDE residuals and initial/boundary conditions, without any labeled solution data, rapidly reduces test errors to around 1% and clearly outperforms physics-only training from scratch. These results show that PI-MFM provides a practical and scalable path toward data-efficient, transferable PDE solvers.

[566] Multimodal Functional Maximum Correlation for Emotion Recognition

Deyang Zheng, Tianyi Zhang, Wenming Zheng, Shujian Yu

Main category: cs.LG

TL;DR: MFMC is a self-supervised learning framework that captures higher-order multimodal dependencies in affective computing using Dual Total Correlation, outperforming pairwise alignment methods on emotion recognition benchmarks.

Details

Motivation: Emotional states involve complex, coordinated physiological responses across multiple modalities (central and autonomic systems), but existing self-supervised learning approaches rely on insufficient pairwise alignment objectives that fail to capture higher-order interactions and dependencies among more than two modalities.

Method: Proposes Multimodal Functional Maximum Correlation (MFMC) framework that maximizes higher-order multimodal dependence through a Dual Total Correlation objective. Uses a tight sandwich bound and optimizes it with functional maximum correlation analysis (FMCA) based trace surrogate to capture joint multimodal interactions directly, avoiding pairwise contrastive losses.

Result: Achieves state-of-the-art or competitive performance on three public affective computing benchmarks under both subject-dependent and subject-independent protocols. Improves subject-dependent accuracy on CEAP-360VR from 78.9% to 86.8%, and subject-independent accuracy from 27.5% to 33.1% using EDA signal alone. Remains within 0.8 percentage points of best method on challenging EEG subject-independent split of MAHNOB-HCI.

Conclusion: MFMC effectively captures higher-order multimodal dependencies in affective computing, demonstrating robustness to inter-subject variability and outperforming pairwise alignment methods, making it a promising approach for learning joint physiological dynamics in emotion recognition.

Abstract: Emotional states manifest as coordinated yet heterogeneous physiological responses across central and autonomic systems, posing a fundamental challenge for multimodal representation learning in affective computing. Learning such joint dynamics is further complicated by the scarcity and subjectivity of affective annotations, which motivates the use of self-supervised learning (SSL). However, most existing SSL approaches rely on pairwise alignment objectives, which are insufficient to characterize dependencies among more than two modalities and fail to capture higher-order interactions arising from coordinated brain and autonomic responses. To address this limitation, we propose Multimodal Functional Maximum Correlation (MFMC), a principled SSL framework that maximizes higher-order multimodal dependence through a Dual Total Correlation (DTC) objective. By deriving a tight sandwich bound and optimizing it using a functional maximum correlation analysis (FMCA) based trace surrogate, MFMC captures joint multimodal interactions directly, without relying on pairwise contrastive losses. Experiments on three public affective computing benchmarks demonstrate that MFMC consistently achieves state-of-the-art or competitive performance under both subject-dependent and subject-independent evaluation protocols, highlighting its robustness to inter-subject variability. In particular, MFMC improves subject-dependent accuracy on CEAP-360VR from 78.9% to 86.8%, and subject-independent accuracy from 27.5% to 33.1% using the EDA signal alone. Moreover, MFMC remains within 0.8 percentage points of the best-performing method on the most challenging EEG subject-independent split of MAHNOB-HCI. Our code is available at https://github.com/DY9910/MFMC.

[567] Breaking the Memory Wall: Exact Analytical Differentiation via Tiled Operator-Space Evolution

Shuhuan Wang, Yuzhen Xie, Jiayi Li, Yinliang Diao

Main category: cs.LG

TL;DR: PGF enables O(1) memory gradient computation for SSMs, allowing genomic-scale modeling on consumer GPUs by avoiding intermediate graph materialization.

Details

Motivation: Current SSMs suffer from O(L) memory scaling during backpropagation, preventing genomic-scale modeling (L > 10^5) on consumer hardware due to memory constraints.

Method: Phase Gradient Flow (PGF) computes exact analytical derivatives by operating in state-space manifold directly, using Tiled Operator-Space Evolution (TOSE) to bypass intermediate computational graph materialization.

Result: 94% reduction in peak VRAM, 23x throughput increase vs standard Autograd, stable near-machine precision across extreme sequences (128,000-step benchmark), enables chromosome-scale sensitivity analysis on single GPU.

Conclusion: PGF bridges gap between theoretical infinite-context SSMs and practical hardware limitations by enabling O(1) memory gradient computation for genomic-scale modeling.

Abstract: Selective State Space Models (SSMs) achieve linear-time inference, yet their gradient-based sensitivity analysis remains bottlenecked by O(L) memory scaling during backpropagation. This memory constraint precludes genomic-scale modeling (L > 10^5) on consumer-grade hardware. We introduce Phase Gradient Flow (PGF), a framework that computes exact analytical derivatives by operating directly in the state-space manifold, bypassing the need to materialize the intermediate computational graph. By reframing SSM dynamics as Tiled Operator-Space Evolution (TOSE), our method delivers O(1) memory complexity relative to sequence length, yielding a 94% reduction in peak VRAM and a 23x increase in throughput compared to standard Autograd. Unlike parallel prefix scans that exhibit numerical divergence in stiff ODE regimes, PGF ensures stability through invariant error scaling, maintaining near-machine precision across extreme sequences. We demonstrate the utility of PGF on an impulse-response benchmark with 128,000-step sequences - a scale where conventional Autograd encounters prohibitive memory overhead, often leading to out-of-memory (OOM) failures in multi-layered models. Our work enables chromosome-scale sensitivity analysis on a single GPU, bridging the gap between theoretical infinite-context models and practical hardware limitations.

[568] FLEX-MoE: Federated Mixture-of-Experts with Load-balanced Expert Assignment

Boyang Zhang, Xiaobing Chen, Songyang Zhang, Shuai Zhang, Xiangwei Zhou, Mingxuan Sun

Main category: cs.LG

TL;DR: FLEX-MoE is a federated learning framework for Mixture-of-Experts models that optimizes expert assignment and load balancing under client capacity constraints, addressing resource limitations and non-IID data challenges in edge devices.

Details

Motivation: Two main challenges in deploying MoE models with federated learning: 1) resource-constrained edge devices cannot store full expert sets, and 2) non-IID data distributions cause severe expert load imbalance that degrades model performance.

Method: Introduces client-expert fitness scores that quantify expert suitability for local datasets through training feedback, and employs an optimization-based algorithm to maximize client-expert specialization while enforcing balanced expert utilization system-wide.

Result: Comprehensive experiments on three different datasets demonstrate superior performance of FLEX-MoE, together with its ability to maintain balanced expert utilization across diverse resource-constrained scenarios.

Conclusion: FLEX-MoE effectively addresses both expert assignment optimization and load balancing challenges in federated MoE settings, outperforming existing greedy methods that focus only on personalization while ignoring load imbalance issues.

Abstract: Mixture-of-Experts (MoE) models enable scalable neural networks through conditional computation. However, their deployment with federated learning (FL) faces two critical challenges: 1) resource-constrained edge devices cannot store full expert sets, and 2) non-IID data distributions cause severe expert load imbalance that degrades model performance. To this end, we propose \textbf{FLEX-MoE}, a novel federated MoE framework that jointly optimizes expert assignment and load balancing under limited client capacity. Specifically, our approach introduces client-expert fitness scores that quantify the expert suitability for local datasets through training feedback, and employs an optimization-based algorithm to maximize client-expert specialization while enforcing balanced expert utilization system-wide. Unlike existing greedy methods that focus solely on personalization while ignoring load imbalance, our FLEX-MoE is capable of addressing the expert utilization skew, which is particularly severe in FL settings with heterogeneous data. Our comprehensive experiments on three different datasets demonstrate the superior performance of the proposed FLEX-MoE, together with its ability to maintain balanced expert utilization across diverse resource-constrained scenarios.

[569] Taming the Tail: Stable LLM Reinforcement Learning via Dynamic Vocabulary Pruning

Yingru Li, Jiawei Xu, Jiacai Liu, Yuxuan Tong, Ziniu Li, Tianle Cai, Ge Zhang, Qian Liu, Baoxiang Wang

Main category: cs.LG

TL;DR: RL for LLMs suffers from training-inference mismatch due to different probability distributions in training vs inference systems. The mismatch affects low-probability tokens most, causing systematic biases that destabilize training. Solution: dynamically prune extreme tail tokens from vocabulary during RL to trade large biased mismatches for small bounded optimization bias.

Details

Motivation: Reinforcement learning for LLMs faces a fundamental tension between high-throughput inference engines and numerically-precise training systems that produce different probability distributions from the same parameters, creating a training-inference mismatch that destabilizes gradient estimation.

Method: Propose constraining the RL objective to a dynamically-pruned “safe” vocabulary that excludes extreme tail tokens. This trades large, systematically biased mismatches for a small, bounded optimization bias. The method involves pruning low-probability tokens that contribute most to the mismatch problem.

Result: Theoretical analysis shows the mismatch has asymmetric effect: bound on log-probability mismatch scales as (1-p) where p is token probability. Low-probability tokens exhibit systematically biased mismatches that accumulate over sequences. Empirically, the method achieves stable training.

Conclusion: Rather than applying post-hoc corrections, dynamically pruning the vocabulary tail during RL training addresses the training-inference mismatch problem by eliminating the most problematic tokens, achieving stable training with bounded optimization bias.

Abstract: Reinforcement learning for large language models (LLMs) faces a fundamental tension: high-throughput inference engines and numerically-precise training systems produce different probability distributions from the same parameters, creating a training-inference mismatch. We prove this mismatch has an asymmetric effect: the bound on log-probability mismatch scales as $(1-p)$ where $p$ is the token probability. For high-probability tokens, this bound vanishes, contributing negligibly to sequence-level mismatch. For low-probability tokens in the tail, the bound remains large, and moreover, when sampled, these tokens exhibit systematically biased mismatches that accumulate over sequences, destabilizing gradient estimation. Rather than applying post-hoc corrections, we propose constraining the RL objective to a dynamically-pruned ``safe’’ vocabulary that excludes the extreme tail. By pruning such tokens, we trade large, systematically biased mismatches for a small, bounded optimization bias. Empirically, our method achieves stable training; theoretically, we bound the optimization bias introduced by vocabulary pruning.

[570] Rethinking Fine-Tuning: Unlocking Hidden Capabilities in Vision-Language Models

Mingyuan Zhang, Yue Bai, Yifan Wang, Yiyang Huang, Yun Fu

Main category: cs.LG

TL;DR: MFT (Mask Fine-Tuning) applied to VLMs achieves better performance than LoRA and full fine-tuning by learning gating scores instead of updating weights, enabling structural reparameterization of existing knowledge.

Details

Motivation: Current fine-tuning approaches for Vision-Language Models (VLMs) rely on explicit weight updates, overlooking the extensive representational structures already encoded in pre-trained models that remain underutilized. The authors want to explore whether effective adaptation can emerge from reorganizing existing knowledge rather than updating weights.

Method: Apply Mask Fine-Tuning (MFT) to VLMs, specifically targeting the language and projector components. Instead of updating weights, MFT assigns learnable gating scores to each weight, allowing the model to reorganize its internal subnetworks for downstream task adaptation. This approach is grounded in structural reparameterization perspective.

Result: MFT consistently surpasses LoRA variants and even full fine-tuning, achieving high performance without altering the frozen backbone. Experiments were conducted with different language backbones and compared against strong PEFT baselines.

Conclusion: Effective adaptation for VLMs can emerge not only from updating weights but also from reestablishing connections among the model’s existing knowledge. MFT provides a powerful and efficient post-training paradigm that reorganizes internal subnetworks rather than modifying weights.

Abstract: Explorations in fine-tuning Vision-Language Models (VLMs), such as Low-Rank Adaptation (LoRA) from Parameter Efficient Fine-Tuning (PEFT), have made impressive progress. However, most approaches rely on explicit weight updates, overlooking the extensive representational structures already encoded in pre-trained models that remain underutilized. Recent works have demonstrated that Mask Fine-Tuning (MFT) can be a powerful and efficient post-training paradigm for language models. Instead of updating weights, MFT assigns learnable gating scores to each weight, allowing the model to reorganize its internal subnetworks for downstream task adaptation. In this paper, we rethink fine-tuning for VLMs from a structural reparameterization perspective grounded in MFT. We apply MFT to the language and projector components of VLMs with different language backbones and compare against strong PEFT baselines. Experiments show that MFT consistently surpasses LoRA variants and even full fine-tuning, achieving high performance without altering the frozen backbone. Our findings reveal that effective adaptation can emerge not only from updating weights but also from reestablishing connections among the model’s existing knowledge. Code available at: https://github.com/Ming-K9/MFT-VLM

[571] Osmotic Learning: A Self-Supervised Paradigm for Decentralized Contextual Data Representation

Mario Colosi, Reza Farahani, Maria Fazio, Radu Prodan, Massimo Villari

Main category: cs.LG

TL;DR: OSM-L is a self-supervised distributed learning paradigm that extracts higher-level latent knowledge from distributed data without raw data exchange, using osmosis to synthesize compact representations and identify correlated data groups.

Details

Motivation: Data in distributed systems has deeper significance when viewed in context, with interdependent sources revealing hidden relationships and latent structures that are valuable for applications. Current approaches may require raw data exchange or fail to capture contextual patterns effectively.

Method: OSM-L uses osmosis to synthesize dense, compact representations by extracting contextual information without raw data exchange. It iteratively aligns local data representations, enabling information diffusion and convergence into dynamic equilibrium. The method also identifies correlated data groups as a decentralized clustering mechanism.

Result: Experimental results show OSM-L achieves convergence and strong representation capabilities on structured datasets, with over 0.99 accuracy in local information alignment while preserving contextual integrity.

Conclusion: OSM-L successfully uncovers higher-level latent knowledge from distributed data through osmotic learning, providing an effective self-supervised paradigm for distributed systems that maintains privacy through representation exchange rather than raw data sharing.

Abstract: Data within a specific context gains deeper significance beyond its isolated interpretation. In distributed systems, interdependent data sources reveal hidden relationships and latent structures, representing valuable information for many applications. This paper introduces Osmotic Learning (OSM-L), a self-supervised distributed learning paradigm designed to uncover higher-level latent knowledge from distributed data. The core of OSM-L is osmosis, a process that synthesizes dense and compact representation by extracting contextual information, eliminating the need for raw data exchange between distributed entities. OSM-L iteratively aligns local data representations, enabling information diffusion and convergence into a dynamic equilibrium that captures contextual patterns. During training, it also identifies correlated data groups, functioning as a decentralized clustering mechanism. Experimental results confirm OSM-L’s convergence and representation capabilities on structured datasets, achieving over 0.99 accuracy in local information alignment while preserving contextual integrity.

[572] How Much Data Is Enough? Uniform Convergence Bounds for Generative & Vision-Language Models under Low-Dimensional Structure

Paul M. Thompson

Main category: cs.LG

TL;DR: Finite-sample uniform convergence analysis for generative/VLM-based predictors in biomedical applications, showing when smooth dependence on low-dimensional semantic representations enables uniformly accurate and calibrated predictions with practical sample sizes.

Details

Motivation: Generative and vision-language models are increasingly used in scientific/medical decision support where predictions must be accurate and well-calibrated. However, it's unclear when these predictions generalize uniformly across inputs, classes, or subpopulations rather than just on average - a critical issue in biomedicine where rare conditions and specific groups can have large errors even with low overall loss.

Method: Focus on induced families of classifiers obtained by varying prompts or semantic embeddings within restricted representation spaces. Analyze when model outputs depend smoothly on low-dimensional semantic representations (supported by spectral structure in text and joint image-text embeddings). Apply classical uniform convergence tools to derive finite-sample uniform convergence bounds for accuracy and calibration functionals under Lipschitz stability with respect to prompt embeddings.

Result: Main results provide finite-sample uniform convergence bounds for VLM-induced classifiers. Sample complexity depends on intrinsic/effective dimension rather than ambient embedding dimension. Spectrum-dependent bounds show how eigenvalue decay governs data requirements. Analysis reveals when current dataset sizes can support uniformly reliable predictions and why average calibration metrics may miss worst-case miscalibration.

Conclusion: Under structural assumptions of smooth dependence on low-dimensional semantic representations, generative and VLM-based predictors can achieve uniformly accurate and calibrated behavior with practical sample sizes. This has important implications for data-limited biomedical modeling, particularly for ensuring reliable predictions across rare conditions and specific subpopulations.

Abstract: Modern generative and vision-language models (VLMs) are increasingly used in scientific and medical decision support, where predicted probabilities must be both accurate and well calibrated. Despite strong empirical results with moderate data, it remains unclear when such predictions generalize uniformly across inputs, classes, or subpopulations, rather than only on average-a critical issue in biomedicine, where rare conditions and specific groups can exhibit large errors even when overall loss is low. We study this question from a finite-sample perspective and ask: under what structural assumptions can generative and VLM-based predictors achieve uniformly accurate and calibrated behavior with practical sample sizes? Rather than analyzing arbitrary parameterizations, we focus on induced families of classifiers obtained by varying prompts or semantic embeddings within a restricted representation space. When model outputs depend smoothly on a low-dimensional semantic representation-an assumption supported by spectral structure in text and joint image-text embeddings-classical uniform convergence tools yield meaningful non-asymptotic guarantees. Our main results give finite-sample uniform convergence bounds for accuracy and calibration functionals of VLM-induced classifiers under Lipschitz stability with respect to prompt embeddings. The implied sample complexity depends on intrinsic/effective dimension, not ambient embedding dimension, and we further derive spectrum-dependent bounds that make explicit how eigenvalue decay governs data requirements. We conclude with implications for data-limited biomedical modeling, including when current dataset sizes can support uniformly reliable predictions and why average calibration metrics may miss worst-case miscalibration.

[573] SE-MLP Model for Predicting Prior Acceleration Features in Penetration Signals

Yankang Li, Changsheng Li

Main category: cs.LG

TL;DR: SE-MLP model with channel attention and residual connections enables rapid prediction of penetration acceleration features, outperforming conventional models and providing practical engineering solutions.

Details

Motivation: Traditional methods for obtaining penetration acceleration features require long simulation cycles and expensive computations, creating a need for faster prediction methods for engineering applications.

Method: Proposes SE-MLP (squeeze and excitation multi-layer perceptron) that integrates channel attention mechanism with residual connections to establish nonlinear mapping between physical parameters and penetration characteristics.

Result: SE-MLP achieves superior prediction accuracy, generalization, and stability compared to conventional MLP, XGBoost, and Transformer models, with discrepancies within acceptable engineering tolerances.

Conclusion: The method validates feasibility and engineering applicability for rapidly generating prior feature values for penetration fuzes, with both channel attention and residual structures contributing to performance gains.

Abstract: Accurate identification of the penetration process relies heavily on prior feature values of penetration acceleration. However, these feature values are typically obtained through long simulation cycles and expensive computations. To overcome this limitation, this paper proposes a multi-layer Perceptron architecture, termed squeeze and excitation multi-layer perceptron (SE-MLP), which integrates a channel attention mechanism with residual connections to enable rapid prediction of acceleration feature values. Using physical parameters under different working conditions as inputs, the model outputs layer-wise acceleration features, thereby establishing a nonlinear mapping between physical parameters and penetration characteristics. Comparative experiments against conventional MLP, XGBoost, and Transformer models demonstrate that SE-MLP achieves superior prediction accuracy, generalization, and stability. Ablation studies further confirm that both the channel attention module and residual structure contribute significantly to performance gains. Numerical simulations and range recovery tests show that the discrepancies between predicted and measured acceleration peaks and pulse widths remain within acceptable engineering tolerances. These results validate the feasibility and engineering applicability of the proposed method and provide a practical basis for rapidly generating prior feature values for penetration fuzes.

[574] Principled Algorithms for Optimizing Generalized Metrics in Binary Classification

Anqi Mao, Mehryar Mohri, Yutao Zhong

Main category: cs.LG

TL;DR: The paper introduces METRO, a principled framework for optimizing generalized classification metrics (like Fβ, AM, Jaccard) with theoretical guarantees, addressing limitations of existing threshold-based methods.

Details

Motivation: Standard binary classification metrics are inadequate for imbalanced or cost-sensitive applications. Existing approaches for optimizing generalized metrics lack theoretical guarantees, are not tailored to restricted hypothesis sets, and rely on suboptimal threshold-based methods.

Method: Reformulates metric optimization as generalized cost-sensitive learning, designs novel surrogate loss functions with H-consistency guarantees, and develops the METRO algorithm with theoretical performance guarantees.

Result: Developed principled algorithms with H-consistency and finite-sample generalization bounds. Experimental results demonstrate METRO’s effectiveness compared to prior baselines.

Conclusion: The METRO framework provides a theoretically sound approach to optimizing generalized classification metrics, overcoming limitations of existing methods and offering strong performance guarantees.

Abstract: In applications with significant class imbalance or asymmetric costs, metrics such as the $F_β$-measure, AM measure, Jaccard similarity coefficient, and weighted accuracy offer more suitable evaluation criteria than standard binary classification loss. However, optimizing these metrics present significant computational and statistical challenges. Existing approaches often rely on the characterization of the Bayes-optimal classifier, and use threshold-based methods that first estimate class probabilities and then seek an optimal threshold. This leads to algorithms that are not tailored to restricted hypothesis sets and lack finite-sample performance guarantees. In this work, we introduce principled algorithms for optimizing generalized metrics, supported by $H$-consistency and finite-sample generalization bounds. Our approach reformulates metric optimization as a generalized cost-sensitive learning problem, enabling the design of novel surrogate loss functions with provable $H$-consistency guarantees. Leveraging this framework, we develop new algorithms, METRO (Metric Optimization), with strong theoretical performance guarantees. We report the results of experiments demonstrating the effectiveness of our methods compared to prior baselines.

[575] A Weak Signal Learning Dataset and Its Baseline Method

Xianqi Liu, Xiangru Li, Lefeng He, Ziyu Fang

Main category: cs.LG

TL;DR: The paper introduces the first specialized dataset for weak signal learning (WSL) with 13,158 spectral samples featuring low SNR and extreme class imbalance, and proposes a PDVFN model with dual-view representation that achieves better accuracy and robustness in handling weak signals, noise, and imbalance.

Details

Motivation: Weak signal learning is crucial in many fields (fault diagnosis, medical imaging, autonomous driving) but challenging due to noise masking critical information. Even with strong signals, extracting weak signals is key to improving performance. The lack of dedicated datasets has constrained WSL research.

Method: Constructed first specialized WSL dataset with 13,158 spectral samples featuring low SNR dominance (over 55% samples with SNR below 50) and extreme class imbalance (ratio up to 29:1). Proposed PDVFN model with dual-view representation (vector + time-frequency map) that extracts local sequential features and global frequency-domain structures in parallel, following principles of local enhancement, sequential modeling, noise suppression, multi-scale capture, frequency extraction, and global perception.

Result: PDVFN achieves higher accuracy and robustness in handling weak signals, high noise, and extreme class imbalance, especially in low SNR and imbalanced scenarios. The method shows effectiveness in WSL tasks like astronomical spectroscopy.

Conclusion: This study provides a dedicated dataset, baseline model, and establishes foundation for future WSL research. The multi-source complementarity approach enhances representation for low-SNR and imbalanced data, offering a novel solution for WSL tasks.

Abstract: Weak signal learning (WSL) is a common challenge in many fields like fault diagnosis, medical imaging, and autonomous driving, where critical information is often masked by noise and interference, making feature identification difficult. Even in tasks with abundant strong signals, the key to improving model performance often lies in effectively extracting weak signals. However, the lack of dedicated datasets has long constrained research. To address this, we construct the first specialized dataset for weak signal feature learning, containing 13,158 spectral samples. It features low SNR dominance (over 55% samples with SNR below 50) and extreme class imbalance (class ratio up to 29:1), providing a challenging benchmark for classification and regression in weak signal scenarios. We also propose a dual-view representation (vector + time-frequency map) and a PDVFN model tailored to low SNR, distribution skew, and dual imbalance. PDVFN extracts local sequential features and global frequency-domain structures in parallel, following principles of local enhancement, sequential modeling, noise suppression, multi-scale capture, frequency extraction, and global perception. This multi-source complementarity enhances representation for low-SNR and imbalanced data, offering a novel solution for WSL tasks like astronomical spectroscopy. Experiments show our method achieves higher accuracy and robustness in handling weak signals, high noise, and extreme class imbalance, especially in low SNR and imbalanced scenarios. This study provides a dedicated dataset, a baseline model, and establishes a foundation for future WSL research.

[576] Diffusion-based Decentralized Federated Multi-Task Representation Learning

Donghwa Kang, Shana Moothedath

Main category: cs.LG

TL;DR: Decentralized algorithm for multi-task linear regression with shared low-dimensional representation using projected gradient descent in diffusion-based networks.

Details

Motivation: Representation learning is effective for data-scarce environments but decentralized approaches remain underexplored. Need for algorithms that can learn shared representations across tasks in decentralized/federated settings without central coordination.

Method: Developed a decentralized projected gradient descent-based alternating minimization algorithm for multi-task linear regression. Uses diffusion-based decentralized and federated approach to recover low-rank feature matrix. Combines projected gradient descent with minimization steps.

Result: Provides constructive provable guarantees: lower bound on sample complexity and upper bound on iteration complexity. Algorithm is fast and communication-efficient with validated performance through numerical simulations compared to benchmarks.

Conclusion: Proposed algorithm successfully addresses decentralized multi-task representation learning with theoretical guarantees and practical efficiency, filling a gap in decentralized representation learning research.

Abstract: Representation learning is a widely adopted framework for learning in data-scarce environments to obtain a feature extractor or representation from various different yet related tasks. Despite extensive research on representation learning, decentralized approaches remain relatively underexplored. This work develops a decentralized projected gradient descent-based algorithm for multi-task representation learning. We focus on the problem of multi-task linear regression in which multiple linear regression models share a common, low-dimensional linear representation. We present an alternating projected gradient descent and minimization algorithm for recovering a low-rank feature matrix in a diffusion-based decentralized and federated fashion. We obtain constructive, provable guarantees that provide a lower bound on the required sample complexity and an upper bound on the iteration complexity of our proposed algorithm. We analyze the time and communication complexity of our algorithm and show that it is fast and communication-efficient. We performed numerical simulations to validate the performance of our algorithm and compared it with benchmark algorithms.

[577] Evaluating Parameter Efficient Methods for RLVR

Qingyu Yin, Yulun Wu, Zhennan Shen, Sunbowen Li, Zhilin Wang, Yanshu Li, Chak Tou Leong, Jiale Kang, Jinjin Gu

Main category: cs.LG

TL;DR: PEFT methods evaluation in RLVR shows structural variants (DoRA, AdaLoRA, MiSS) outperform standard LoRA, SVD-based methods fail due to spectral collapse, and extreme parameter reduction harms reasoning.

Details

Motivation: While PEFT methods like LoRA are commonly used in RLVR (Reinforcement Learning with Verifiable Rewards), the optimal PEFT architecture for RLVR remains unknown, requiring systematic evaluation to guide method selection.

Method: Comprehensive evaluation of over 12 PEFT methodologies on DeepSeek-R1-Distill models across mathematical reasoning benchmarks, including structural variants, SVD-informed methods, and extreme parameter reduction techniques.

Result: 1) Structural variants (DoRA, AdaLoRA, MiSS) consistently outperform standard LoRA. 2) SVD-informed methods (PiSSA, MiLoRA) fail due to spectral collapse from misalignment between principal-component updates and RL optimization. 3) Extreme parameter reduction (VeRA, Rank-1) severely bottlenecks reasoning capacity.

Conclusion: The default adoption of standard LoRA for RLVR is suboptimal; structural variants perform better, while SVD-based methods fail and extreme parameter reduction harms performance, calling for more exploration of parameter-efficient RL methods.

Abstract: We systematically evaluate Parameter-Efficient Fine-Tuning (PEFT) methods under the paradigm of Reinforcement Learning with Verifiable Rewards (RLVR). RLVR incentivizes language models to enhance their reasoning capabilities through verifiable feedback; however, while methods like LoRA are commonly used, the optimal PEFT architecture for RLVR remains unidentified. In this work, we conduct the first comprehensive evaluation of over 12 PEFT methodologies across the DeepSeek-R1-Distill families on mathematical reasoning benchmarks. Our empirical results challenge the default adoption of standard LoRA with three main findings. First, we demonstrate that structural variants, such as DoRA, AdaLoRA, and MiSS, consistently outperform LoRA. Second, we uncover a spectral collapse phenomenon in SVD-informed initialization strategies (\textit{e.g.,} PiSSA, MiLoRA), attributing their failure to a fundamental misalignment between principal-component updates and RL optimization. Furthermore, our ablations reveal that extreme parameter reduction (\textit{e.g.,} VeRA, Rank-1) severely bottlenecks reasoning capacity. We further conduct ablation studies and scaling experiments to validate our findings. This work provides a definitive guide for advocating for more exploration for parameter-efficient RL methods.

[578] HELM-BERT: A Transformer for Medium-sized Peptide Property Prediction

Seungeon Lee, Takuto Koyama, Itsuki Maeda, Shigeyuki Matsumoto, Yasushi Okuno

Main category: cs.LG

TL;DR: HELM-BERT is the first encoder-based peptide language model using HELM notation, outperforming SMILES-based models in predicting peptide properties like membrane permeability and protein interactions.

Details

Motivation: Existing molecular language models (SMILES or amino-acid representations) fail to capture the chemical complexity and topological diversity of therapeutic peptides, especially cyclic structures and chemical modifications.

Method: Developed HELM-BERT based on DeBERTa architecture, trained on 39,079 chemically diverse peptides using HELM notation, which explicitly encodes monomer composition and connectivity in a hierarchical framework.

Result: HELM-BERT significantly outperforms state-of-the-art SMILES-based models in downstream tasks including cyclic peptide membrane permeability prediction and peptide-protein interaction prediction.

Conclusion: HELM notation provides substantial data-efficiency advantages for modeling therapeutic peptides, bridging the gap between small-molecule and protein language models with its explicit monomer- and topology-aware representations.

Abstract: Therapeutic peptides have emerged as a pivotal modality in modern drug discovery, occupying a chemically and topologically rich space. While accurate prediction of their physicochemical properties is essential for accelerating peptide development, existing molecular language models rely on representations that fail to capture this complexity. Atom-level SMILES notation generates long token sequences and obscures cyclic topology, whereas amino-acid-level representations cannot encode the diverse chemical modifications central to modern peptide design. To bridge this representational gap, the Hierarchical Editing Language for Macromolecules (HELM) offers a unified framework enabling precise description of both monomer composition and connectivity, making it a promising foundation for peptide language modeling. Here, we propose HELM-BERT, the first encoder-based peptide language model trained on HELM notation. Based on DeBERTa, HELM-BERT is specifically designed to capture hierarchical dependencies within HELM sequences. The model is pre-trained on a curated corpus of 39,079 chemically diverse peptides spanning linear and cyclic structures. HELM-BERT significantly outperforms state-of-the-art SMILES-based language models in downstream tasks, including cyclic peptide membrane permeability prediction and peptide-protein interaction prediction. These results demonstrate that HELM’s explicit monomer- and topology-aware representations offer substantial data-efficiency advantages for modeling therapeutic peptides, bridging a long-standing gap between small-molecule and protein language models.

[579] Machine Learning-Assisted Vocal Cord Ultrasound Examination: Project VIPR

Will Sebelik-Lassiter, Evan Schubert, Muhammad Alliyu, Quentin Robbins, Excel Olatunji, Mustafa Barry

Main category: cs.LG

TL;DR: Machine learning algorithm for vocal cord ultrasound analysis achieves 96% segmentation accuracy and 99% classification accuracy for vocal cord paralysis detection.

Details

Motivation: Vocal cord ultrasound is less invasive but operator-dependent; need for automated analysis to improve diagnostic accuracy.

Method: Used VCUS videos from 30 volunteers, split into frames, cropped uniformly. Trained segmentation and classification models (VIPRnet) on healthy and simulated VCP images.

Result: Segmentation model: 96% validation accuracy. Classification model (VIPRnet): 99% validation accuracy for distinguishing normal vs VCP.

Conclusion: Machine learning-assisted VCUS analysis shows great promise for improving diagnostic accuracy over operator-dependent human interpretation.

Abstract: Intro: Vocal cord ultrasound (VCUS) has emerged as a less invasive and better tolerated examination technique, but its accuracy is operator dependent. This research aims to apply a machine learning-assisted algorithm to automatically identify the vocal cords and distinguish normal vocal cord images from vocal cord paralysis (VCP). Methods: VCUS videos were acquired from 30 volunteers, which were split into still frames and cropped to a uniform size. Healthy and simulated VCP images were used as training data for vocal cord segmentation and VCP classification models. Results: The vocal cord segmentation model achieved a validation accuracy of 96%, while the best classification model (VIPRnet) achieved a validation accuracy of 99%. Conclusion: Machine learning-assisted analysis of VCUS shows great promise in improving diagnostic accuracy over operator-dependent human interpretation.

[580] A Simple, Optimal and Efficient Algorithm for Online Exp-Concave Optimization

Yi-Han Wang, Peng Zhao, Zhi-Hua Zhou

Main category: cs.LG

TL;DR: LightONS reduces computational cost of Online Newton Step from O(d^ωT) to O(d²T + d^ω√T) while maintaining optimal O(d log T) regret, solving a COLT'13 open problem for stochastic exp-concave optimization.

Details

Motivation: The Online Newton Step (ONS) algorithm for online exp-concave optimization has computational bottleneck due to expensive Mahalanobis projections costing Ω(d^ω) per round, leading to total runtime of O(d^ωT). For stochastic optimization, this translates to O(d^{ω+1}/ε) runtime to achieve ε excess risk, which was an open problem posed at COLT'13.

Method: LightONS is a simple variant of ONS that introduces a hysteresis mechanism using domain-conversion techniques from parameter-free online learning. It delays expensive Mahalanobis projections until necessary, reducing the frequency of these costly operations while preserving the algorithm structure.

Result: LightONS achieves O(d²T + d^ω√(T log T)) total runtime while maintaining the optimal O(d log T) regret. For stochastic exp-concave optimization, this yields runtime of O(d³/ε), solving the COLT'13 open problem.

Conclusion: LightONS provides an efficient plug-in replacement for ONS that preserves computational efficiency while maintaining statistical optimality, with applications beyond regret minimization including gradient-norm adaptive regret, parametric stochastic bandits, and memory-efficient online learning.

Abstract: Online eXp-concave Optimization (OXO) is a fundamental problem in online learning. The standard algorithm, Online Newton Step (ONS), balances statistical optimality and computational practicality, guaranteeing an optimal regret of $O(d \log T)$, where $d$ is the dimension and $T$ is the time horizon. ONS faces a computational bottleneck due to the Mahalanobis projections at each round. This step costs $Ω(d^ω)$ arithmetic operations for bounded domains, even for the unit ball, where $ω\in (2,3]$ is the matrix-multiplication exponent. As a result, the total runtime can reach $\tilde{O}(d^ωT)$, particularly when iterates frequently oscillate near the domain boundary. For Stochastic eXp-concave Optimization (SXO), computational cost is also a challenge. Deploying ONS with online-to-batch conversion for SXO requires $T = \tilde{O}(d/ε)$ rounds to achieve an excess risk of $ε$, and thereby necessitates an $\tilde{O}(d^{ω+1}/ε)$ runtime. A COLT'13 open problem posed by Koren [2013] asks for an SXO algorithm with runtime less than $\tilde{O}(d^{ω+1}/ε)$. This paper proposes a simple variant of ONS, LightONS, which reduces the total runtime to $O(d^2 T + d^ω\sqrt{T \log T})$ while preserving the optimal $O(d \log T)$ regret. LightONS implies an SXO method with runtime $\tilde{O}(d^3/ε)$, thereby answering the open problem. Importantly, LightONS preserves the elegant structure of ONS by leveraging domain-conversion techniques from parameter-free online learning to introduce a hysteresis mechanism that delays expensive Mahalanobis projections until necessary. This design enables LightONS to serve as an efficient plug-in replacement of ONS in broader scenarios, even beyond regret minimization, including gradient-norm adaptive regret, parametric stochastic bandits, and memory-efficient online learning.

[581] PGOT: A Physics-Geometry Operator Transformer for Complex PDEs

Zhuo Zhang, Xi Yang, Yuan Zhao, Canqun Yang

Main category: cs.LG

TL;DR: PGOT is a novel Transformer architecture for PDE modeling that addresses geometric aliasing in unstructured meshes by incorporating explicit geometry awareness and adaptive computation routing.

Details

Motivation: Transformers show promise for PDE modeling but struggle with large-scale unstructured meshes and complex geometries. Existing efficient architectures use feature dimensionality reduction that causes Geometric Aliasing, losing critical physical boundary information.

Method: Proposes Physics-Geometry Operator Transformer (PGOT) with Spectrum-Preserving Geometric Attention (SpecGeo-Attention). Uses “physics slicing-geometry injection” mechanism to incorporate multi-scale geometric encodings while maintaining linear complexity O(N). Dynamically routes computations to low-order linear paths for smooth regions and high-order non-linear paths for shock waves/discontinuities based on spatial coordinates.

Result: PGOT achieves consistent state-of-the-art performance across four standard benchmarks and excels in large-scale industrial tasks including airfoil and car designs.

Conclusion: PGOT successfully addresses geometric aliasing in PDE modeling by explicitly incorporating geometry awareness and enabling spatially adaptive physical field modeling, making it effective for both academic benchmarks and real-world industrial applications.

Abstract: While Transformers have demonstrated remarkable potential in modeling Partial Differential Equations (PDEs), modeling large-scale unstructured meshes with complex geometries remains a significant challenge. Existing efficient architectures often employ feature dimensionality reduction strategies, which inadvertently induces Geometric Aliasing, resulting in the loss of critical physical boundary information. To address this, we propose the Physics-Geometry Operator Transformer (PGOT), designed to reconstruct physical feature learning through explicit geometry awareness. Specifically, we propose Spectrum-Preserving Geometric Attention (SpecGeo-Attention). Utilizing a ``physics slicing-geometry injection” mechanism, this module incorporates multi-scale geometric encodings to explicitly preserve multi-scale geometric features while maintaining linear computational complexity $O(N)$. Furthermore, PGOT dynamically routes computations to low-order linear paths for smooth regions and high-order non-linear paths for shock waves and discontinuities based on spatial coordinates, enabling spatially adaptive and high-precision physical field modeling. PGOT achieves consistent state-of-the-art performance across four standard benchmarks and excels in large-scale industrial tasks including airfoil and car designs.

[582] Energy and Memory-Efficient Federated Learning With Ordered Layer Freezing

Ziru Niu, Hai Dong, A. K. Qin, Tao Gu, Pengcheng Zhang

Main category: cs.LG

TL;DR: FedOLF introduces ordered layer freezing and tensor operation approximation to improve federated learning efficiency on IoT devices while maintaining accuracy.

Details

Motivation: Federated Learning faces challenges with IoT edge devices' limited computational power, memory, and bandwidth. Existing approaches like dropout or layer freezing often sacrifice accuracy or neglect memory constraints.

Method: FedOLF uses ordered layer freezing (consistently freezing layers in predefined order before training) and Tensor Operation Approximation (lightweight alternative to quantization) to reduce computation, memory, communication, and energy costs.

Result: FedOLF achieves higher accuracy than existing works on multiple datasets: 0.3% on EMNIST, 6.4% on CIFAR-10, 5.81% and 4.4% on CIFAR-100, 6.27% and 1.29% on CINIC-10, along with better energy efficiency and lower memory footprint.

Conclusion: FedOLF effectively addresses FL challenges on IoT devices by reducing resource requirements while maintaining or improving accuracy compared to existing methods.

Abstract: Federated Learning (FL) has emerged as a privacy-preserving paradigm for training machine learning models across distributed edge devices in the Internet of Things (IoT). By keeping data local and coordinating model training through a central server, FL effectively addresses privacy concerns and reduces communication overhead. However, the limited computational power, memory, and bandwidth of IoT edge devices pose significant challenges to the efficiency and scalability of FL, especially when training deep neural networks. Various FL frameworks have been proposed to reduce computation and communication overheads through dropout or layer freezing. However, these approaches often sacrifice accuracy or neglect memory constraints. To this end, in this work, we introduce Federated Learning with Ordered Layer Freezing (FedOLF). FedOLF consistently freezes layers in a predefined order before training, significantly mitigating computation and memory requirements. To further reduce communication and energy costs, we incorporate Tensor Operation Approximation (TOA), a lightweight alternative to conventional quantization that better preserves model accuracy. Experimental results demonstrate that over non-iid data, FedOLF achieves at least 0.3%, 6.4%, 5.81%, 4.4%, 6.27% and 1.29% higher accuracy than existing works respectively on EMNIST (with CNN), CIFAR-10 (with AlexNet), CIFAR-100 (with ResNet20 and ResNet44), and CINIC-10 (with ResNet20 and ResNet44), along with higher energy efficiency and lower memory footprint.

[583] FairGFL: Privacy-Preserving Fairness-Aware Federated Learning with Overlapping Subgraphs

Zihao Zhou, Shusen Yang, Fangyuan Zhao, Xuebin Ren

Main category: cs.LG

TL;DR: FairGFL addresses unfairness in graph federated learning caused by imbalanced overlapping subgraphs across clients, improving both fairness and model utility through privacy-preserving weighted aggregation and regularization.

Details

Motivation: Graph federated learning faces unfairness issues when overlapping subgraphs are imbalanced across clients. While previous research showed benefits of overlapping data for mitigating heterogeneity, the negative effects of imbalanced overlaps on fairness have not been explored.

Method: Proposes FairGFL with: 1) Interpretable weighted aggregation using privacy-preserving estimation of overlapping ratios to enhance fairness; 2) A carefully crafted regularizer integrated into federated composite loss function to improve tradeoff between model utility and fairness.

Result: Extensive experiments on four benchmark graph datasets show FairGFL outperforms four representative baseline algorithms in both model utility and fairness metrics.

Conclusion: FairGFL successfully addresses the unfairness issue in graph federated learning arising from imbalanced overlapping subgraphs, providing a privacy-preserving solution that enhances cross-client fairness while maintaining model performance.

Abstract: Graph federated learning enables the collaborative extraction of high-order information from distributed subgraphs while preserving the privacy of raw data. However, graph data often exhibits overlap among different clients. Previous research has demonstrated certain benefits of overlapping data in mitigating data heterogeneity. However, the negative effects have not been explored, particularly in cases where the overlaps are imbalanced across clients. In this paper, we uncover the unfairness issue arising from imbalanced overlapping subgraphs through both empirical observations and theoretical reasoning. To address this issue, we propose FairGFL (FAIRness-aware subGraph Federated Learning), a novel algorithm that enhances cross-client fairness while maintaining model utility in a privacy-preserving manner. Specifically, FairGFL incorporates an interpretable weighted aggregation approach to enhance fairness across clients, leveraging privacy-preserving estimation of their overlapping ratios. Furthermore, FairGFL improves the tradeoff between model utility and fairness by integrating a carefully crafted regularizer into the federated composite loss function. Through extensive experiments on four benchmark graph datasets, we demonstrate that FairGFL outperforms four representative baseline algorithms in terms of both model utility and fairness.

[584] PFed-Signal: An ADR Prediction Model based on Federated Learning

Tao Li, Peilin Li, Kui Lu, Yilei Wang, Junliang Shang, Guangshun Li, Huiyu Zhou

Main category: cs.LG

TL;DR: PFed-signal is a federated learning-based model that uses Euclidean distance to eliminate biased data from FAERS for more accurate adverse drug reaction prediction.

Details

Motivation: Traditional ADR prediction methods using FAERS data are biased due to skewed reporting patterns, and statistical methods like ROR/PRR cannot eliminate this bias, leading to inaccurate signal predictions that could misdiagnose patients.

Method: Two-stage approach: 1) PFed-Split divides dataset by ADR type, 2) ADR-signal includes biased data identification using Euclidean distance in federated learning framework, then trains Transformer-based prediction model on cleaned data.

Result: Improved ROR/PRR metrics on cleaned dataset; PFed-signal achieves 0.887 accuracy, 0.890 F1, 0.913 recall, and 0.957 AUC, outperforming baseline methods.

Conclusion: PFed-signal effectively addresses bias in FAERS data through federated learning and Euclidean distance filtering, significantly improving ADR prediction accuracy over traditional statistical methods.

Abstract: The adverse drug reactions (ADRs) predicted based on the biased records in FAERS (U.S. Food and Drug Administration Adverse Event Reporting System) may mislead diagnosis online. Generally, such problems are solved by optimizing reporting odds ratio (ROR) or proportional reporting ratio (PRR). However, these methods that rely on statistical methods cannot eliminate the biased data, leading to inaccurate signal prediction. In this paper, we propose PFed-signal, a federated learning-based signal prediction model of ADR, which utilizes the Euclidean distance to eliminate the biased data from FAERS, thereby improving the accuracy of ADR prediction. Specifically, we first propose Pfed-Split, a method to split the original dataset into a split dataset based on ADR. Then we propose ADR-signal, an ADR prediction model, including a biased data identification method based on federated learning and an ADR prediction model based on Transformer. The former identifies the biased data according to the Euclidean distance and generates a clean dataset by deleting the biased data. The latter is an ADR prediction model based on Transformer trained on the clean data set. The results show that the ROR and PRR on the clean dataset are better than those of the traditional methods. Furthermore, the accuracy rate, F1 score, recall rate and AUC of PFed-Signal are 0.887, 0.890, 0.913 and 0.957 respectively, which are higher than the baselines.

[585] Splitwise: Collaborative Edge-Cloud Inference for LLMs via Lyapunov-Assisted DRL

Abolfazl Younesi, Abbas Shabrang Maryan, Elyas Oustad, Zahra Najafabadi Samani, Mohsen Ansari, Thomas Fahringer

Main category: cs.LG

TL;DR: Splitwise is a Lyapunov-assisted DRL framework for adaptive fine-grained partitioning of LLMs across edge and cloud, reducing latency by 1.4x-2.8x and energy by up to 41% while maintaining accuracy.

Details

Motivation: LLM deployment on edge devices faces memory/power limitations, cloud-only inference has high latency/cost, and static partitions struggle with bandwidth fluctuations.

Method: Decomposes transformer layers into attention heads and feed-forward sub-blocks for fine-grained partitioning. Uses hierarchical DRL policy guided by Lyapunov optimization to jointly optimize latency, energy, and accuracy while guaranteeing queue stability. Includes robustness via partition checkpoints with exponential backoff recovery.

Result: Reduces end-to-end latency by 1.4x-2.8x, cuts energy consumption by up to 41%, lowers 95th-percentile latency by 53-61% vs cloud-only, while maintaining accuracy with modest memory requirements.

Conclusion: Splitwise enables efficient LLM deployment on resource-constrained edge devices through adaptive fine-grained partitioning that dynamically responds to network conditions and workload variations.

Abstract: Deploying large language models (LLMs) on edge devices is challenging due to their limited memory and power resources. Cloud-only inference reduces device burden but introduces high latency and cost. Static edge-cloud partitions optimize a single metric and struggle when bandwidth fluctuates. We propose Splitwise, a novel Lyapunov-assisted deep reinforcement learning (DRL) framework for fine-grained, adaptive partitioning of LLMs across edge and cloud environments. Splitwise decomposes transformer layers into attention heads and feed-forward sub-blocks, exposing more partition choices than layer-wise schemes. A hierarchical DRL policy, guided by Lyapunov optimization, jointly minimizes latency, energy consumption, and accuracy degradation while guaranteeing queue stability under stochastic workloads and variable network bandwidth. Splitwise also guarantees robustness via partition checkpoints with exponential backoff recovery in case of communication failures. Experiments on Jetson Orin NX, Galaxy S23, and Raspberry Pi 5 with GPT-2 (1.5B), LLaMA-7B, and LLaMA-13B show that Splitwise reduces end-to-end latency by 1.4x-2.8x and cuts energy consumption by up to 41% compared with existing partitioners. It lowers the 95th-percentile latency by 53-61% relative to cloud-only execution, while maintaining accuracy and modest memory requirements.

[586] On the Inverse Flow Matching Problem in the One-Dimensional and Gaussian Cases

Alexander Korotin, Gudmund Pammer

Main category: cs.LG

TL;DR: This paper studies the inverse problem of flow matching between distributions with finite exponential moment, establishing uniqueness in 1D and Gaussian cases, while leaving the general multidimensional problem open.

Details

Motivation: The research is motivated by generative AI applications, particularly the distillation of flow matching models, which requires understanding the inverse problem of flow matching between distributions.

Method: The paper studies the inverse problem of flow matching between distributions with finite exponential moment, establishing uniqueness results through mathematical analysis.

Result: Uniqueness of the solution is proven in two cases: the one-dimensional setting and the Gaussian case, providing theoretical foundations for these specific scenarios.

Conclusion: The general multidimensional problem remains open for future studies, highlighting the need for further research beyond the established 1D and Gaussian cases.

Abstract: This paper studies the inverse problem of flow matching (FM) between distributions with finite exponential moment, a problem motivated by modern generative AI applications such as the distillation of flow matching models. Uniqueness of the solution is established in two cases - the one-dimensional setting and the Gaussian case. The general multidimensional problem remains open for future studies.

[587] Spectral Analysis of Hard-Constraint PINNs: The Spatial Modulation Mechanism of Boundary Functions

Yuchen Xie, Honghang Chi, Haopeng Quan, Yahui Wang, Wei Wang, Yu Ma

Main category: cs.LG

TL;DR: HC-PINNs use hard constraints to enforce boundary conditions via trial functions, but their training dynamics were poorly understood. This work develops an NTK framework showing boundary functions act as spectral filters, with effective rank predicting convergence better than condition numbers.

Details

Motivation: While HC-PINNs are increasingly popular for strictly enforcing boundary conditions through trial functions, the theoretical understanding of their training dynamics has been lacking. Unlike soft-constrained PINNs where boundary terms are additive penalties, the multiplicative nature of hard constraints fundamentally changes the learning landscape, necessitating a rigorous theoretical analysis.

Method: Established a rigorous Neural Tangent Kernel (NTK) framework for HC-PINNs, deriving explicit kernel composition laws. Conducted spectral analysis to show how boundary functions reshape the eigenspectrum of the neural network’s native kernel. Identified effective rank of the residual kernel as a key metric for predicting training convergence.

Result: Boundary functions act as spectral filters that can inadvertently induce spectral collapse, leading to optimization stagnation despite exact boundary satisfaction. Effective rank of the residual kernel serves as a deterministic predictor of training convergence, superior to classical condition numbers. Validated findings across multi-dimensional benchmarks.

Conclusion: The framework transforms boundary function design from a heuristic choice into a principled spectral optimization problem, providing solid theoretical foundation for geometric hard constraints in scientific machine learning. This enables systematic design of boundary functions that avoid spectral collapse and ensure efficient training convergence.

Abstract: Physics-Informed Neural Networks with hard constraints (HC-PINNs) are increasingly favored for their ability to strictly enforce boundary conditions via a trial function ansatz $\tilde{u} = A + B \cdot N$, yet the theoretical mechanisms governing their training dynamics have remained unexplored. Unlike soft-constrained formulations where boundary terms act as additive penalties, this work reveals that the boundary function $B$ introduces a multiplicative spatial modulation that fundamentally alters the learning landscape. A rigorous Neural Tangent Kernel (NTK) framework for HC-PINNs is established, deriving the explicit kernel composition law. This relationship demonstrates that the boundary function $B(\vec{x})$ functions as a spectral filter, reshaping the eigenspectrum of the neural network’s native kernel. Through spectral analysis, the effective rank of the residual kernel is identified as a deterministic predictor of training convergence, superior to classical condition numbers. It is shown that widely used boundary functions can inadvertently induce spectral collapse, leading to optimization stagnation despite exact boundary satisfaction. Validated across multi-dimensional benchmarks, this framework transforms the design of boundary functions from a heuristic choice into a principled spectral optimization problem, providing a solid theoretical foundation for geometric hard constraints in scientific machine learning.

[588] ECG-RAMBA: Zero-Shot ECG Generalization by Morphology-Rhythm Disentanglement and Long-Range Modeling

Hai Duong Nguyen, Xuan-The Tran

Main category: cs.LG

TL;DR: ECG-RAMBA separates ECG morphology and rhythm features, then fuses them with bi-directional Mamba for robust cross-dataset ECG classification, achieving strong zero-shot transfer performance.

Details

Motivation: Deep learning for ECG classification struggles with generalization across heterogeneous acquisition settings due to entanglement of morphological and rhythm patterns, leading to shortcut learning and sensitivity to distribution shifts.

Method: ECG-RAMBA separates morphology (MiniRocket features) and rhythm (HRV descriptors), then fuses them via bi-directional Mamba backbone with Power Mean pooling (Q=3) for transient abnormality detection.

Result: Achieves macro ROC-AUC ≈0.85 on Chapman-Shaoxing, PR-AUC=0.708 for atrial fibrillation detection on CPSC-2021 in zero-shot transfer, and consistent performance on PTB-XL, outperforming raw-signal Mamba baseline.

Conclusion: Separating morphology and rhythm with explicit modeling and long-range context is critical for cross-domain robustness in ECG classification, with deterministic morphology providing a strong foundation.

Abstract: Deep learning has achieved strong performance for electrocardiogram (ECG) classification within individual datasets, yet dependable generalization across heterogeneous acquisition settings remains a major obstacle to clinical deployment and longitudinal monitoring. A key limitation of many model architectures is the implicit entanglement of morphological waveform patterns and rhythm dynamics, which can promote shortcut learning and amplify sensitivity to distribution shifts. We propose ECG-RAMBA, a framework that separates morphology and rhythm and then re-integrates them through context-aware fusion. ECG-RAMBA combines: (i) deterministic morphological features extracted by MiniRocket, (ii) global rhythm descriptors computed from heart-rate variability (HRV), and (iii) long-range contextual modeling via a bi-directional Mamba backbone. To improve sensitivity to transient abnormalities under windowed inference, we introduce a numerically stable Power Mean pooling operator ($Q=3$) that emphasizes high-evidence segments while avoiding the brittleness of max pooling and the dilution of averaging. We evaluate under a protocol-faithful setting with subject-level cross-validation, a fixed decision threshold, and no test-time adaptation. On the Chapman–Shaoxing dataset, ECG-RAMBA achieves a macro ROC-AUC $\approx 0.85$. In zero-shot transfer, it attains PR-AUC $=0.708$ for atrial fibrillation detection on the external CPSC-2021 dataset, substantially outperforming a comparable raw-signal Mamba baseline, and shows consistent cross-dataset performance on PTB-XL. Ablation studies indicate that deterministic morphology provides a strong foundation, while explicit rhythm modeling and long-range context are critical drivers of cross-domain robustness.

[589] Deep learning for pedestrians: backpropagation in Transformers

Laurent Boué

Main category: cs.LG

TL;DR: The paper provides a vectorized derivation of backpropagation for transformer architectures using an index-free methodology, covering layers like embeddings, multi-headed self-attention, layer normalization, and LoRA, with a complete PyTorch implementation.

Details

Motivation: To gain deeper intuition about how operations influence final outputs in transformer architectures by manually working through backward passes, despite the availability of automatic differentiation tools. This helps identify gaps in understanding of forward value propagation.

Method: Applies a lightweight index-free methodology to derive backpropagation for transformer layers including embeddings, multi-headed self-attention, layer normalization, and LoRA layers. Provides analytical gradient expressions and a complete PyTorch implementation of a minimalistic GPT-like network.

Result: Provides complete analytical expressions for gradient updates in transformer architectures, including parameter-efficient fine-tuning with LoRA layers, along with a working PyTorch implementation that demonstrates the derivations.

Conclusion: Manual derivation of backpropagation for transformers provides valuable intuition about how operations affect gradients, complementing automatic differentiation tools. The index-free methodology successfully handles complex transformer components and enables better understanding of gradient flow in next-token-prediction architectures.

Abstract: This document is a follow-up to our previous paper dedicated to a vectorized derivation of backpropagation in CNNs. Following the same principles and notations already put in place there, we now focus on transformer-based next-token-prediction architectures. To this end, we apply our lightweight index-free methodology to new types of layers such as embedding, multi-headed self-attention and layer normalization. In addition, we also provide gradient expressions for LoRA layers to illustrate parameter-efficient fine-tuning. Why bother doing manual backpropagation when there are so many tools that do this automatically? Any gap in understanding of how values propagate forward will become evident when attempting to differentiate the loss function. By working through the backward pass manually, we gain a deeper intuition for how each operation influences the final output. A complete PyTorch implementation of a minimalistic GPT-like network is also provided along with analytical expressions for of all of its gradient updates.

[590] DE$^3$-BERT: Distance-Enhanced Early Exiting for BERT based on Prototypical Networks

Jianing He, Qi Zhang, Weiping Ding, Duoqian Miao, Jun Zhao, Liang Hu, Longbing Cao

Main category: cs.LG

TL;DR: DE³-BERT: Distance-Enhanced Early Exiting framework for BERT that combines local entropy and global distance metrics to improve exiting decisions, achieving better performance-efficiency trade-offs.

Details

Motivation: Existing early exiting methods only use local information from individual test samples, ignoring valuable global information from the sample population, leading to suboptimal exiting decisions and erroneous predictions.

Method: Proposes DE³-BERT framework that uses prototypical networks to learn class prototypes and measure distance between samples and prototypes. Implements hybrid exiting strategy combining classic entropy-based local information with distance-based global information.

Result: Extensive experiments on GLUE benchmark show DE³-BERT consistently outperforms state-of-the-art models under different speed-up ratios with minimal overhead, achieving better trade-off between performance and inference efficiency.

Conclusion: Effectively combining local and global information enables more reliable early exiting decisions. The method demonstrates generality and interpretability while maintaining computational efficiency.

Abstract: Early exiting has demonstrated its effectiveness in accelerating the inference of pre-trained language models like BERT by dynamically adjusting the number of layers executed. However, most existing early exiting methods only consider local information from an individual test sample to determine their exiting indicators, failing to leverage the global information offered by sample population. This leads to suboptimal estimation of prediction correctness, resulting in erroneous exiting decisions. To bridge the gap, we explore the necessity of effectively combining both local and global information to ensure reliable early exiting during inference. Purposefully, we leverage prototypical networks to learn class prototypes and devise a distance metric between samples and class prototypes. This enables us to utilize global information for estimating the correctness of early predictions. On this basis, we propose a novel Distance-Enhanced Early Exiting framework for BERT (DE$^3$-BERT). DE$^3$-BERT implements a hybrid exiting strategy that supplements classic entropy-based local information with distance-based global information to enhance the estimation of prediction correctness for more reliable early exiting decisions. Extensive experiments on the GLUE benchmark demonstrate that DE$^3$-BERT consistently outperforms state-of-the-art models under different speed-up ratios with minimal storage or computational overhead, yielding a better trade-off between model performance and inference efficiency. Additionally, an in-depth analysis further validates the generality and interpretability of our method.

[591] ISOPO: Proximal policy gradients without pi-old

Nilin Abrahamsen

Main category: cs.LG

TL;DR: ISOPO is a single-gradient-step method that approximates natural policy gradient by normalizing log-probability gradients in Fisher metric before contracting with advantages, with negligible computational overhead compared to REINFORCE.

Details

Motivation: Existing proximal policy methods like GRPO or CISPO require multiple gradient steps with importance ratio clipping to approximate natural gradient steps, which is computationally expensive. ISOPO aims to provide an efficient single-step approximation.

Method: ISOPO normalizes the log-probability gradient of each sequence in the Fisher metric before contracting with advantages. A variant transforms microbatch advantages based on the neural tangent kernel in each layer, applied layer-wise in a single backward pass.

Result: ISOPO achieves efficient approximation of natural policy gradient in a single gradient step with negligible computational overhead compared to vanilla REINFORCE.

Conclusion: ISOPO provides a computationally efficient alternative to existing proximal policy methods by enabling single-step natural gradient approximation through Fisher metric normalization and neural tangent kernel transformations.

Abstract: This note introduces Isometric Policy Optimization (ISOPO), an efficient method to approximate the natural policy gradient in a single gradient step. In comparison, existing proximal policy methods such as GRPO or CISPO use multiple gradient steps with variants of importance ratio clipping to approximate a natural gradient step relative to a reference policy. In its simplest form, ISOPO normalizes the log-probability gradient of each sequence in the Fisher metric before contracting with the advantages. Another variant of ISOPO transforms the microbatch advantages based on the neural tangent kernel in each layer. ISOPO applies this transformation layer-wise in a single backward pass and can be implemented with negligible computational overhead compared to vanilla REINFORCE.

[592] Post-Training Quantization of OpenPangu Models for Efficient Deployment on Atlas A2

Yilun Luo, HuaQing Zheng, Haoqian Meng, Wenyuan Liu, Peng Zhang

Main category: cs.LG

TL;DR: Low-bit quantization (INT8/W4A8) enables efficient deployment of Huawei’s openPangu-Embedded models with CoT reasoning on Ascend NPUs, reducing memory/latency overhead while preserving accuracy.

Details

Motivation: The Chain-of-Thought (CoT) reasoning modes in openPangu-Embedded models generate extended reasoning traces that cause substantial memory and latency overheads, making practical deployment on Ascend NPUs challenging.

Method: Introduce a unified low-bit inference framework supporting INT8 (W8A8) and W4A8 quantization, specifically optimized for openPangu-Embedded models on Atlas A2 NPUs, transforming FP16 computations into efficient integer arithmetic.

Result: INT8 quantization preserves over 90% of FP16 baseline accuracy with 1.5x prefill speedup on Atlas A2; W4A8 quantization significantly reduces memory consumption with moderate accuracy trade-off, evaluated across all three CoT modes on code generation benchmarks (HumanEval and MBPP).

Conclusion: Low-bit quantization effectively facilitates efficient CoT reasoning on Ascend NPUs while maintaining high model fidelity, addressing computational constraints for practical deployment.

Abstract: Huawei’s openPangu-Embedded-1B and openPangu-Embedded-7B, variants of the openPangu large language model, integrate three distinct Chain-of-Thought (CoT) reasoning paradigms, namely slow_think, auto_think, and no_think. While these CoT modes enhance reasoning capabilities, their generation of extended reasoning traces introduces substantial memory and latency overheads, posing challenges for practical deployment on Ascend NPUs. This paper addresses these computational constraints by leveraging low-bit quantization, which transforms FP16 computations into more efficient integer arithmetic. We introduce a unified low-bit inference framework, supporting INT8 (W8A8) and W4A8 quantization, specifically optimized for openPangu-Embedded models on the Atlas A2. Our comprehensive evaluation, conducted across all three CoT modes on code generation benchmarks (HumanEval and MBPP), demonstrates the efficacy of this approach. INT8 quantization consistently preserves over 90% of the FP16 baseline accuracy and achieves a 1.5x prefill speedup on the Atlas A2. Furthermore, W4A8 quantization significantly reduces memory consumption, albeit with a moderate trade-off in accuracy. These findings collectively indicate that low-bit quantization effectively facilitates efficient CoT reasoning on Ascend NPUs, maintaining high model fidelity.

[593] Diffusion priors enhanced velocity model building from time-lag images using a neural operator

Xiao Ma, Mohammad Hasyim Taufik, Tariq Alkhalifah

Main category: cs.LG

TL;DR: Proposes a novel framework combining generative models with neural operators for efficient high-resolution velocity model building, using neural operators as forward mapping operators and generative models as regularizers.

Details

Motivation: Conventional velocity model building methods are computationally expensive and time-consuming. Deep learning approaches, particularly generative models and neural operators, offer potential to address these limitations by integrating data statistics for more efficient subsurface imaging.

Method: Combines generative models with neural operators: 1) Neural operator acts as forward mapping operator to rapidly generate time lag RTM extended images from true and migration velocity models, 2) Uses automatic differentiation to gradually update migration velocity to match observed data, 3) Embeds generative model trained on high-resolution velocity model distribution as regularizer for cleaner predictions.

Result: Both synthetic and field data experiments demonstrate the effectiveness of the proposed generative neural operator based velocity model building approach, producing cleaner predictions with higher resolution information.

Conclusion: The proposed framework successfully integrates generative models with neural operators to achieve efficient high-resolution velocity model building, addressing computational limitations of traditional methods while maintaining accuracy.

Abstract: Velocity model building serves as a crucial component for achieving high precision subsurface imaging. However, conventional velocity model building methods are often computationally expensive and time consuming. In recent years, with the rapid advancement of deep learning, particularly the success of generative models and neural operators, deep learning based approaches that integrate data and their statistics have attracted increasing attention in addressing the limitations of traditional methods. In this study, we propose a novel framework that combines generative models with neural operators to obtain high resolution velocity models efficiently. Within this workflow, the neural operator functions as a forward mapping operator to rapidly generate time lag reverse time migration (RTM) extended images from the true and migration velocity models. In this framework, the neural operator is acting as a surrogate for modeling followed by migration, which uses the true and migration velocities, respectively. The trained neural operator is then employed, through automatic differentiation, to gradually update the migration velocity placed in the true velocity input channel with high resolution components so that the output of the network matches the time lag images of observed data obtained using the migration velocity. By embedding a generative model, trained on a high-resolution velocity model distribution, which corresponds to the true velocity model distribution used to train the neural operator, as a regularizer, the resulting predictions are cleaner with higher resolution information. Both synthetic and field data experiments demonstrate the effectiveness of the proposed generative neural operator based velocity model building approach.

[594] AdvPrefix: An Objective for Nuanced LLM Jailbreaks

Sicheng Zhu, Brandon Amos, Yuandong Tian, Chuan Guo, Ivan Evtimov

Main category: cs.LG

TL;DR: AdvPrefix introduces a plug-and-play prefix-forcing objective that selects model-dependent prefixes to improve jailbreak attacks on LLMs, overcoming limitations of the common “Sure, here is” approach.

Details

Motivation: Current jailbreak attacks rely on the common "Sure, here is (harmful request)" prefix, which has two key limitations: limited control over model behaviors (yielding incomplete/unrealistic responses) and rigid format that hinders optimization.

Method: AdvPrefix selects one or more model-dependent prefixes by combining two criteria: high prefilling attack success rates and low negative log-likelihood. It integrates seamlessly into existing jailbreak attacks as a plug-and-play objective.

Result: Replacing GCG’s default prefixes on Llama-3 improves nuanced attack success rates from 14% to 80%, revealing that current safety alignment fails to generalize to new prefixes.

Conclusion: AdvPrefix effectively mitigates limitations of existing jailbreak objectives, demonstrating that current safety alignment is vulnerable to novel prefix-based attacks, with code and selected prefixes released publicly.

Abstract: Many jailbreak attacks on large language models (LLMs) rely on a common objective: making the model respond with the prefix ``Sure, here is (harmful request)’’. While straightforward, this objective has two limitations: limited control over model behaviors, yielding incomplete or unrealistic jailbroken responses, and a rigid format that hinders optimization. We introduce AdvPrefix, a plug-and-play prefix-forcing objective that selects one or more model-dependent prefixes by combining two criteria: high prefilling attack success rates and low negative log-likelihood. AdvPrefix integrates seamlessly into existing jailbreak attacks to mitigate the previous limitations for free. For example, replacing GCG’s default prefixes on Llama-3 improves nuanced attack success rates from 14% to 80%, revealing that current safety alignment fails to generalize to new prefixes. Code and selected prefixes are released at github.com/facebookresearch/jailbreak-objectives.

[595] A unified framework for detecting point and collective anomalies in operating system logs via collaborative transformers

Mohammad Nasirzadeh, Jafar Tahmoresnezhad, Parviz Rashidi-Khazaee

Main category: cs.LG

TL;DR: CoLog is a multimodal log anomaly detection framework that collaboratively encodes different log modalities using transformers and attention mechanisms to detect both point and collective anomalies with high accuracy.

Details

Motivation: Existing unimodal methods ignore different log data modalities, while multimodal methods fail to handle interactions between modalities. Logs contain various information types (modalities) that need collaborative analysis for effective anomaly detection.

Method: Uses collaborative transformers and multi-head impressed attention to learn interactions among log modalities. Incorporates a modality adaptation layer to handle heterogeneity from these interactions, enabling nuanced pattern learning.

Result: Achieves mean precision of 99.63%, mean recall of 99.59%, and mean F1 score of 99.61% across seven benchmark datasets. Superior to state-of-the-art methods in detecting both point and collective anomalies.

Conclusion: CoLog represents significant advancement in log anomaly detection, providing sophisticated solution for cybersecurity, system monitoring, and operational efficiency through unified multimodal framework.

Abstract: Log anomaly detection is crucial for preserving the security of operating systems. Depending on the source of log data collection, various information is recorded in logs that can be considered log modalities. In light of this intuition, unimodal methods often struggle by ignoring the different modalities of log data. Meanwhile, multimodal methods fail to handle the interactions between these modalities. Applying multimodal sentiment analysis to log anomaly detection, we propose CoLog, a framework that collaboratively encodes logs utilizing various modalities. CoLog utilizes collaborative transformers and multi-head impressed attention to learn interactions among several modalities, ensuring comprehensive anomaly detection. To handle the heterogeneity caused by these interactions, CoLog incorporates a modality adaptation layer, which adapts the representations from different log modalities. This methodology enables CoLog to learn nuanced patterns and dependencies within the data, enhancing its anomaly detection capabilities. Extensive experiments demonstrate CoLog’s superiority over existing state-of-the-art methods. Furthermore, in detecting both point and collective anomalies, CoLog achieves a mean precision of 99.63%, a mean recall of 99.59%, and a mean F1 score of 99.61% across seven benchmark datasets for log-based anomaly detection. The comprehensive detection capabilities of CoLog make it highly suitable for cybersecurity, system monitoring, and operational efficiency. CoLog represents a significant advancement in log anomaly detection, providing a sophisticated and effective solution to point and collective anomaly detection through a unified framework and a solution to the complex challenges automatic log data analysis poses. We also provide the implementation of CoLog at https://github.com/NasirzadehMoh/CoLog.

Nathan Buskulic, Luca Calatroni, Lorenzo Rosasco, Silvia Villa

Main category: cs.LG

TL;DR: The paper provides a theoretical framework for learning in blind inverse problems using Linear Minimum Mean Square Estimators, establishing equivalences with Tikhonov regularization and deriving finite-sample error bounds.

Details

Motivation: Blind inverse problems where forward operators are unknown present challenges for existing methods. Data-driven approaches lack interpretability and theoretical guarantees, limiting reliability in applied domains like imaging.

Method: The authors analyze blind inverse problems using Linear Minimum Mean Square Estimators (LMMSEs) framework. They derive closed-form expressions for optimal estimators, establish equivalences with Tikhonov-regularized formulations, and prove convergence results under source conditions.

Result: Theoretical analysis provides rigorous finite-sample error bounds characterizing performance as function of noise level, problem conditioning, and sample size. Bounds explicitly quantify impact of operator randomness and reveal convergence rates as randomness vanishes.

Conclusion: The work provides a solid theoretical foundation for learning in blind inverse problems, bridging the gap between data-driven approaches and rigorous mathematical guarantees, with numerical experiments validating theoretical predictions.

Abstract: Blind inverse problems arise in many experimental settings where the forward operator is partially or entirely unknown. In this context, methods developed for the non-blind case cannot be adapted in a straightforward manner. Recently, data-driven approaches have been proposed to address blind inverse problems, demonstrating strong empirical performance and adaptability. However, these methods often lack interpretability and are not supported by rigorous theoretical guarantees, limiting their reliability in applied domains such as imaging inverse problems. In this work, we shed light on learning in blind inverse problems within the simplified yet insightful framework of Linear Minimum Mean Square Estimators (LMMSEs). We provide an in-depth theoretical analysis, deriving closed-form expressions for optimal estimators and extending classical results. In particular, we establish equivalences with suitably chosen Tikhonov-regularized formulations, where the regularization depends explicitly on the distributions of the unknown signal, the noise, and the random forward operators. We also prove convergence results under appropriate source condition assumptions. Furthermore, we derive rigorous finite-sample error bounds that characterize the performance of learned estimators as a function of the noise level, problem conditioning, and number of available samples. These bounds explicitly quantify the impact of operator randomness and reveal the associated convergence rates as this randomness vanishes. Finally, we validate our theoretical findings through illustrative numerical experiments that confirm the predicted convergence behavior.

[597] Quantifying True Robustness: Synonymity-Weighted Similarity for Trustworthy XAI Evaluation

Christopher Burger

Main category: cs.LG

TL;DR: The paper proposes synonymity weighting to improve evaluation of adversarial attacks on text-based XAI, arguing that standard metrics overestimate attack success by treating all word perturbations equally without considering semantic similarity.

Details

Motivation: Standard information retrieval metrics for evaluating adversarial attacks on text-based XAI are poorly suited for assessing trustworthiness because they treat all word perturbations equally and ignore synonymity, which can misrepresent an attack's true impact on explanation reliability.

Method: The authors apply synonymity weighting, a method that amends standard evaluation measures by incorporating semantic similarity of perturbed words, producing more accurate vulnerability assessments for adversarial attacks on XAI systems.

Result: The approach prevents overestimation of attack success and provides more faithful understanding of XAI system resilience against adversarial manipulation, offering an important tool for assessing AI system robustness.

Conclusion: Synonymity weighting produces more accurate vulnerability assessments for adversarial attacks on text-based XAI, addressing limitations of standard metrics and enabling better evaluation of AI system trustworthiness and robustness.

Abstract: Adversarial attacks challenge the reliability of Explainable AI (XAI) by altering explanations while the model’s output remains unchanged. The success of these attacks on text-based XAI is often judged using standard information retrieval metrics. We argue these measures are poorly suited in the evaluation of trustworthiness, as they treat all word perturbations equally while ignoring synonymity, which can misrepresent an attack’s true impact. To address this, we apply synonymity weighting, a method that amends these measures by incorporating the semantic similarity of perturbed words. This produces more accurate vulnerability assessments and provides an important tool for assessing the robustness of AI systems. Our approach prevents the overestimation of attack success, leading to a more faithful understanding of an XAI system’s true resilience against adversarial manipulation.

[598] Task-driven Heterophilic Graph Structure Learning

Ayushman Raghuvanshi, Gonzalo Mateos, Sundeep Prabhakar Chepuri

Main category: cs.LG

TL;DR: FgGSL is a frequency-guided graph structure learning framework that jointly learns homophilic and heterophilic graph structures with spectral encoding to improve GNN performance on heterophilic graphs.

Details

Motivation: GNNs struggle with heterophilic graphs where connected nodes have dissimilar labels, as traditional GNNs rely on homophily assumption and feature similarity provides weak structural cues for these graphs.

Method: Uses learnable symmetric feature-driven masking to infer complementary graphs, processes them with low- and high-pass graph filter banks, and employs label-based structural loss to promote recovery of both homophilic and heterophilic edges.

Result: Outperforms state-of-the-art GNNs and graph rewiring methods on six heterophilic benchmarks, with derived stability bounds for structural loss and robustness guarantees for filter banks under perturbations.

Conclusion: Combining frequency information with supervised topology inference effectively addresses heterophilic graph challenges, demonstrating benefits of joint learning of complementary graph structures with spectral encoding.

Abstract: Graph neural networks (GNNs) often struggle to learn discriminative node representations for heterophilic graphs, where connected nodes tend to have dissimilar labels and feature similarity provides weak structural cues. We propose frequency-guided graph structure learning (FgGSL), an end-to-end graph inference framework that jointly learns homophilic and heterophilic graph structures along with a spectral encoder. FgGSL employs a learnable, symmetric, feature-driven masking function to infer said complementary graphs, which are processed using pre-designed low- and high-pass graph filter banks. A label-based structural loss explicitly promotes the recovery of homophilic and heterophilic edges, enabling task-driven graph structure learning. We derive stability bounds for the structural loss and establish robustness guarantees for the filter banks under graph perturbations. Experiments on six heterophilic benchmarks demonstrate that FgGSL consistently outperforms state-of-the-art GNNs and graph rewiring methods, highlighting the benefits of combining frequency information with supervised topology inference.

[599] Directly Constructing Low-Dimensional Solution Subspaces in Deep Neural Networks

Yusuf Kalyoncuoglu

Main category: cs.LG

TL;DR: The paper proposes a method to compress neural network classification heads by up to 16x with minimal performance loss by decoupling solution geometry from ambient search space, enabling efficient “Train Big, Deploy Small” through Subspace-Native Distillation.

Details

Motivation: Current deep neural networks use massive high-dimensional widths not for representation but to solve the non-convex optimization problem of finding global minima, which remains intractable for compact networks. This redundancy creates an optimization bottleneck.

Method: A constructive approach that decouples solution geometry from ambient search space, allowing compression of classification heads by huge factors. Introduces Subspace-Native Distillation where the target is defined directly in the constructed subspace, providing stable geometric coordinates for student models.

Result: Empirical demonstration across ResNet-50, ViT, and BERT shows classification heads can be compressed by factors up to 16 with negligible performance degradation.

Conclusion: The approach enables student models to circumvent high-dimensional search problems entirely, potentially realizing the vision of “Train Big, Deploy Small” by providing a stable geometric coordinate system in low-dimensional subspaces.

Abstract: While it is well-established that the weight matrices and feature manifolds of deep neural networks exhibit a low Intrinsic Dimension (ID), current state-of-the-art models still rely on massive high-dimensional widths. This redundancy is not required for representation, but is strictly necessary to solve the non-convex optimization search problem-finding a global minimum, which remains intractable for compact networks. In this work, we propose a constructive approach to bypass this optimization bottleneck. By decoupling the solution geometry from the ambient search space, we empirically demonstrate across ResNet-50, ViT, and BERT that the classification head can be compressed by even huge factors of 16 with negligible performance degradation. This motivates Subspace-Native Distillation as a novel paradigm: by defining the target directly in this constructed subspace, we provide a stable geometric coordinate system for student models, potentially allowing them to circumvent the high-dimensional search problem entirely and realize the vision of Train Big, Deploy Small.

[600] Stochastic Siamese MAE Pretraining for Longitudinal Medical Images

Taha Emre, Arunava Chakravarty, Thomas Pinetz, Dmitrii Lachinov, Martin J. Menten, Hendrik Scholl, Sobha Sivaprasad, Daniel Rueckert, Andrew Lotery, Stefan Sacu, Ursula Schmidt-Erfurth, Hrvoje Bogunović

Main category: cs.LG

TL;DR: STAMP is a stochastic temporal autoencoder framework that enhances MAE with temporal awareness for longitudinal medical imaging by conditioning on time differences and modeling disease progression uncertainty.

Details

Motivation: Current self-supervised methods like MAE lack temporal awareness needed for capturing disease progression in longitudinal medical datasets, and deterministic approaches fail to account for uncertainty in disease evolution.

Method: STAMP uses a Siamese MAE framework with stochastic process encoding, conditioning on time differences between input volumes and reframing MAE reconstruction loss as a conditional variational inference objective.

Result: STAMP pretrained ViT models outperformed existing temporal MAE methods and foundation models on Age-Related Macular Degeneration and Alzheimer’s Disease progression prediction tasks across OCT and MRI datasets.

Conclusion: STAMP successfully incorporates temporal awareness and uncertainty modeling into self-supervised learning for longitudinal medical imaging, improving disease progression prediction by learning non-deterministic temporal dynamics.

Abstract: Temporally aware image representations are crucial for capturing disease progression in 3D volumes of longitudinal medical datasets. However, recent state-of-the-art self-supervised learning approaches like Masked Autoencoding (MAE), despite their strong representation learning capabilities, lack temporal awareness. In this paper, we propose STAMP (Stochastic Temporal Autoencoder with Masked Pretraining), a Siamese MAE framework that encodes temporal information through a stochastic process by conditioning on the time difference between the 2 input volumes. Unlike deterministic Siamese approaches, which compare scans from different time points but fail to account for the inherent uncertainty in disease evolution, STAMP learns temporal dynamics stochastically by reframing the MAE reconstruction loss as a conditional variational inference objective. We evaluated STAMP on two OCT and one MRI datasets with multiple visits per patient. STAMP pretrained ViT models outperformed both existing temporal MAE methods and foundation models on different late stage Age-Related Macular Degeneration and Alzheimer’s Disease progression prediction which require models to learn the underlying non-deterministic temporal dynamics of the diseases.

[601] Dynamic Subspace Composition: Efficient Adaptation via Contractive Basis Expansion

Vladimer Khasia

Main category: cs.LG

TL;DR: DSC is a new MoE framework that reduces parameter complexity from O(M rd) to O(M d) by constructing compositional rank-K approximations from decoupled unit-norm basis vectors, while ensuring continuity and stability.

Details

Motivation: Mixture of Experts models suffer from representation collapse and gradient instability despite scaling capacity, needing more efficient and stable parameterization methods.

Method: Dynamic Subspace Composition approximates context-dependent weights via sparse expansion of shared basis bank, modeling weight updates as residual trajectory within Star-Shaped Domain with Magnitude-Gated Simplex Interpolation for continuity.

Result: Reduces parameter complexity from O(M rd) to O(M d) and memory traffic to O(Kd), with Frame-Theoretic regularization and spectral constraints providing rigorous worst-case bounds on dynamic updates.

Conclusion: DSC offers a more efficient and stable alternative to standard Mixture-of-LoRAs with better parameter complexity and theoretical guarantees for dynamic weight updates.

Abstract: Mixture of Experts (MoE) models scale capacity but often suffer from representation collapse and gradient instability. We propose Dynamic Subspace Composition (DSC), a framework that approximates context-dependent weights via a state-dependent, sparse expansion of a shared basis bank. Formally, DSC models the weight update as a residual trajectory within a Star- Shaped Domain, employing a Magnitude-Gated Simplex Interpolation to ensure continuity at the identity. Unlike standard Mixture-of-LoRAs, which incurs O(M rd) parameter complexity by retrieving independent rank-r matrices, DSC constructs a compositional rank-K approximation from decoupled unit-norm basis vectors. This reduces parameter complexity to O(M d) and memory traffic to O(Kd), while Frame-Theoretic regularization and spectral constraints provide rigorous worst-case bounds on the dynamic update. The code is available at https://github. com/VladimerKhasia/DSC

[602] Rotation Control Unlearning: Quantifying and Controlling Continuous Unlearning for LLM with The Cognitive Rotation Space

Xiang Zhang, Kun Wei, Xu Yang, Chenghao Xu, Su Yan, Cheng Deng

Main category: cs.LG

TL;DR: RCU is a machine unlearning method that uses rotational salience weights and cognitive rotation spaces to enable continuous unlearning without needing retained datasets, preventing catastrophic utility loss.

Details

Motivation: Existing machine unlearning methods for LLMs have two major limitations: they require retained datasets to preserve model utility, and they suffer from cumulative catastrophic utility loss when handling continuous unlearning requests.

Method: Rotation Control Unlearning (RCU) uses rotational salience weights to quantify unlearning degree, creates a cognitive rotation space via skew symmetric loss, and employs orthogonal rotation axes regularization to make continuous unlearning directions mutually perpendicular.

Result: Experiments on multiple datasets show RCU achieves state-of-the-art performance without requiring retained datasets, effectively addressing cumulative catastrophic utility loss.

Conclusion: RCU provides an effective solution for continuous machine unlearning in LLMs that eliminates the need for retained datasets while preventing catastrophic utility degradation through rotational control mechanisms.

Abstract: As Large Language Models (LLMs) become increasingly prevalent, their security vulnerabilities have already drawn attention. Machine unlearning is introduced to seek to mitigate these risks by removing the influence of undesirable data. However, existing methods not only rely on the retained dataset to preserve model utility, but also suffer from cumulative catastrophic utility loss under continuous unlearning requests. To solve this dilemma, we propose a novel method, called Rotation Control Unlearning (RCU), which leverages the rotational salience weight of RCU to quantify and control the unlearning degree in the continuous unlearning process. The skew symmetric loss is designed to construct the existence of the cognitive rotation space, where the changes of rotational angle can simulate the continuous unlearning process. Furthermore, we design an orthogonal rotation axes regularization to enforce mutually perpendicular rotation directions for continuous unlearning requests, effectively minimizing interference and addressing cumulative catastrophic utility loss. Experiments on multiple datasets confirm that our method without retained dataset achieves SOTA performance.

[603] Eliminating Inductive Bias in Reward Models with Information-Theoretic Guidance

Zhuo Li, Pengyu Cheng, Zhechao Yu, Feifei Tong, Anningzhe Gao, Tsung-Hui Chang, Xiang Wan, Erchao Zhao, Xiaoxi Jiang, Guanjun Jiang

Main category: cs.LG

TL;DR: DIR is an information-theoretic method that debiases reward models by maximizing mutual information with human preferences while minimizing mutual information with biased attributes, handling complex non-linear biases better than previous approaches.

Details

Motivation: Reward models in RLHF often suffer from inductive biases in training data (like response length preference) that lead to overfitting and reward hacking. Existing debiasing methods are limited to single biases or simple linear correlations.

Method: Proposes DIR (Debiasing via Information optimization for RM), inspired by information bottleneck theory. Maximizes mutual information between RM scores and human preference pairs while minimizing mutual information between RM outputs and biased attributes of inputs.

Result: DIR effectively mitigates three types of inductive biases (response length, sycophancy, format) and enhances RLHF performance across diverse benchmarks with better generalization abilities.

Conclusion: DIR provides a theoretically grounded approach to handle complex non-linear biases in reward modeling, extending real-world application scenarios for RM debiasing and improving alignment with human values.

Abstract: Reward models (RMs) are essential in reinforcement learning from human feedback (RLHF) to align large language models (LLMs) with human values. However, RM training data is commonly recognized as low-quality, containing inductive biases that can easily lead to overfitting and reward hacking. For example, more detailed and comprehensive responses are usually human-preferred but with more words, leading response length to become one of the inevitable inductive biases. A limited number of prior RM debiasing approaches either target a single specific type of bias or model the problem with only simple linear correlations, \textit{e.g.}, Pearson coefficients. To mitigate more complex and diverse inductive biases in reward modeling, we introduce a novel information-theoretic debiasing method called \textbf{D}ebiasing via \textbf{I}nformation optimization for \textbf{R}M (DIR). Inspired by the information bottleneck (IB), we maximize the mutual information (MI) between RM scores and human preference pairs, while minimizing the MI between RM outputs and biased attributes of preference inputs. With theoretical justification from information theory, DIR can handle more sophisticated types of biases with non-linear correlations, broadly extending the real-world application scenarios for RM debiasing methods. In experiments, we verify the effectiveness of DIR with three types of inductive biases: \textit{response length}, \textit{sycophancy}, and \textit{format}. We discover that DIR not only effectively mitigates target inductive biases but also enhances RLHF performance across diverse benchmarks, yielding better generalization abilities. The code and training recipes are available at https://github.com/Qwen-Applications/DIR.

[604] FRoD: Full-Rank Efficient Fine-Tuning with Rotational Degrees for Fast Convergence

Guoan Wan, Tianyu Chen, Fangzheng Feng, Haoyi Zhou, Runhua Xu

Main category: cs.LG

TL;DR: FRoD is a novel parameter-efficient fine-tuning method that combines hierarchical joint decomposition with rotational degrees of freedom to achieve full-rank expressiveness while using only 1.72% trainable parameters, matching full fine-tuning accuracy on 20 benchmarks.

Details

Motivation: Current PEFT methods like LoRA face limitations due to low-rank constraints, resulting in slow convergence and limited adaptation capacity. This trade-off hampers their ability to capture complex patterns needed for diverse tasks, creating a need for more expressive yet efficient fine-tuning approaches.

Method: FRoD combines hierarchical joint decomposition with rotational degrees of freedom. It extracts a globally shared basis across layers and injects sparse, learnable perturbations into scaling factors to enable flexible full-rank updates, enhancing both expressiveness and efficiency.

Result: On 20 benchmarks spanning vision, reasoning, and language understanding, FRoD matches full model fine-tuning in accuracy while using only 1.72% of trainable parameters under identical training budgets, demonstrating faster and more robust convergence.

Conclusion: FRoD successfully addresses the limitations of existing PEFT methods by achieving full-rank expressiveness with minimal parameter updates, offering a practical solution for efficient adaptation of large foundation models to diverse downstream tasks.

Abstract: Parameter-efficient fine-tuning (PEFT) methods have emerged as a practical solution for adapting large foundation models to downstream tasks, reducing computational and memory costs by updating only a small subset of parameters. Among them, approaches like LoRA aim to strike a balance between efficiency and expressiveness, but often suffer from slow convergence and limited adaptation capacity due to their inherent low-rank constraints. This trade-off hampers the ability of PEFT methods to capture complex patterns needed for diverse tasks. To address these challenges, we propose FRoD, a novel fine-tuning method that combines hierarchical joint decomposition with rotational degrees of freedom. By extracting a globally shared basis across layers and injecting sparse, learnable perturbations into scaling factors for flexible full-rank updates, FRoD enhances expressiveness and efficiency, leading to faster and more robust convergence. On 20 benchmarks spanning vision, reasoning, and language understanding, FRoD matches full model fine-tuning in accuracy, while using only 1.72% of trainable parameters under identical training budgets.

[605] ML Compass: Navigating Capability, Cost, and Compliance Trade-offs in AI Model Deployment

Vassilis Digalakis, Ramayya Krishnan, Gonzalo Martin Fernandez, Agni Orfanoudaki

Main category: cs.LG

TL;DR: ML Compass: A framework for AI model selection that considers user utility, deployment costs, and compliance requirements, bridging the gap between capability leaderboards and actual deployment decisions.

Details

Motivation: Capability leaderboards don't translate well to deployment decisions, creating a "capability-deployment gap." Organizations need to consider application outcomes, operating constraints, and the capability-cost frontier when choosing AI models.

Method: Develop ML Compass framework treating model selection as constrained optimization over capability-cost frontier. Theoretical analysis characterizes optimal configurations under parametric frontier. Implementation pipeline extracts internal measures, estimates empirical frontier, learns task-specific utility functions, and recommends models.

Result: Framework produces recommendations and deployment-aware leaderboards that differ from capability-only rankings. Validated with conversational (PRISM Alignment) and healthcare (HealthBench) case studies, showing how trade-offs between capability, cost, and safety shape optimal model choice.

Conclusion: ML Compass bridges the capability-deployment gap by providing a systematic approach to AI model selection that accounts for real-world constraints, enabling organizations to make better deployment decisions based on comprehensive utility-cost-compliance trade-offs.

Abstract: We study how organizations should select among competing AI models when user utility, deployment costs, and compliance requirements jointly matter. Widely used capability leaderboards do not translate directly into deployment decisions, creating a capability – deployment gap; to bridge it, we take a systems-level view in which model choice is tied to application outcomes, operating constraints, and a capability-cost frontier. We develop ML Compass, a framework that treats model selection as constrained optimization over this frontier. On the theory side, we characterize optimal model configurations under a parametric frontier and show a three-regime structure in optimal internal measures: some dimensions are pinned at compliance minima, some saturate at maximum levels, and the remainder take interior values governed by frontier curvature. We derive comparative statics that quantify how budget changes, regulatory tightening, and technological progress propagate across capability dimensions and costs. On the implementation side, we propose a pipeline that (i) extracts low-dimensional internal measures from heterogeneous model descriptors, (ii) estimates an empirical frontier from capability and cost data, (iii) learns a user- or task-specific utility function from interaction outcome data, and (iv) uses these components to target capability-cost profiles and recommend models. We validate ML Compass with two case studies: a general-purpose conversational setting using the PRISM Alignment dataset and a healthcare setting using a custom dataset we build using HealthBench. In both environments, our framework produces recommendations – and deployment-aware leaderboards based on predicted deployment value under constraints – that can differ materially from capability-only rankings, and clarifies how trade-offs between capability, cost, and safety shape optimal model choice.

[606] Joint Link Adaptation and Device Scheduling Approach for URLLC Industrial IoT Network: A DRL-based Method with Bayesian Optimization

Wei Gao, Paul Zheng, Peng Wu, Yulin Hu, Anke Schmeink

Main category: cs.LG

TL;DR: A BO-driven TD3 method for joint link adaptation and device scheduling in IIoT URLLC networks with imperfect CSI, achieving faster convergence and higher sum-rate.

Details

Motivation: Industrial IoT networks need to support multi-device dynamic URLLC with strict reliability requirements, but face challenges from imperfect CSI, error sample imbalance, and algorithm sensitivity that reduce convergence speed and reliability.

Method: Proposes a Bayesian optimization-driven Twin Delayed Deep Deterministic Policy Gradient (BO-TD3) method that jointly optimizes device serving order sequence and modulation/coding scheme based on imperfect CSI. Uses BO-based training mechanism to improve convergence speed with reliable learning direction and sample selection.

Result: The proposed algorithm achieves faster convergence and higher sum-rate performance compared to existing solutions, as demonstrated through extensive simulations.

Conclusion: The BO-driven TD3 approach effectively addresses the challenges of imperfect CSI and sample imbalance in IIoT URLLC networks, providing improved joint link adaptation and device scheduling for multi-device dynamic communication.

Abstract: In this article, we consider an industrial internet of things (IIoT) network supporting multi-device dynamic ultra-reliable low-latency communication (URLLC) while the channel state information (CSI) is imperfect. A joint link adaptation (LA) and device scheduling (including the order) design is provided, aiming at maximizing the total transmission rate under strict block error rate (BLER) constraints. In particular, a Bayesian optimization (BO) driven Twin Delayed Deep Deterministic Policy Gradient (TD3) method is proposed, which determines the device served order sequence and the corresponding modulation and coding scheme (MCS) adaptively based on the imperfect CSI. Note that the imperfection of CSI, error sample imbalance in URLLC networks, as well as the parameter sensitivity nature of the TD3 algorithm likely diminish the algorithm’s convergence speed and reliability. To address such an issue, we proposed a BO based training mechanism for the convergence speed improvement, which provides a more reliable learning direction and sample selection method to track the imbalance sample problem. Via extensive simulations, we show that the proposed algorithm achieves faster convergence and higher sum-rate performance compared to existing solutions.

[607] Trustworthy Machine Learning under Distribution Shifts

Zhuo Huang

Main category: cs.LG

TL;DR: This paper focuses on Trustworthy Machine Learning under Distribution Shifts, addressing three common distribution shift types (Perturbation, Domain, and Modality Shifts) through three trustworthiness aspects (Robustness, Explainability, Adaptability) to enhance AI reliability and usefulness.

Details

Motivation: Despite AI's impressive achievements and scaling laws enabling general intelligence, distribution shift remains a fundamental limitation that undermines ML system reliability and usefulness, while also causing trust issues for AI systems.

Method: The research systematically studies three distribution shift scenarios: Perturbation Shift, Domain Shift, and Modality Shift. For each scenario, trustworthiness is investigated through three dimensions: Robustness, Explainability, and Adaptability. The approach involves proposing effective solutions and fundamental insights to address these challenges.

Result: The paper presents solutions and insights for enhancing trustworthy ML under distribution shifts, aiming to improve critical ML problems such as efficiency, adaptability, and safety across different shift scenarios.

Conclusion: Distribution shift is a critical challenge limiting AI reliability and trustworthiness. By systematically addressing three types of distribution shifts through robustness, explainability, and adaptability dimensions, the research aims to expand AI’s reliability, versatility, and responsibility for real-world applications.

Abstract: Machine Learning (ML) has been a foundational topic in artificial intelligence (AI), providing both theoretical groundwork and practical tools for its exciting advancements. From ResNet for visual recognition to Transformer for vision-language alignment, the AI models have achieved superior capability to humans. Furthermore, the scaling law has enabled AI to initially develop general intelligence, as demonstrated by Large Language Models (LLMs). To this stage, AI has had an enormous influence on society and yet still keeps shaping the future for humanity. However, distribution shift remains a persistent ``Achilles’ heel’’, fundamentally limiting the reliability and general usefulness of ML systems. Moreover, generalization under distribution shift would also cause trust issues for AIs. Motivated by these challenges, my research focuses on \textit{Trustworthy Machine Learning under Distribution Shifts}, with the goal of expanding AI’s robustness, versatility, as well as its responsibility and reliability. We carefully study the three common distribution shifts into: (1) Perturbation Shift, (2) Domain Shift, and (3) Modality Shift. For all scenarios, we also rigorously investigate trustworthiness via three aspects: (1) Robustness, (2) Explainability, and (3) Adaptability. Based on these dimensions, we propose effective solutions and fundamental insights, meanwhile aiming to enhance the critical ML problems, such as efficiency, adaptability, and safety.

[608] EEG-based Graph-guided Domain Adaptation for Robust Cross-Session Emotion Recognition

Maryam Mirzaei, Farzaneh Shayegh, Hamed Narimani

Main category: cs.LG

TL;DR: EGDA framework improves EEG-based emotion recognition across sessions by aligning global and class-specific distributions with graph regularization, achieving over 80% accuracy on SEED-IV dataset.

Details

Motivation: EEG is reliable for emotion recognition but suffers from cross-session variability that hinders model generalization. Need to address session-to-session distribution shifts while preserving EEG data structure.

Method: Propose EGDA framework that jointly aligns global (marginal) and class-specific (conditional) distributions across sessions, while using graph regularization to preserve intrinsic EEG data structure.

Result: Achieves robust cross-session performance on SEED-IV dataset with accuracies of 81.22%, 80.15%, and 83.27% across three transfer tasks, outperforming baseline methods. Gamma band and central-parietal/prefrontal regions identified as most discriminative.

Conclusion: EGDA effectively reduces cross-session discrepancies in EEG-based emotion recognition through joint distribution alignment and structural preservation, demonstrating practical value for reliable human-machine interaction systems.

Abstract: Accurate recognition of human emotional states is critical for effective human-machine interaction. Electroencephalography (EEG) offers a reliable source for emotion recognition due to its high temporal resolution and its direct reflection of neural activity. Nevertheless, variations across recording sessions present a major challenge for model generalization. To address this issue, we propose EGDA, a framework that reduces cross-session discrepancies by jointly aligning the global (marginal) and class-specific (conditional) distributions, while preserving the intrinsic structure of EEG data through graph regularization. Experimental results on the SEED-IV dataset demonstrate that EGDA achieves robust cross-session performance, obtaining accuracies of 81.22%, 80.15%, and 83.27% across three transfer tasks, and surpassing several baseline methods. Furthermore, the analysis highlights the Gamma frequency band as the most discriminative and identifies the central-parietal and prefrontal brain regions as critical for reliable emotion recognition.

[609] Distribution-Free Process Monitoring with Conformal Prediction

Christopher Burger

Main category: cs.LG

TL;DR: Hybrid framework integrates Conformal Prediction with Statistical Process Control to overcome statistical assumption limitations, providing more robust quality monitoring with uncertainty visualization and formal anomaly detection.

Details

Motivation: Traditional SPC is limited by reliance on often-violated statistical assumptions, making it unreliable for modern complex manufacturing environments that require more robust quality control methods.

Method: Proposes a hybrid framework integrating Conformal Prediction with SPC, featuring two novel applications: 1) Conformal-Enhanced Control Charts that visualize process uncertainty with ‘uncertainty spikes’, and 2) Conformal-Enhanced Process Monitoring that reframes multivariate control as formal anomaly detection using p-value charts.

Result: The framework provides more robust and statistically rigorous quality control while maintaining the interpretability and ease of use of classic SPC methods, offering distribution-free, model-agnostic guarantees.

Conclusion: The hybrid Conformal Prediction-SPC framework enhances traditional quality control by addressing statistical assumption limitations, enabling proactive monitoring through uncertainty visualization and formal anomaly detection in complex manufacturing environments.

Abstract: Traditional Statistical Process Control (SPC) is essential for quality management but is limited by its reliance on often violated statistical assumptions, leading to unreliable monitoring in modern, complex manufacturing environments. This paper introduces a hybrid framework that enhances SPC by integrating the distribution free, model agnostic guarantees of Conformal Prediction. We propose two novel applications: Conformal-Enhanced Control Charts, which visualize process uncertainty and enable proactive signals like ‘uncertainty spikes’, and Conformal-Enhanced Process Monitoring, which reframes multivariate control as a formal anomaly detection problem using an intuitive p-value chart. Our framework provides a more robust and statistically rigorous approach to quality control while maintaining the interpretability and ease of use of classic methods.

[610] Le Cam Distortion: A Decision-Theoretic Framework for Robust Transfer Learning

Deniz Akdemir

Main category: cs.LG

TL;DR: The paper critiques standard Unsupervised Domain Adaptation (UDA) for causing negative transfer when domains are unequally informative, and proposes Le Cam Distortion framework using directional simulability instead of symmetric invariance for risk-controlled transfer learning.

Details

Motivation: Standard UDA approaches enforce feature invariance between source and target domains, but this is fundamentally flawed when domains are unequally informative (e.g., high-quality vs degraded sensors). Strict invariance requires information destruction, leading to catastrophic negative transfer in safety-critical applications like medical imaging and autonomous systems.

Method: Proposes a decision-theoretic framework based on Le Cam’s theory of statistical experiments. Replaces symmetric invariance with directional simulability using constructive approximations. Introduces Le Cam Distortion quantified by Deficiency Distance δ(E₁, E₂) as a rigorous upper bound for transfer risk. Learns a kernel that simulates the target from the source without degrading source information.

Result: Across five experiments: (1) Near-perfect frequency estimation in HLA genomics (r=0.999 correlation), (2) Zero source utility loss in CIFAR-10 classification (81.2% accuracy preserved vs 34.7% drop for CycleGAN), (3) Safe policy transfer in RL control where invariance-based methods suffer catastrophic collapse.

Conclusion: Le Cam Distortion provides the first principled framework for risk-controlled transfer learning in domains where negative transfer is unacceptable (medical imaging, autonomous systems, precision medicine), enabling safe transfer without source degradation.

Abstract: Distribution shift is the defining challenge of real-world machine learning. The dominant paradigm–Unsupervised Domain Adaptation (UDA)–enforces feature invariance, aligning source and target representations via symmetric divergence minimization [Ganin et al., 2016]. We demonstrate that this approach is fundamentally flawed: when domains are unequally informative (e.g., high-quality vs degraded sensors), strict invariance necessitates information destruction, causing “negative transfer” that can be catastrophic in safety-critical applications [Wang et al., 2019]. We propose a decision-theoretic framework grounded in Le Cam’s theory of statistical experiments [Le Cam, 1986], using constructive approximations to replace symmetric invariance with directional simulability. We introduce Le Cam Distortion, quantified by the Deficiency Distance $δ(E_1, E_2)$, as a rigorous upper bound for transfer risk conditional on simulability. Our framework enables transfer without source degradation by learning a kernel that simulates the target from the source. Across five experiments (genomics, vision, reinforcement learning), Le Cam Distortion achieves: (1) near-perfect frequency estimation in HLA genomics (correlation $r=0.999$, matching classical methods), (2) zero source utility loss in CIFAR-10 image classification (81.2% accuracy preserved vs 34.7% drop for CycleGAN), and (3) safe policy transfer in RL control where invariance-based methods suffer catastrophic collapse. Le Cam Distortion provides the first principled framework for risk-controlled transfer learning in domains where negative transfer is unacceptable: medical imaging, autonomous systems, and precision medicine.

[611] BOAD: Discovering Hierarchical Software Engineering Agents via Bandit Optimization

Iris Xu, Guangtao Zeng, Zexue He, Charles Jin, Aldo Pareja, Dan Gutfreund, Chuang Gan, Zhang-Wei Hong

Main category: cs.LG

TL;DR: BOAD proposes an automated multi-agent system for software engineering tasks using bandit optimization to discover effective agent hierarchies, outperforming single-agent and manually designed systems on challenging SWE benchmarks.

Details

Motivation: LLMs struggle with real-world software engineering problems that are long-horizon and out-of-distribution. Existing single-agent systems force models to retain irrelevant context, leading to poor generalization. Human engineers decompose complex problems, motivating the need for specialized sub-agents.

Method: Proposes Bandit Optimization for Agent Design (BOAD), formulating hierarchy discovery as a multi-armed bandit problem where each arm represents a candidate sub-agent. The reward measures helpfulness when collaborating with others, enabling efficient exploration of sub-agent designs under limited evaluation budgets.

Result: On SWE-bench-Verified, BOAD outperforms single-agent and manually designed multi-agent systems. On SWE-bench-Live, their 36B system ranks second on the leaderboard, surpassing larger models like GPT-4 and Claude.

Conclusion: Automatically discovered hierarchical multi-agent systems significantly improve generalization on challenging long-horizon software engineering tasks, demonstrating the effectiveness of structured agent coordination over monolithic designs.

Abstract: Large language models (LLMs) have shown strong reasoning and coding capabilities, yet they struggle to generalize to real-world software engineering (SWE) problems that are long-horizon and out of distribution. Existing systems often rely on a single agent to handle the entire workflow-interpreting issues, navigating large codebases, and implementing fixes-within one reasoning chain. Such monolithic designs force the model to retain irrelevant context, leading to spurious correlations and poor generalization. Motivated by how human engineers decompose complex problems, we propose structuring SWE agents as orchestrators coordinating specialized sub-agents for sub-tasks such as localization, editing, and validation. The challenge lies in discovering effective hierarchies automatically: as the number of sub-agents grows, the search space becomes combinatorial, and it is difficult to attribute credit to individual sub-agents within a team. We address these challenges by formulating hierarchy discovery as a multi-armed bandit (MAB) problem, where each arm represents a candidate sub-agent and the reward measures its helpfulness when collaborating with others. This framework, termed Bandit Optimization for Agent Design (BOAD), enables efficient exploration of sub-agent designs under limited evaluation budgets. On SWE-bench-Verified, BOAD outperforms single-agent and manually designed multi-agent systems. On SWE-bench-Live, featuring more recent and out-of-distribution issues, our 36B system ranks second on the leaderboard at the time of evaluation, surpassing larger models such as GPT-4 and Claude. These results demonstrate that automatically discovered hierarchical multi-agent systems significantly improve generalization on challenging long-horizon SWE tasks. Code is available at https://github.com/iamxjy/BOAD-SWE-Agent.

[612] Random Controlled Differential Equations

Francesco Piatti, Thomas Cass, William F. Turner

Main category: cs.LG

TL;DR: A training-efficient framework for time-series learning using random features with controlled differential equations as continuous-time reservoirs, with two variants: Random Fourier CDEs and Random Rough DEs that approximate signature kernels.

Details

Motivation: To create efficient time-series learning models that combine the inductive bias of signature methods with the computational efficiency of random features, avoiding expensive explicit signature computations while maintaining strong theoretical foundations.

Method: Two main approaches: (1) Random Fourier CDEs (RF-CDEs) that lift input signals using random Fourier features before controlled differential equation dynamics, approximating RBF-enhanced sequence models; (2) Random Rough DEs (R-RDEs) that operate directly on rough-path inputs via log-ODE discretization using log-signatures to capture higher-order temporal interactions.

Result: Theoretical proof that in infinite-width limit, models induce RBF-lifted signature kernel and rough signature kernel respectively; empirical evaluation shows competitive or state-of-the-art performance across time-series benchmarks.

Conclusion: The framework provides practical alternatives to explicit signature computations, retaining signature methods’ inductive bias while benefiting from random features’ efficiency, offering unified perspective on random-feature reservoirs, continuous-time architectures, and path-signature theory.

Abstract: We introduce a training-efficient framework for time-series learning that combines random features with controlled differential equations (CDEs). In this approach, large randomly parameterized CDEs act as continuous-time reservoirs, mapping input paths to rich representations. Only a linear readout layer is trained, resulting in fast, scalable models with strong inductive bias. Building on this foundation, we propose two variants: (i) Random Fourier CDEs (RF-CDEs): these lift the input signal using random Fourier features prior to the dynamics, providing a kernel-free approximation of RBF-enhanced sequence models; (ii) Random Rough DEs (R-RDEs): these operate directly on rough-path inputs via a log-ODE discretization, using log-signatures to capture higher-order temporal interactions while remaining stable and efficient. We prove that in the infinite-width limit, these model induces the RBF-lifted signature kernel and the rough signature kernel, respectively, offering a unified perspective on random-feature reservoirs, continuous-time deep architectures, and path-signature theory. We evaluate both models across a range of time-series benchmarks, demonstrating competitive or state-of-the-art performance. These methods provide a practical alternative to explicit signature computations, retaining their inductive bias while benefiting from the efficiency of random features.

[613] End-to-End Test-Time Training for Long Context

Arnuv Tandon, Karan Dalal, Xinhao Li, Daniel Koceja, Marcel Rød, Sam Buchanan, Xiaolong Wang, Jure Leskovec, Sanmi Koyejo, Tatsunori Hashimoto, Carlos Guestrin, Jed McCaleb, Yejin Choi, Yu Sun

Main category: cs.LG

TL;DR: TTT-E2E: A test-time training approach for long-context language modeling that uses standard Transformers with sliding-window attention, learns at test time via next-token prediction, and achieves full-attention scaling with constant inference latency.

Details

Motivation: The paper addresses long-context language modeling by reframing it as a continual learning problem rather than an architecture design challenge. Current approaches often require specialized architectures, but the authors propose using standard Transformers with a different learning paradigm.

Method: Uses a standard Transformer with sliding-window attention. At test time, the model continues learning via next-token prediction on the given context, compressing context into weights. At training time, meta-learning improves initialization for test-time learning. This end-to-end test-time training approach contrasts with previous methods.

Result: For 3B models trained with 164B tokens, TTT-E2E scales with context length similarly to Transformers with full attention, outperforming Mamba 2 and Gated DeltaNet. It achieves constant inference latency regardless of context length (2.7x faster than full attention for 128K context).

Conclusion: Test-time training with meta-learning enables standard Transformers to handle long contexts effectively, achieving full-attention scaling performance with constant inference latency, offering a promising alternative to specialized long-context architectures.

Abstract: We formulate long-context language modeling as a problem in continual learning rather than architecture design. Under this formulation, we only use a standard architecture – a Transformer with sliding-window attention. However, our model continues learning at test time via next-token prediction on the given context, compressing the context it reads into its weights. In addition, we improve the model’s initialization for learning at test time via meta-learning at training time. Overall, our method, a form of Test-Time Training (TTT), is End-to-End (E2E) both at test time (via next-token prediction) and training time (via meta-learning), in contrast to previous forms. We conduct extensive experiments with a focus on scaling properties. In particular, for 3B models trained with 164B tokens, our method (TTT-E2E) scales with context length in the same way as Transformer with full attention, while others, such as Mamba 2 and Gated DeltaNet, do not. However, similar to RNNs, TTT-E2E has constant inference latency regardless of context length, making it 2.7 times faster than full attention for 128K context. Our code is publicly available.

[614] CarSpeedNet: Learning-Based Speed Estimation from Accelerometer-Only Inertial Sensing

Barak Or

Main category: cs.LG

TL;DR: CarSpeedNet: A learning-based framework that estimates vehicle speed using only raw accelerometer data from a smartphone, without needing gyroscopes, wheel encoders, or external positioning.

Details

Motivation: Traditional velocity estimation methods rely on wheel encoders, inertial navigation units, or multi-sensor fusion, which may not be available or reliable in low-cost, redundancy-constrained, or degraded operational scenarios where sensors can fail, drift, or become temporarily unavailable.

Method: CarSpeedNet uses a learning-based inertial estimation framework that infers speed directly from raw accelerometer measurements. Instead of explicitly estimating physical states like orientation or sensor bias, it performs implicit latent-state approximation from temporal accelerometer data.

Result: The paper investigates the feasibility of estimating vehicle speed using only a single low-cost inertial sensor (a three-axis accelerometer embedded in a commodity smartphone), representing an extreme case of sensing sparsity where classical integration-based or filter-based approaches become unstable.

Conclusion: The proposed approach addresses the limitations of conventional velocity estimation systems by enabling speed estimation in scenarios where traditional sensor configurations are unavailable or unreliable, using only accelerometer data through a learning-based framework.

Abstract: Velocity estimation is a core component of state estimation and sensor fusion pipelines in mobile robotics and autonomous ground systems, directly affecting navigation accuracy, control stability, and operational safety. In conventional systems, velocity is obtained through wheel encoders, inertial navigation units, or tightly coupled multi-sensor fusion architectures. However, these sensing configurations are not always available or reliable, particularly in low-cost, redundancy-constrained, or degraded operational scenarios where sensors may fail, drift, or become temporarily unavailable. This paper investigates the feasibility of estimating vehicle speed using only a single low-cost inertial sensor: a three-axis accelerometer embedded in a commodity smartphone. We present CarSpeedNet, a learning-based inertial estimation framework designed to infer speed directly from raw accelerometer measurements, without access to gyroscopes, wheel odometry, vehicle bus data, or external positioning during inference. From a sensor fusion perspective, this setting represents an extreme case of sensing sparsity, in which classical integration-based or filter-based approaches become unstable due to bias accumulation and partial observability. Rather than explicitly estimating physical states such as orientation or sensor bias, the proposed approach performs implicit latent-state approximation from temporal accelerometer data.

[615] Application-Driven Innovation in Machine Learning

David Rolnick, Alan Aspuru-Guzik, Sara Beery, Bistra Dilkina, Priya L. Donti, Marzyeh Ghassemi, Hannah Kerner, Claire Monteleoni, Esther Rolf, Milind Tambe, Adam White

Main category: cs.LG

TL;DR: The paper argues that application-driven ML research is undervalued compared to methods-driven work, despite offering significant impact potential for both applications and ML theory itself.

Details

Motivation: Application-driven ML research is systematically undervalued in the ML community despite its growing importance as ML applications proliferate. Such work can inspire innovative algorithms and have significant impact both in application domains and in advancing ML theory itself.

Method: Position paper approach: describes the paradigm of application-driven research in ML, contrasts it with methods-driven research, illustrates benefits and synergies, and analyzes how current academic practices (reviewing, hiring, teaching) hinder application-driven innovation.

Result: The paper identifies that current ML community practices in reviewing, hiring, and teaching systematically disadvantage application-driven research, despite its potential for significant impact and productive synergy with methods-driven work.

Conclusion: The ML community needs to reform its practices to better value and support application-driven research, which can productively synergize with methods-driven work and drive innovation inspired by real-world challenges.

Abstract: In this position paper, we argue that application-driven research has been systemically under-valued in the machine learning community. As applications of machine learning proliferate, innovative algorithms inspired by specific real-world challenges have become increasingly important. Such work offers the potential for significant impact not merely in domains of application but also in machine learning itself. In this paper, we describe the paradigm of application-driven research in machine learning, contrasting it with the more standard paradigm of methods-driven research. We illustrate the benefits of application-driven machine learning and how this approach can productively synergize with methods-driven work. Despite these benefits, we find that reviewing, hiring, and teaching practices in machine learning often hold back application-driven innovation. We outline how these processes may be improved.

[616] Aligning Agents like Large Language Models

Adam Jelley, Yuhan Cao, Dave Bignell, Amos Storkey, Sam Devlin, Tabish Rashid

Main category: cs.LG

TL;DR: This position paper proposes training decision-making agents like Large Language Models (LLMs) to achieve more general, robust, and aligned behaviors in complex 3D environments.

Details

Motivation: Reinforcement learning for agents in complex 3D environments requires carefully designed reward functions and struggles with generalization, while LLMs demonstrate impressive general capabilities but struggle with action in complex environments. The paper aims to bridge this gap by applying LLM training methodologies to agents.

Method: The authors draw explicit analogies between decision-making agents and LLMs, then provide a proof-of-concept demonstrating how the LLM training pipeline (large-scale pre-training and post-training alignment) can be applied to train an agent in a 3D video game environment from pixels. They investigate the importance of each stage of the LLM training pipeline.

Result: The paper provides guidance and insights for successfully applying LLM training approaches to agents, demonstrating the feasibility of this approach through their proof-of-concept implementation in a 3D video game environment.

Conclusion: This work offers an alternative perspective to contemporary LLM Agents, suggesting that leveraging LLM training methodologies for decision-making agents could illuminate a path toward developing more generally capable agents for video games and beyond.

Abstract: Training agents to act competently in complex 3D environments from high-dimensional visual information is challenging. Reinforcement learning is conventionally used to train such agents, but requires a carefully designed reward function, and is difficult to scale to obtain robust agents that generalize to new tasks. In contrast, Large Language Models (LLMs) demonstrate impressively general capabilities resulting from large-scale pre-training and post-training alignment, but struggle to act in complex environments. This position paper draws explicit analogies between decision-making agents and LLMs, and argues that agents should be trained like LLMs to achieve more general, robust, and aligned behaviors. We provide a proof-of-concept to demonstrate how the procedure for training LLMs can be used to train an agent in a 3D video game environment from pixels. We investigate the importance of each stage of the LLM training pipeline, while providing guidance and insights for successfully applying this approach to agents. Our paper provides an alternative perspective to contemporary LLM Agents on how recent progress in LLMs can be leveraged for decision-making agents, and we hope will illuminate a path towards developing more generally capable agents for video games and beyond. Project summary and videos: https://adamjelley.github.io/aligning-agents-like-llms .

[617] Fair Class-Incremental Learning using Sample Weighting

Jaeyoung Park, Minsu Kim, Steven Euijong Whang

Main category: cs.LG

TL;DR: The paper proposes a fairness-aware sample weighting (FSW) algorithm for class-incremental learning to address unfair catastrophic forgetting in sensitive groups, achieving better accuracy-fairness tradeoffs.

Details

Motivation: Model fairness is becoming crucial in Trustworthy AI, but fairness has been understudied in class-incremental learning. Naive training causes unfair catastrophic forgetting for certain sensitive groups, where some groups experience disproportionate forgetting compared to others.

Method: The paper theoretically analyzes that forgetting occurs when average gradient vectors of current task data and sensitive groups have negative inner products. It proposes a fairness-aware sample weighting (FSW) framework that adjusts training weights to change gradient directions, reducing forgetting of underperforming groups. For various fairness measures, it formulates optimization problems to minimize overall losses while reducing disparities, solved via linear programming.

Result: Experiments show that FSW achieves better accuracy-fairness tradeoff results than state-of-the-art approaches on real datasets, demonstrating effectiveness in balancing accuracy and fairness in class-incremental learning.

Conclusion: The paper introduces a novel fairness-aware approach to class-incremental learning that addresses unfair catastrophic forgetting through gradient direction analysis and sample weighting optimization, advancing Trustworthy AI in continual learning scenarios.

Abstract: Model fairness is becoming important in class-incremental learning for Trustworthy AI. While accuracy has been a central focus in class-incremental learning, fairness has been relatively understudied. However, naively using all the samples of the current task for training results in unfair catastrophic forgetting for certain sensitive groups including classes. We theoretically analyze that forgetting occurs if the average gradient vector of the current task data is in an “opposite direction” compared to the average gradient vector of a sensitive group, which means their inner products are negative. We then propose a fair class-incremental learning framework that adjusts the training weights of current task samples to change the direction of the average gradient vector and thus reduce the forgetting of underperforming groups and achieve fairness. For various group fairness measures, we formulate optimization problems to minimize the overall losses of sensitive groups while minimizing the disparities among them. We also show the problems can be solved with linear programming and propose an efficient Fairness-aware Sample Weighting (FSW) algorithm. Experiments show that FSW achieves better accuracy-fairness tradeoff results than state-of-the-art approaches on real datasets.

[618] Trust-free Personalized Decentralized Learning

Yawen Li, Yan Li, Junping Du, Yingxia Shao, Meiyu Liang, Guanhua Ye

Main category: cs.LG

TL;DR: TPFed is a trust-free personalized decentralized federated learning framework that uses blockchain and LSH for secure peer selection and knowledge distillation without exposing local data.

Details

Motivation: There's a critical trade-off between customization and participant trust in personalized federated learning. Existing approaches rely on centralized coordinators or trusted peer groups, limiting applicability in open, trust-averse environments. Decentralized methods lack global scalability and robust mechanisms against malicious peers.

Method: TPFed replaces central aggregators with a blockchain-based bulletin board. Participants dynamically select global communication partners using Locality-Sensitive Hashing (LSH) and peer ranking. An “all-in-one” knowledge distillation protocol handles knowledge transfer, model quality evaluation, and similarity verification via a public reference dataset.

Result: Extensive experiments show TPFed significantly outperforms traditional federated baselines in both learning accuracy and system robustness against adversarial attacks.

Conclusion: TPFed enables secure, globally personalized collaboration without exposing local models or data, bridging the gap between customization and trust in decentralized federated learning environments.

Abstract: Personalized collaborative learning in federated settings faces a critical trade-off between customization and participant trust. Existing approaches typically rely on centralized coordinators or trusted peer groups, limiting their applicability in open, trust-averse environments. While recent decentralized methods explore anonymous knowledge sharing, they often lack global scalability and robust mechanisms against malicious peers. To bridge this gap, we propose TPFed, a \textit{Trust-free Personalized Decentralized Federated Learning} framework. TPFed replaces central aggregators with a blockchain-based bulletin board, enabling participants to dynamically select global communication partners based on Locality-Sensitive Hashing (LSH) and peer ranking. Crucially, we introduce an ``all-in-one’’ knowledge distillation protocol that simultaneously handles knowledge transfer, model quality evaluation, and similarity verification via a public reference dataset. This design ensures secure, globally personalized collaboration without exposing local models or data. Extensive experiments demonstrate that TPFed significantly outperforms traditional federated baselines in both learning accuracy and system robustness against adversarial attacks.

[619] Constraint Decoupled Latent Diffusion for Protein Backmapping

Xu Han, Yuancheng Sun, Kai Chen, Yuxuan Ren, Kang Liu, Qiwei Ye

Main category: cs.LG

TL;DR: CODLAD is a two-stage framework for reconstructing atomic details from coarse-grained protein structures using constraint-decoupled latent diffusion, achieving state-of-the-art accuracy, diversity, and efficiency.

Details

Motivation: Current backmapping approaches for reconstructing atomic details from coarse-grained protein structures face a trade-off between atomistic accuracy and conformational diversity, often requiring complex constraint handling or extensive refinement steps.

Method: Two-stage framework: (1) compresses atomic structures into discrete latent representations with explicit structural constraints, decoupling constraint handling from generation; (2) performs efficient denoising diffusion in latent space to produce structurally valid and diverse all-atom conformations.

Result: Comprehensive evaluations show CODLAD achieves state-of-the-art performance in atomistic accuracy, conformational diversity, and computational efficiency with strong generalization across different protein systems.

Conclusion: CODLAD provides an effective solution to the backmapping problem by decoupling structural constraints from generation through latent diffusion, enabling high-quality reconstruction of atomic details from coarse-grained protein structures.

Abstract: Coarse-grained (CG) molecular dynamics simulations enable efficient exploration of protein conformational ensembles. However, reconstructing atomic details from CG structures (backmapping) remains a challenging problem. Current approaches face an inherent trade-off between maintaining atomistic accuracy and exploring diverse conformations, often necessitating complex constraint handling or extensive refinement steps. To address these challenges, we introduce a novel two-stage framework, named CODLAD (COnstraint Decoupled LAtent Diffusion). This framework first compresses atomic structures into discrete latent representations, explicitly embedding structural constraints, thereby decoupling constraint handling from generation. Subsequently, it performs efficient denoising diffusion in this latent space to produce structurally valid and diverse all-atom conformations. Comprehensive evaluations on diverse protein datasets demonstrate that CODLAD achieves state-of-the-art performance in atomistic accuracy, conformational diversity, and computational efficiency while exhibiting strong generalization across different protein systems. Code is available at https://github.com/xiaoxiaokuye/CODLAD.

[620] Machine Unlearning using Forgetting Neural Networks

Amartya Hatua, Trung T. Nguyen, Filip Cano, Andrew H. Sung

Main category: cs.LG

TL;DR: First concrete implementation of Forgetting Neural Networks (FNNs) for targeted machine unlearning, using neuroscience-inspired multiplicative decay factors to systematically erase specific training data while preserving model performance.

Details

Motivation: Modern ML systems store vast personal data, creating privacy risks. There's a need for models to forget specific training data for privacy compliance and user trust, but existing unlearning methods lack interpretability and efficiency.

Method: Implements Forgetting Neural Networks with multiplicative decay factors inspired by neuroscience. Proposes variants with per-neuron forgetting factors, including rank-based assignments guided by activation levels. Evaluates on MNIST and Fashion-MNIST benchmarks.

Result: Successfully removes information associated with forget sets while preserving performance on retained data. Membership inference attacks confirm effective erasure of training data information. FNNs outperform existing methods in targeted unlearning.

Conclusion: FNNs establish a promising foundation for efficient and interpretable machine unlearning, offering a neuroscience-inspired approach that balances privacy preservation with model utility.

Abstract: Modern computer systems store vast amounts of personal data, enabling advances in AI and ML but risking user privacy and trust. For privacy reasons, it is sometimes desired for an ML model to forget part of the data it was trained on. In this paper, we introduce a novel unlearning approach based on Forgetting Neural Networks (FNNs), a neuroscience-inspired architecture that explicitly encodes forgetting through multiplicative decay factors. While FNNs had previously been studied as a theoretical construct, we provide the first concrete implementation and demonstrate their effectiveness for targeted unlearning. We propose several variants with per-neuron forgetting factors, including rank-based assignments guided by activation levels, and evaluate them on MNIST and Fashion-MNIST benchmarks. Our method systematically removes information associated with forget sets while preserving performance on retained data. Membership inference attacks confirm the effectiveness of FNN-based unlearning in erasing information about the training data from the neural network. These results establish FNNs as a promising foundation for efficient and interpretable unlearning.

[621] PearSAN: A Machine Learning Method for Inverse Design using Pearson Correlated Surrogate Annealing

Michael Bezick, Blake A. Wilson, Vaishnavi Iyer, Yuheng Chen, Vladimir M. Shalaev, Sabre Kais, Alexander V. Kildishev, Alexandra Boltasseva, Brad Lackey

Main category: cs.LG

TL;DR: PearSAN is a machine learning-assisted optimization algorithm for inverse design with large design spaces, using generative model latent spaces and Pearson correlation-based surrogate modeling to achieve 97% efficiency and order-of-magnitude speed improvements.

Details

Motivation: Traditional optimizers struggle with inverse design problems in large design spaces, particularly in applications like thermophotovoltaic metasurface design where matching working bands between thermal radiators and photovoltaic cells requires efficient optimization.

Method: PearSAN leverages the latent space of pretrained generative models (compatible with VQ-VAEs and binary autoencoders) for rapid sampling and employs a novel Pearson correlated surrogate model to predict the figure of merit. The Pearson correlational loss serves both as latent regularization (similar to batch/layer normalization) and as surrogate training loss.

Result: Achieves state-of-the-art maximum design efficiency of 97%, is at least an order of magnitude faster than previous methods, and shows improved maximum figure-of-merit gain. Outperforms previous energy matching losses which enforce poor regularization and performance.

Conclusion: PearSAN provides an effective machine learning-assisted optimization framework for inverse design problems with large design spaces, demonstrating superior efficiency, speed, and performance compared to existing methods through its novel Pearson correlation approach and generative model integration.

Abstract: PearSAN is a machine learning-assisted optimization algorithm applicable to inverse design problems with large design spaces, where traditional optimizers struggle. The algorithm leverages the latent space of a generative model for rapid sampling and employs a Pearson correlated surrogate model to predict the figure of merit of the true design metric. As a showcase example, PearSAN is applied to thermophotovoltaic (TPV) metasurface design by matching the working bands between a thermal radiator and a photovoltaic cell. PearSAN can work with any pretrained generative model with a discretized latent space, making it easy to integrate with VQ-VAEs and binary autoencoders. Its novel Pearson correlational loss can be used as both a latent regularization method, similar to batch and layer normalization, and as a surrogate training loss. We compare both to previous energy matching losses, which are shown to enforce poor regularization and performance, even with upgraded affine parameters. PearSAN achieves a state-of-the-art maximum design efficiency of 97%, and is at least an order of magnitude faster than previous methods, with an improved maximum figure-of-merit gain.

[622] HEART: Achieving Timely Multi-Model Training for Vehicle-Edge-Cloud-Integrated Hierarchical Federated Learning

Xiaohong Yang, Minghui Liwang, Xianbin Wang, Zhipeng Cheng, Seyyedali Hosseinalipour, Huaiyu Dai, Zhenzhen Jiao

Main category: cs.LG

TL;DR: HEART framework for multi-model training in dynamic vehicle-edge-cloud hierarchical federated learning minimizes global training latency while ensuring balanced training across tasks using hybrid synchronous-asynchronous aggregation and evolutionary-greedy allocation.

Details

Motivation: The growth of AI-enabled IoV requires efficient ML solutions for high vehicular mobility and decentralized data. Current VEC-HFL approaches don't adequately address multi-model training challenges: improper aggregation causes model obsolescence, mobility prevents efficient data utilization, and unbalanced resource allocation affects collaborative training effectiveness.

Method: Proposes HEART framework with hybrid synchronous-asynchronous aggregation rule. Uses two-stage approach: 1) balanced task scheduling via hybrid heuristic combining improved PSO and GA, 2) low-complexity greedy algorithm for training priority of assigned tasks on vehicles.

Result: Experiments on real-world datasets demonstrate HEART’s superiority over existing methods in minimizing global training latency while ensuring balanced training across multiple tasks.

Conclusion: HEART effectively addresses multi-model training challenges in dynamic VEC-HFL environments, providing a practical solution for efficient collaborative learning in vehicle-edge-cloud architectures with vehicular mobility constraints.

Abstract: The rapid growth of AI-enabled Internet of Vehicles (IoV) calls for efficient machine learning (ML) solutions that can handle high vehicular mobility and decentralized data. This has motivated the emergence of Hierarchical Federated Learning over vehicle-edge-cloud architectures (VEC-HFL). Nevertheless, one aspect which is underexplored in the literature on VEC-HFL is that vehicles often need to execute multiple ML tasks simultaneously, where this multi-model training environment introduces crucial challenges. First, improper aggregation rules can lead to model obsolescence and prolonged training times. Second, vehicular mobility may result in inefficient data utilization by preventing the vehicles from returning their models to the network edge. Third, achieving a balanced resource allocation across diverse tasks becomes of paramount importance as it majorly affects the effectiveness of collaborative training. We take one of the first steps towards addressing these challenges via proposing a framework for multi-model training in dynamic VEC-HFL with the goal of minimizing global training latency while ensuring balanced training across various tasks-a problem that turns out to be NP-hard. To facilitate timely model training, we introduce a hybrid synchronous-asynchronous aggregation rule. Building on this, we present a novel method called Hybrid Evolutionary And gReedy allocaTion (HEART). The framework operates in two stages: first, it achieves balanced task scheduling through a hybrid heuristic approach that combines improved Particle Swarm Optimization (PSO) and Genetic Algorithms (GA); second, it employs a low-complexity greedy algorithm to determine the training priority of assigned tasks on vehicles. Experiments on real-world datasets demonstrate the superiority of HEART over existing methods.

[623] Dictionary Learning: The Complexity of Learning Sparse Superposed Features with Feedback

Akash Kumar

Main category: cs.LG

TL;DR: The paper investigates whether learned features in deep networks can be efficiently retrieved using feedback from an agent (like an LLM) through triplet comparisons, establishing tight bounds for feedback complexity in sparse settings.

Details

Motivation: Deep networks succeed by capturing latent features, but it's unclear if these learned features can be efficiently retrieved through agent feedback. The paper aims to understand the feedback complexity needed to recover feature representations using relative comparisons.

Method: Theoretical analysis of feedback complexity for learning feature matrices using triplet comparisons from an agent. Two scenarios: when agent can construct activations (tight bounds) and when limited to distributional information (strong upper bounds). Experimental validation on feature recovery from Recursive Feature Machines and dictionary extraction from sparse autoencoders trained on LLMs.

Result: Established tight bounds for feature matrix learning when agent can construct activations, and strong upper bounds for sparse scenarios with distributional feedback. Experimental results validated theoretical findings on two applications.

Conclusion: Learned features in deep networks can be efficiently retrieved through agent feedback using triplet comparisons, with theoretical guarantees on feedback complexity in sparse settings, demonstrated across different feature learning applications.

Abstract: The success of deep networks is crucially attributed to their ability to capture latent features within a representation space. In this work, we investigate whether the underlying learned features of a model can be efficiently retrieved through feedback from an agent, such as a large language model (LLM), in the form of relative \tt{triplet comparisons}. These features may represent various constructs, including dictionaries in LLMs or a covariance matrix of Mahalanobis distances. We analyze the feedback complexity associated with learning a feature matrix in sparse settings. Our results establish tight bounds when the agent is permitted to construct activations and demonstrate strong upper bounds in sparse scenarios when the agent’s feedback is limited to distributional information. We validate our theoretical findings through experiments on two distinct applications: feature recovery from Recursive Feature Machines and dictionary extraction from sparse autoencoders trained on Large Language Models.

[624] Machine Unlearning via Information Theoretic Regularization

Shizhou Xu, Thomas Strohmer

Main category: cs.LG

TL;DR: A unified information-theoretic framework for machine unlearning that addresses both data point removal and feature removal with provable guarantees and practical applications.

Details

Motivation: Need to effectively remove undesirable information (specific features or individual data points) from learning outcomes while minimizing utility loss and ensuring rigorous guarantees.

Method: Information-theoretic regularization framework with Marginal Unlearning Principle for data point removal, and unified analytic solution for feature unlearning in deep learning with arbitrary training objectives.

Result: Provides formal information-theoretic unlearning definitions, provable guarantees on sufficiency/necessity of marginal unlearning, natural solutions to unlearning problems, and reveals connections between machine unlearning, information theory, optimal transport, and extremal sigma algebras.

Conclusion: The framework offers a highly adaptable and practical approach for machine unlearning applications with strong theoretical foundations and empirical support.

Abstract: How can we effectively remove or ‘‘unlearn’’ undesirable information, such as specific features or the influence of individual data points, from a learning outcome while minimizing utility loss and ensuring rigorous guarantees? We introduce a unified mathematical framework based on information-theoretic regularization to address both data point unlearning and feature unlearning. For data point unlearning, we introduce the $\textit{Marginal Unlearning Principle}$, an auditable and provable framework inspired by memory suppression studies in neuroscience. Moreover, we provide formal information-theoretic unlearning definition based on the proposed principle, named marginal unlearning, and provable guarantees on sufficiency and necessity of marginal unlearning to the existing approximate unlearning definitions. We then show the proposed framework provide natural solution to the marginal unlearning problems. For feature unlearning, the framework applies to deep learning with arbitrary training objectives. By combining flexibility in learning objectives with simplicity in regularization design, our approach is highly adaptable and practical for a wide range of machine learning and AI applications. From a mathematical perspective, we provide an unified analytic solution to the optimal feature unlearning problem with a variety of information-theoretic training objectives. Our theoretical analysis reveals intriguing connections between machine unlearning, information theory, optimal transport, and extremal sigma algebras. Numerical simulations support our theoretical finding.

[625] Tiled Flash Linear Attention: More Efficient Linear RNN and xLSTM Kernels

Maximilian Beck, Korbinian Pöppel, Phillip Lippe, Sepp Hochreiter

Main category: cs.LG

TL;DR: TFLA enables efficient linear RNN kernels with large chunk sizes, outperforming existing attention mechanisms in speed for long-context modeling.

Details

Motivation: Existing linear RNN kernels like FLA have limited chunk sizes, requiring many intermediate states in GPU memory, leading to low arithmetic intensity, high memory consumption, and IO costs for long-context pre-training.

Method: Tiled Flash Linear Attention (TFLA) introduces additional sequence parallelization within each chunk to enable arbitrary large chunk sizes and high arithmetic intensity. Applied to xLSTM with matrix memory (mLSTM) and proposing an mLSTM variant with sigmoid input gate and reduced computation.

Result: TFLA-based mLSTM kernels outperform highly optimized Flash Attention, Linear Attention, and Mamba kernels, setting new state-of-the-art for efficient long-context sequence modeling primitives.

Conclusion: TFLA enables practical realization of linear RNNs’ theoretical runtime advantages over Transformers by solving memory and efficiency limitations of previous approaches.

Abstract: Linear RNNs with gating recently demonstrated competitive performance compared to Transformers in language modeling. Although their linear compute scaling in sequence length offers theoretical runtime advantages over Transformers, realizing these benefits in practice requires optimized custom kernels, as Transformers rely on the highly efficient Flash Attention kernels (Dao, 2024). Leveraging the chunkwise-parallel formulation of linear RNNs, Flash Linear Attention (FLA) (Yang & Zhang, 2024) shows that linear RNN kernels are faster than Flash Attention, by parallelizing over chunks of the input sequence. However, since the chunk size of FLA is limited, many intermediate states must be materialized in GPU memory. This leads to low arithmetic intensity and causes high memory consumption and IO cost, especially for long-context pre-training. In this work, we present Tiled Flash Linear Attention (TFLA), a novel kernel algorithm for linear RNNs, that enables arbitrary large chunk sizes and high arithmetic intensity by introducing an additional level of sequence parallelization within each chunk. First, we apply TFLA to the xLSTM with matrix memory, the mLSTM (Beck et al., 2024). Second, we propose an mLSTM variant with sigmoid input gate and reduced computation for even faster kernel runtimes at equal language modeling performance. In our speed benchmarks, we show that our new mLSTM kernels based on TFLA outperform highly optimized Flash Attention, Linear Attention and Mamba kernels, setting a new state of the art for efficient long-context sequence modeling primitives.

[626] What Has a Foundation Model Found? Using Inductive Bias to Probe for World Models

Keyon Vafa, Peter G. Chang, Ashesh Rambachan, Sendhil Mullainathan

Main category: cs.LG

TL;DR: Foundation models can master training tasks but fail to develop proper inductive biases toward underlying world models, as shown by their inability to apply Newtonian mechanics to new physics tasks despite training on orbital trajectories.

Details

Motivation: While foundation models are based on the premise that sequence prediction can reveal deeper domain understanding (like Kepler's predictions leading to Newtonian mechanics), there's a need to evaluate whether they truly capture underlying structure rather than just learning surface patterns.

Method: Developed an inductive bias probe technique that evaluates foundation models by examining how they adapt to synthetic datasets generated from postulated world models, measuring whether the model’s inductive bias aligns with the underlying world model.

Result: Across multiple domains, foundation models excel at training tasks but fail to develop inductive biases toward the underlying world models when adapted to new tasks. Specifically, models trained on orbital trajectories consistently fail to apply Newtonian mechanics to new physics tasks.

Conclusion: Foundation models appear to develop task-specific heuristics that don’t generalize, rather than capturing deeper structural understanding, raising questions about whether current training approaches truly enable models to learn underlying principles like Newtonian mechanics.

Abstract: Foundation models are premised on the idea that sequence prediction can uncover deeper domain understanding, much like how Kepler’s predictions of planetary motion later led to the discovery of Newtonian mechanics. However, evaluating whether these models truly capture deeper structure remains a challenge. We develop a technique for evaluating foundation models that examines how they adapt to synthetic datasets generated from some postulated world model. Our technique measures whether the foundation model’s inductive bias aligns with the world model, and so we refer to it as an inductive bias probe. Across multiple domains, we find that foundation models can excel at their training tasks yet fail to develop inductive biases towards the underlying world model when adapted to new tasks. We particularly find that foundation models trained on orbital trajectories consistently fail to apply Newtonian mechanics when adapted to new physics tasks. Further analysis reveals that these models behave as if they develop task-specific heuristics that fail to generalize.

Youssef Tawfilis, Hossam Amer, Minar El-Aasser, Tallal Elshabrawy

Main category: cs.LG

TL;DR: A novel decentralized GAN training approach combining KLD-weighted clustered federated learning and heterogeneous U-shaped split learning to utilize distributed data and underutilized devices without sharing raw data.

Details

Motivation: Training generative models requires large datasets and computational resources, which are often unavailable due to privacy concerns, copyright restrictions, and the high cost of acquiring resources. Many underutilized devices (IoT/edge) with varying capabilities remain idle while unable to share data.

Method: Combines KLD-weighted Clustered Federated Learning to handle data heterogeneity and multi-domain datasets, with Heterogeneous U-Shaped split learning to address device heterogeneity under strict data sharing constraints (no labels or raw data shared).

Result: Achieves average 10% boost in classification metrics (up to 60% in multi-domain non-IID settings), 1.1x-3x higher image generation scores for MNIST datasets, and 2x-70x lower FID scores for higher resolution datasets.

Conclusion: The proposed approach successfully enables decentralized GAN training using distributed data and underutilized devices while maintaining strict data privacy, addressing key challenges of data and device heterogeneity in federated settings.

Abstract: Federated Learning has gained increasing attention for its ability to enable multiple nodes to collaboratively train machine learning models without sharing their raw data. At the same time, Generative AI – particularly Generative Adversarial Networks (GANs) – have achieved remarkable success across a wide range of domains, such as healthcare, security, and Image Generation. However, training generative models typically requires large datasets and significant computational resources, which are often unavailable in real-world settings. Acquiring such resources can be costly and inefficient, especially when many underutilized devices – such as IoT devices and edge devices – with varying capabilities remain idle. Moreover, obtaining large datasets is challenging due to privacy concerns and copyright restrictions, as most devices are unwilling to share their data. To address these challenges, we propose a novel approach for decentralized GAN training that enables the utilization of distributed data and underutilized, low-capability devices while not sharing data in its raw form. Our approach is designed to tackle key challenges in decentralized environments, combining KLD-weighted Clustered Federated Learning to address the issues of data heterogeneity and multi-domain datasets, with Heterogeneous U-Shaped split learning to tackle the challenge of device heterogeneity under strict data sharing constraints – ensuring that no labels or raw data, whether real or synthetic, are ever shared between nodes. Experiments show that our approach demonstrates significant improvements across key metrics, where it achieves an average 10% boost in classification metrics (up to 60% in multi-domain non-IID settings), 1.1x – 3x higher image generation scores for the MNIST family datasets, and 2x – 70x lower FID scores for higher resolution datasets. Find our code at https://github.com/youssefga28/HuSCF-GAN.

[628] Exploring Layer-wise Information Effectiveness for Post-Training Quantization in Small Language Models

He Xiao, Qingyao Yang, Dirui Xie, Wendong Xu, Zunhai Su, Runming yang, Wenyong Zhou, Haobo Liu, Zhengwu Liu, Ngai Wong

Main category: cs.LG

TL;DR: LieQ is a hardware-native, metric-driven post-training quantization framework that enables extreme low-bit compression (sub-2-bit) for sub-8B language models while maintaining accuracy and preserving standard multiplication kernels.

Details

Motivation: Large language models are often over-provisioned with many layers contributing little unique information but dominating memory and energy footprint during inference, especially problematic for deployment on resource-constrained edge devices.

Method: LieQ uses layer-wise information effectiveness quantization with uniform bit-width within each layer but mixed precision across layers. It discovers correlation between layer-wise functional saliency and representational compactness, using a geometry-driven sensitivity proxy for automatic bit-width allocation without gradient updates or perplexity probing.

Result: At sub-2-bit compression, LieQ consistently reduces the large accuracy gap typically observed for naive 2-bit baselines on Qwen3 and LLaMA3.x families while retaining standard-kernel efficiency.

Conclusion: LieQ provides a practical path toward deploying small language models on resource-constrained edge devices by enabling extreme low-bit quantization without sacrificing accuracy or requiring complex inference-time formats.

Abstract: Large language models with billions of parameters are often over-provisioned: many layers contribute little unique information yet dominate the memory and energy footprint during inference. We present LieQ Layer-wise information effectiveness Quantization, a hardware-native, metric-driven post-training quantization framework that addresses the critical challenge of maintaining accuracy in sub-8B models, model parameters less than 8B, under extreme low-bit compression. LieQ keeps uniform bit-width within each layer while mixing precision across layers, preserving standard multiplication kernels and avoiding irregular memory access, codebooks, or irregular formats at inference time. Our method uncovers a strong correlation between layer-wise functional saliency and representational compactness, revealing that layers with higher training-induced energy concentration are functionally irreplaceable. Leveraging this insight, we propose a purely geometry-driven sensitivity proxy that enables automatic bit-width allocation under a target average-bit budget without expensive gradient updates or inference-based perplexity probing. At sub 2-bit, LieQ consistently reduces the large accuracy gap typically observed for naive 2-bit baselines on Qwen3 and LLaMA3.x families, while retaining standard-kernel efficiency. These properties make LieQ a practical path toward deploying small language models on resource-constrained edge devices. Code will available here: https://github.com/HeXiao-55/LieQ-official.git.

[629] Development of Crop Yield Estimation Model using Soil and Environmental Parameters

Nisar Ahmed, Hafiz Muhammad Shahzad Asif, Gulshan Saleem, Muhammad Usman Younus

Main category: cs.LG

TL;DR: A neural network ensemble model for tea yield prediction using environmental and soil parameters achieves high accuracy (R²=0.9461) for pre-harvest forecasting.

Details

Motivation: Crop yield varies significantly due to soil and environmental factors, requiring accurate pre-harvest prediction models for food security, particularly for tea production in Pakistan.

Method: Used 10-year monthly data from tea farms including temperature, humidity, rainfall, soil pH, pesticide usage, and labor expertise. Applied feature transformation and developed an ensemble neural network model to identify crucial parameters for yield prediction.

Result: The model achieved R-squared of 0.9461 and RMSE of 0.1204, demonstrating high accuracy in tea yield forecasting based on surface and environmental parameters.

Conclusion: The ensemble neural network model is effective for tea yield prediction and can be used for pre-harvest forecasting to support food security decisions.

Abstract: Crop yield is affected by various soil and environmental parameters and can vary significantly. Therefore, a crop yield estimation model which can predict pre-harvest yield is required for food security. The study is conducted on tea forms operating under National Tea Research Institute, Pakistan. The data is recorded on monthly basis for ten years period. The parameters collected are minimum and maximum temperature, humidity, rainfall, PH level of the soil, usage of pesticide and labor expertise. The design of model incorporated all of these parameters and identified the parameters which are most crucial for yield predictions. Feature transformation is performed to obtain better performing model. The designed model is based on an ensemble of neural networks and provided an R-squared of 0.9461 and RMSE of 0.1204 indicating the usability of the proposed model in yield forecasting based on surface and environmental parameters.

[630] RLinf: Flexible and Efficient Large-scale Reinforcement Learning via Macro-to-Micro Flow Transformation

Chao Yu, Yuanqing Wang, Zhen Guo, Hao Lin, Si Xu, Hongzhi Zang, Quanlu Zhang, Yongji Wu, Chunyang Zhu, Junhao Hu, Zixiao Huang, Mingjie Wei, Yuqing Xie, Ke Yang, Bo Dai, Zhexuan Xu, Jiakun Du, Xiangyuan Wang, Xu Fu, Letong Shi, Zhihao Liu, Kang Chen, Weilin Liu, Gang Liu, Boxun Li, Jianlei Yang, Zhi Yang, Guohao Dai, Yu Wang

Main category: cs.LG

TL;DR: RLinf is a high-performance RL training system that uses macro-to-micro flow transformation (M2Flow) to optimize RL workflows, achieving 1.07×-2.43× speedup over state-of-the-art systems.

Details

Motivation: Reinforcement learning workflows are heterogeneous and dynamic, leading to low hardware utilization and slow training on existing systems. The major roadblock is system flexibility.

Method: RLinf uses M2Flow paradigm to automatically break down high-level RL workflows at temporal and spatial dimensions, recomposing them into optimized execution flows. It employs adaptive communication, context switching, elastic pipelining, and profiling-guided scheduling.

Result: Extensive evaluations on reasoning RL and embodied RL tasks show RLinf consistently outperforms state-of-the-art systems with 1.07×-2.43× speedup in end-to-end training throughput.

Conclusion: RLinf demonstrates that addressing system flexibility through M2Flow transformation enables high-performance RL training with significant throughput improvements across diverse RL tasks.

Abstract: Reinforcement learning (RL) has demonstrated immense potential in advancing artificial general intelligence, agentic intelligence, and embodied intelligence. However, the inherent heterogeneity and dynamicity of RL workflows often lead to low hardware utilization and slow training on existing systems. In this paper, we present RLinf, a high-performance RL training system based on our key observation that the major roadblock to efficient RL training lies in system flexibility. To maximize flexibility and efficiency, RLinf is built atop a novel RL system design paradigm called macro-to-micro flow transformation (M2Flow), which automatically breaks down high-level, easy-to-compose RL workflows at both the temporal and spatial dimensions, and recomposes them into optimized execution flows. Supported by RLinf worker’s adaptive communication capability, we devise context switching and elastic pipelining to realize M2Flow transformation, and a profiling-guided scheduling policy to generate optimal execution plans. Extensive evaluations on both reasoning RL and embodied RL tasks demonstrate that RLinf consistently outperforms state-of-the-art systems, achieving $1.07\times-2.43\times$ speedup in end-to-end training throughput.

[631] Contextual Causal Bayesian Optimisation

Vahan Arsenyan, Antoine Grosnit, Haitham Bou-Ammar, Arnak Dalalyan

Main category: cs.LG

TL;DR: Unified framework for contextual and causal Bayesian optimization that designs intervention policies to maximize target variable expectation, combining contextual information with causal graphs.

Details

Motivation: Existing approaches (Causal Bayesian Optimization and Contextual Bayesian Optimization) are distinct and have limitations in scenarios yielding suboptimal results. There's a need to unify these approaches and address their shortcomings.

Method: Proposes a novel algorithm that jointly optimizes over policies and the sets of variables on which these policies are defined, leveraging both observed contextual information and known causal graph structures.

Result: Derives worst-case and instance-dependent high-probability regret bounds. Experimental results across diverse environments show the approach achieves sublinear regret and reduces sample complexity in high-dimensional settings.

Conclusion: The framework successfully unifies Causal and Contextual Bayesian Optimization, addressing their limitations while providing theoretical guarantees and practical improvements in sample efficiency.

Abstract: We introduce a unified framework for contextual and causal Bayesian optimisation, which aims to design intervention policies maximising the expectation of a target variable. Our approach leverages both observed contextual information and known causal graph structures to guide the search. Within this framework, we propose a novel algorithm that jointly optimises over policies and the sets of variables on which these policies are defined. This thereby extends and unifies two previously distinct approaches: Causal Bayesian Optimisation and Contextual Bayesian Optimisation, while also addressing their limitations in scenarios that yield suboptimal results. We derive worst-case and instance-dependent high-probability regret bounds for our algorithm. We report experimental results across diverse environments, corroborating that our approach achieves sublinear regret and reduces sample complexity in high-dimensional settings.

[632] Sequential learning on a Tensor Network Born machine with Trainable Token Embedding

Wanda Hou, Miao Li, Yi-Zhuang You

Main category: cs.LG

TL;DR: The paper introduces trainable POVM embeddings for Born machines, replacing static tensor indices with learnable quantum measurement operators to enhance expressiveness and performance on RNA sequence modeling.

Details

Motivation: Traditional Born machines use static tensor indices for token representation, limiting their expressiveness. The authors aim to enhance quantum-inspired generative models by introducing trainable embeddings that can better capture complex data correlations.

Method: Proposes trainable token embeddings using positive operator valued measurements (POVMs) in Born machines. Encodes tokens as quantum measurement operators with learnable parameters, and uses QR decomposition to adjust physical dimensions of matrix product states, maximizing operator space utilization.

Result: On RNA data, the method significantly reduces negative log likelihood compared to one-hot embeddings. Higher physical dimensions improve single-site probabilities and multi-site correlations. Outperforms GPT2 in single-site estimation and achieves competitive correlation modeling.

Conclusion: Trainable POVM embeddings enhance Born machines’ expressiveness and performance, demonstrating strong potential for complex data correlation modeling in quantum-inspired sequence generation tasks.

Abstract: Generative models aim to learn the probability distributions underlying data, enabling the generation of new, realistic samples. Quantum inspired generative models, such as Born machines based on the matrix product state framework, have demonstrated remarkable capabilities in unsupervised learning tasks. This study advances the Born machine paradigm by introducing trainable token embeddings through positive operator valued measurements, replacing the traditional approach of static tensor indices. Key technical innovations include encoding tokens as quantum measurement operators with trainable parameters and leveraging QR decomposition to adjust the physical dimensions of the MPS. This approach maximizes the utilization of operator space and enhances the model’s expressiveness. Empirical results on RNA data demonstrate that the proposed method significantly reduces negative log likelihood compared to one hot embeddings, with higher physical dimensions further enhancing single site probabilities and multi site correlations. The model also outperforms GPT2 in single site estimation and achieves competitive correlation modeling, showcasing the potential of trainable POVM embeddings for complex data correlations in quantum inspired sequence modeling.

[633] Revisiting the Last-Iterate Convergence of Stochastic Gradient Methods

Zijian Liu, Zhengyuan Zhou

Main category: cs.LG

TL;DR: This paper provides a unified framework for proving last-iterate convergence of stochastic gradient methods under general conditions including non-compact domains, composite objectives, non-Euclidean norms, and heavy-tailed noise.

Details

Motivation: Existing last-iterate convergence results for SGD have restrictive assumptions: limited to compact domains, require bounded noise, focus on non-smooth problems, use non-composite objectives, and rely on Euclidean norms. There's a need for a unified theory that removes these limitations.

Method: The authors develop a unified analytical framework to prove last-iterate convergence rates for stochastic gradient methods. Their approach accommodates general domains, composite objectives, non-Euclidean norms, Lipschitz conditions, smoothness, and (strong) convexity simultaneously.

Result: They provide first unified convergence rates both in expectation and in high probability, extending analysis to handle heavy-tailed noise. The framework covers previously unaddressed cases including smooth optimization, composite objectives, and non-Euclidean settings.

Conclusion: This work significantly expands the theoretical understanding of last-iterate convergence for stochastic gradient methods by removing restrictive assumptions and providing a comprehensive framework that handles diverse optimization scenarios including heavy-tailed noise.

Abstract: In the past several years, the last-iterate convergence of the Stochastic Gradient Descent (SGD) algorithm has triggered people’s interest due to its good performance in practice but lack of theoretical understanding. For Lipschitz convex functions, different works have established the optimal $O(\log(1/δ)\log T/\sqrt{T})$ or $O(\sqrt{\log(1/δ)/T})$ high-probability convergence rates for the final iterate, where T is the time horizon and δis the failure probability. However, to prove these bounds, all the existing works are either limited to compact domains or require almost surely bounded noise. It is natural to ask whether the last iterate of SGD can still guarantee the optimal convergence rate but without these two restrictive assumptions. Besides this important question, there are still lots of theoretical problems lacking an answer. For example, compared with the last-iterate convergence of SGD for non-smooth problems, only few results for smooth optimization have yet been developed. Additionally, the existing results are all limited to a non-composite objective and the standard Euclidean norm. It still remains unclear whether the last-iterate convergence can be provably extended to wider composite optimization and non-Euclidean norms. In this work, to address the issues mentioned above, we revisit the last-iterate convergence of stochastic gradient methods and provide the first unified way to prove the convergence rates both in expectation and in high probability to accommodate general domains, composite objectives, non-Euclidean norms, Lipschitz conditions, smoothness, and (strong) convexity simultaneously. Additionally, we extend our analysis to obtain the last-iterate convergence under heavy-tailed noise.

[634] Early-stopping for Transformer model training

Jing He, Hua Jiang, Cheng Li, Siqian Xin, Shuzhen Yang

Main category: cs.LG

TL;DR: Novel early-stopping strategy for Transformers using Random Matrix Theory to analyze attention matrix spectra and identify training stages without validation data.

Details

Motivation: Current early-stopping methods often rely on validation sets, which may not be available or optimal. The paper aims to develop validation-set-free criteria for monitoring Transformer training dynamics using spectral analysis of attention matrices.

Method: Uses Random Matrix Theory to analyze the spectral density of shallow self-attention matrix V. Applies Power Law fit to attention matrices as a probe to identify three training stages. Proposes two criteria: quantitative metric for heavy-tailed dynamics and spectral signature for convergence.

Result: Empirical observation that attention matrix spectra consistently evolve into heavy-tailed distributions. Strong alignment between the proposed RMT-based criteria demonstrates utility for monitoring Transformer training progression without validation data.

Conclusion: Random Matrix Theory provides effective tools for diagnosing Transformer training dynamics, enabling validation-set-free early stopping through spectral analysis of attention matrices and identification of characteristic training stages.

Abstract: This work, based on Random Matrix Theory (RMT), introduces a novel early-stopping strategy for Transformer training dynamics. Utilizing the Power Law (PL) fit to tansformer attention matrices as a probe, we demarcate training into three stages: structural exploration, heavy-tailed structure stabilization, and convergence saturation. Empirically, we observe that the spectral density of the shallow self-attention matrix $V$ consistently evolves into a heavy-tailed distribution. Crucially, we propose two consistent and validation-set-free criteria: a quantitative metric for heavy-tailed dynamics and a novel spectral signature indicative of convergence. The strong alignment between these criteria highlights the utility of RMT for monitoring and diagnosing the progression of Transformer model training.

[635] A Survey of Reinforcement Learning from Human Feedback

Timo Kaufmann, Paul Weng, Viktor Bengs, Eyke Hüllermeier

Main category: cs.LG

TL;DR: Survey paper providing comprehensive overview of Reinforcement Learning from Human Feedback (RLHF), covering fundamentals, algorithms, applications across domains including robotics and LLMs, and research trends.

Details

Motivation: RLHF offers promising approach to enhance AI system performance and adaptability while improving alignment with human values, as demonstrated by its decisive role in training large language models. The field is rapidly growing and needs clear understanding for researchers and practitioners.

Method: Survey methodology covering RLHF fundamentals, exploring how RL agents interact with human feedback, examining core principles, algorithm-human feedback integration, and research trends across multiple domains including control/robotics and LLMs.

Result: Comprehensive overview of RLHF field with dedicated coverage of control/robotics (where fundamental techniques originate) and LLM applications, providing clear understanding of this rapidly growing interdisciplinary field at intersection of AI and human-computer interaction.

Conclusion: RLHF represents crucial approach for aligning AI systems with human values, with demonstrated success in LLMs and broader applications in robotics and control systems, requiring continued research and understanding across domains.

Abstract: Reinforcement learning from human feedback (RLHF) is a variant of reinforcement learning (RL) that learns from human feedback instead of relying on an engineered reward function. Building on prior work on the related setting of preference-based reinforcement learning (PbRL), it stands at the intersection of artificial intelligence and human-computer interaction. This positioning provides a promising approach to enhance the performance and adaptability of intelligent systems while also improving the alignment of their objectives with human values. The success in training large language models (LLMs) has impressively demonstrated this potential in recent years, where RLHF has played a decisive role in directing the model’s capabilities towards human objectives. This article provides an overview of the fundamentals of RLHF, exploring how RL agents interact with human feedback. While recent focus has been on RLHF for LLMs, our survey covers the technique across multiple domains. We provide our most comprehensive coverage in control and robotics, where many fundamental techniques originate, alongside a dedicated LLM section. We examine the core principles that underpin RLHF, how algorithms and human feedback work together, and the main research trends in the field. Our goal is to give researchers and practitioners a clear understanding of this rapidly growing field.

[636] Efficient Offline Reinforcement Learning: First Imitate, then Improve

Adam Jelley, Trevor McInroe, Sam Devlin, Amos Storkey

Main category: cs.LG

TL;DR: Pre-training with supervised learning before off-policy RL improves efficiency and stability in offline RL

Details

Motivation: Supervised imitation learning is efficient but limited by dataset quality, while off-policy RL can improve beyond the behavior policy but suffers from computational inefficiency and instability due to temporal-difference bootstrapping.

Method: Pre-train actor with behavior cloning and critic with supervised Monte-Carlo value error before applying off-policy reinforcement learning, creating a hybrid approach that combines supervised learning efficiency with RL’s ability to improve beyond the dataset.

Result: Substantially improved training time of popular off-policy algorithms on standard benchmarks while achieving greater stability compared to pure off-policy RL approaches.

Conclusion: The proposed hybrid approach of pre-training with supervised learning before off-policy RL offers a best-of-both-worlds solution, combining the efficiency and stability of supervised learning with the performance improvement potential of reinforcement learning.

Abstract: Supervised imitation-based approaches are often favored over off-policy reinforcement learning approaches for learning policies offline, since their straightforward optimization objective makes them computationally efficient and stable to train. However, their performance is fundamentally limited by the behavior policy that collected the dataset. Off-policy reinforcement learning provides a promising approach for improving on the behavior policy, but training is often computationally inefficient and unstable due to temporal-difference bootstrapping. In this paper, we propose a best-of-both approach by pre-training with supervised learning before improving performance with off-policy reinforcement learning. Specifically, we demonstrate improved efficiency by pre-training an actor with behavior cloning and a critic with a supervised Monte-Carlo value error. We find that we are able to substantially improve the training time of popular off-policy algorithms on standard benchmarks, and also achieve greater stability. Code is available at: https://github.com/AdamJelley/EfficientOfflineRL

[637] Enhanced $H$-Consistency Bounds

Anqi Mao, Mehryar Mohri, Yutao Zhong

Main category: cs.LG

TL;DR: The paper presents a general framework for deriving enhanced H-consistency bounds by relaxing previous restrictive conditions, enabling more favorable finite-sample guarantees for various classification and ranking scenarios.

Details

Motivation: Previous H-consistency bounds required restrictive conditions where lower bounds of surrogate loss conditional regret had to be convex functions of target conditional regret without non-constant factors. The authors seek to derive finer and more favorable bounds by relaxing these conditions.

Method: The authors develop a general framework based on more general inequalities relating conditional regrets, relaxing the previous requirement of convex functions without non-constant factors. This framework allows for establishing enhanced H-consistency bounds.

Result: The new theorems subsume existing results as special cases and enable derivation of more favorable bounds for standard multi-class classification, binary/multi-class classification under Tsybakov noise conditions, and bipartite ranking.

Conclusion: The proposed framework successfully relaxes previous restrictive conditions and provides enhanced H-consistency bounds that are more favorable across various machine learning scenarios, advancing the theoretical understanding of surrogate loss guarantees.

Abstract: Recent research has introduced a key notion of $H$-consistency bounds for surrogate losses. These bounds offer finite-sample guarantees, quantifying the relationship between the zero-one estimation error (or other target loss) and the surrogate loss estimation error for a specific hypothesis set. However, previous bounds were derived under the condition that a lower bound of the surrogate loss conditional regret is given as a convex function of the target conditional regret, without non-constant factors depending on the predictor or input instance. Can we derive finer and more favorable $H$-consistency bounds? In this work, we relax this condition and present a general framework for establishing enhanced $H$-consistency bounds based on more general inequalities relating conditional regrets. Our theorems not only subsume existing results as special cases but also enable the derivation of more favorable bounds in various scenarios. These include standard multi-class classification, binary and multi-class classification under Tsybakov noise conditions, and bipartite ranking.

[638] Transferring Causal Effects using Proxies

Manuel Iglesias-Alonso, Felix Schur, Julius von Kügelgen, Jonas Peters

Main category: cs.LG

TL;DR: Methodology for estimating causal effects in multi-domain settings with unobserved confounders using proxy variables, with identifiability proofs and estimation techniques.

Details

Motivation: Need to estimate causal effects when confounders are unobserved and effects vary across domains, with only proxy variables available in target domains.

Method: Proposed methodology using proxy variables for hidden confounders, with two estimation techniques, identifiability proofs, and confidence interval derivation.

Result: Proved identifiability even for continuous treatment/response variables, developed consistent estimators with confidence intervals, validated through simulations and real-world website ranking study.

Conclusion: The approach enables causal effect estimation in multi-domain settings with unobserved confounders using proxy variables, with theoretical guarantees and practical applicability.

Abstract: We consider the problem of estimating a causal effect in a multi-domain setting. The causal effect of interest is confounded by an unobserved confounder and can change between the different domains. We assume that we have access to a proxy of the hidden confounder and that all variables are discrete or categorical. We propose methodology to estimate the causal effect in the target domain, where we assume to observe only the proxy variable. Under these conditions, we prove identifiability (even when treatment and response variables are continuous). We introduce two estimation techniques, prove consistency, and derive confidence intervals. The theoretical results are supported by simulation studies and a real-world example studying the causal effect of website rankings on consumer choices.

[639] GINTRIP: Interpretable Temporal Graph Regression using Information bottleneck and Prototype-based method

Ali Royat, Seyed Mohamad Moghadas, Lesley De Cruz, Adrian Munteanu

Main category: cs.LG

TL;DR: GINTRIP framework combines Information Bottleneck and prototype methods to enhance interpretability of temporal graph regression models, achieving better accuracy and interpretability metrics.

Details

Motivation: Temporal GNNs lack interpretability methods despite complex spatio-temporal patterns in graph data. No existing work combines prototype-based methods with Information Bottleneck principles for temporal graph tasks.

Method: Proposes GINTRIP framework integrating IB principles and prototype methods. Derives novel theoretical bound on mutual information for graph regression. Incorporates unsupervised auxiliary classification head using multi-task learning for diverse concept representation.

Result: Outperforms existing methods on real-world datasets (traffic and crime) in both forecasting accuracy (MAE, RMSE, MAPE) and interpretability metrics (fidelity).

Conclusion: First combined application of IB and prototype methods for interpretable temporal graph tasks, providing enhanced interpretability while maintaining or improving predictive performance.

Abstract: Deep neural networks (DNNs) have demonstrated remarkable performance across various domains, but their inherent complexity makes them challenging to interpret. This is especially true for temporal graph regression tasks due to the complex underlying spatio-temporal patterns in the graph. While interpretability concerns in Graph Neural Networks (GNNs) mirror those of DNNs, no notable work has addressed the interpretability of temporal GNNs to the best of our knowledge. Innovative methods, such as prototypes, aim to make DNN models more interpretable. However, a combined approach based on prototype-based methods and Information Bottleneck (IB) principles has not yet been developed for temporal GNNs. Our research introduces a novel approach that uniquely integrates these techniques to enhance the interpretability of temporal graph regression models. The key contributions of our work are threefold: We introduce the Graph INterpretability in Temporal Regression task using Information bottleneck and Prototype (GINTRIP) framework, the first combined application of IB and prototype-based methods for interpretable temporal graph tasks. We derive a novel theoretical bound on mutual information (MI), extending the applicability of IB principles to graph regression tasks. We incorporate an unsupervised auxiliary classification head, fostering diverse concept representation using multi-task learning, which enhances the model’s interpretability. Our model is evaluated on real-world datasets like traffic and crime, outperforming existing methods in both forecasting accuracy and interpretability-related metrics such as MAE, RMSE, MAPE, and fidelity.

[640] Preconditioning for Accelerated Gradient Descent Optimization and Regularization

Qiang Ye

Main category: cs.LG

TL;DR: The paper provides a unified mathematical framework explaining how adaptive learning rates, normalization methods, and regularization techniques work through the lens of Hessian conditioning and preconditioning theory.

Details

Motivation: Accelerated training algorithms like adaptive learning rates and normalization methods are widely used but not fully understood. When regularization is introduced, standard optimizers may not perform effectively, raising questions about how to properly combine regularization with preconditioning.

Method: The authors use the theory of preconditioning to analyze acceleration techniques: (1) explaining how AdaGrad, RMSProp, and Adam accelerate training through Hessian conditioning improvement; (2) exploring the interaction between L2-regularization and preconditioning, showing AdamW selects intrinsic parameters for regularization; (3) demonstrating how normalization methods accelerate training by improving Hessian conditioning.

Result: The analysis provides a unified mathematical framework that explains various acceleration techniques and derives appropriate regularization schemes, including showing that AdamW amounts to selecting intrinsic parameters for regularization and generalizing to L1-regularization.

Conclusion: The paper offers a comprehensive theoretical understanding of how different training acceleration techniques work through the common mechanism of improving Hessian conditioning, providing guidance for properly combining regularization with preconditioning methods.

Abstract: Accelerated training algorithms, such as adaptive learning rates (or preconditioning) and various normalization methods, are widely used but not fully understood. When regularization is introduced, standard optimizers like adaptive learning rates may not perform effectively. This raises the need for alternative regularization approaches such as AdamW and the question of how to properly combine regularization with preconditioning. In this paper, we address these challenges using the theory of preconditioning as follows: (1) We explain how AdaGrad, RMSProp, and Adam accelerates training through improving Hessian conditioning; (2) We explore the interaction between $L_2$-regularization and preconditioning, demonstrating that AdamW amounts to selecting the underlying intrinsic parameters for regularization, and we derive a generalization for the $L_1$-regularization; and (3) We demonstrate how various normalization methods such as input data normalization, batch normalization, and layer normalization accelerate training by improving Hessian conditioning. Our analysis offers a unified mathematical framework for understanding various acceleration techniques or deriving appropriate regularization schemes.

[641] Improving the accuracy and generalizability of molecular property regression models with a substructure-substitution-rule-informed framework

Xiaoyu Fan, Lin Guo, Ruizhen Jia, Yang Tian, Zhihao Yang, Boxue Tian

Main category: cs.LG

TL;DR: MolRuleLoss improves molecular property prediction accuracy and generalizability by incorporating substructure substitution rules into model loss functions.

Details

Motivation: AI models for molecular property prediction often have poor accuracy in regression tasks and perform catastrophically poorly for out-of-distribution molecules, limiting their practical utility in drug discovery.

Method: MolRuleLoss framework incorporates partial derivative constraints for substructure substitution rules (SSRs) into molecular property regression models’ loss functions to enforce chemical knowledge about how molecular changes affect properties.

Result: Significant performance improvements across multiple tasks: 2.6-33.3% RMSE reduction for lipophilicity, solubility, and solvation-free energy; dramatic improvement for OOD molecules (RMSE from 29.507 to 0.007 for molecular weight); better generalizability for activity cliff and OOD molecules.

Conclusion: MolRuleLoss effectively boosts prediction accuracy and generalizability of molecular property regression models by incorporating chemical domain knowledge, supporting diverse applications in cheminformatics and AI-aided drug discovery.

Abstract: Artificial Intelligence (AI)-aided drug discovery is an active research field, yet AI models often exhibit poor accuracy in regression tasks for molecular property prediction, and perform catastrophically poorly for out-of-distribution (OOD) molecules. Here, we present MolRuleLoss, a substructure-substitution-rule-informed framework that improves the accuracy and generalizability of multiple molecular property regression models (MPRMs) such as GEM and UniMol for diverse molecular property prediction tasks. MolRuleLoss incorporates partial derivative constraints for substructure substitution rules (SSRs) into an MPRM’s loss function. When using GEM models for predicting lipophilicity, water solubility, and solvation-free energy (using lipophilicity, ESOL, and freeSolv datasets from MoleculeNet), the root mean squared error (RMSE) values with and without MolRuleLoss were 0.587 vs. 0.660, 0.777 vs. 0.798, and 1.252 vs. 1.877, respectively, representing 2.6-33.3% performance improvements. We show that both the number and the quality of SSRs contribute to the magnitude of prediction accuracy gains obtained upon adding MolRuleLoss to an MPRM. MolRuleLoss improved the generalizability of MPRMs for “activity cliff” molecules in a lipophilicity prediction task and improved the generalizability of MPRMs for OOD molecules in a melting point prediction task. In a molecular weight prediction task for OOD molecules, MolRuleLoss reduced the RMSE value of a GEM model from 29.507 to 0.007. We also provide a formal demonstration that the upper bound of the variation for property change of SSRs is positively correlated with an MPRM’s error. Together, we show that using the MolRuleLoss framework as a bolt-on boosts the prediction accuracy and generalizability of multiple MPRMs, supporting diverse applications in areas like cheminformatics and AI-aided drug discovery.

[642] Communication-Efficient Federated Learning under Dynamic Device Arrival and Departure: Convergence Analysis and Algorithm Design

Zhan-Lun Chang, Dong-Jun Han, Seyyedali Hosseinalipour, Mung Chiang, Christopher G. Brinton

Main category: cs.LG

TL;DR: A federated learning method that handles dynamic device sets by using gradient-similarity-weighted averaging of previous global models to accelerate adaptation when devices join/leave.

Details

Motivation: Real-world FL scenarios involve devices dynamically joining/leaving due to mobility patterns or handovers, which creates challenges: (1) the optimization objective evolves with active devices, unlike traditional FL's static objective, and (2) current global models may not serve as effective initialization for subsequent rounds, hindering adaptation and convergence.

Method: First provides convergence analysis for FL under dynamic device sets, accounting for gradient noise, local training iterations, and data heterogeneity. Then proposes a model initialization algorithm that computes weighted average of previous global models guided by gradient similarity to prioritize models trained on data distributions aligned with current device set.

Result: Experiments show the approach achieves convergence speedups typically an order of magnitude or more compared to baselines, drastically reducing energy consumption to reach target accuracy.

Conclusion: The proposed plug-and-play algorithm enables rapid adaptation to dynamic device sets, accelerates recovery from distribution shifts, and integrates seamlessly with existing FL methods while significantly improving convergence speed and energy efficiency.

Abstract: Most federated learning (FL) approaches assume a fixed device set. However, real-world scenarios often involve devices dynamically joining or leaving the system, driven by, e.g., user mobility patterns or handovers across cell boundaries. This dynamic setting introduces unique challenges: (1) the optimization objective evolves with the active device set, unlike traditional FL’s static objective; and (2) the current global model may no longer serve as an effective initialization for subsequent rounds, potentially hindering adaptation, delaying convergence, and reducing resource efficiency. To address these challenges, we first provide a convergence analysis for FL under a dynamic device set, accounting for factors such as gradient noise, local training iterations, and data heterogeneity. Building on this analysis, we propose a model initialization algorithm that enables rapid adaptation whenever devices join or leave the network. Our key idea is to compute a weighted average of previous global models, guided by gradient similarity, to prioritize models trained on data distributions that closely align with the current device set, thereby accelerating recovery from distribution shifts in fewer training rounds. This plug-and-play algorithm is designed to integrate seamlessly with existing FL methods, offering broad applicability. Experiments demonstrate that our approach achieves convergence speedups typically an order of magnitude or more compared to baselines, which we show drastically reduces energy consumption to reach a target accuracy.

[643] On the Convergence Theory of Pipeline Gradient-based Analog In-memory Training

Zhaoxian Wu, Quan Xiao, Tayfun Gokmen, Hsinyu Tsai, Kaoutar El Maghraoui, Tianyi Chen

Main category: cs.LG

TL;DR: Analog-SGD-AP converges with O(ε⁻²+ε⁻¹) complexity despite analog hardware imperfections and stale weights from asynchronous pipelining, matching digital SGD performance.

Details

Motivation: AIMC accelerators offer energy-efficient DNN training by keeping weights in memory, but scaling presents challenges: expensive weight copying makes data parallelism inefficient, and analog hardware imperfections plus asynchronous pipeline staleness affect theoretical understanding.

Method: Theoretical analysis of Analog-SGD-AP (stochastic gradient descent on AIMC hardware with asynchronous pipeline) convergence properties for multi-layer DNN training, accounting for analog hardware imperfections and stale weight issues.

Result: Analog-SGD-AP converges with iteration complexity O(ε⁻²+ε⁻¹), matching complexities of digital SGD and Analog SGD with synchronous pipeline (except non-dominant O(ε⁻¹) term), showing asynchronous pipelining benefits AIMC training almost for free.

Conclusion: Asynchronous pipeline parallelism can be effectively used in AIMC training despite analog imperfections and stale weights, providing computational overlap benefits comparable to synchronous approaches with minimal theoretical overhead.

Abstract: Aiming to accelerate the training of large deep neural networks (DNN) in an energy-efficient way, analog in-memory computing (AIMC) emerges as a solution with immense potential. AIMC accelerator keeps model weights in memory without moving them from memory to processors during training, reducing overhead dramatically. Despite its efficiency, scaling up AIMC systems presents significant challenges. Since weight copying is expensive and inaccurate, data parallelism is less efficient on AIMC accelerators. It necessitates the exploration of pipeline parallelism, particularly asynchronous pipeline parallelism, which utilizes all available accelerators during the training process. This paper examines the convergence theory of stochastic gradient descent on AIMC hardware with an asynchronous pipeline (Analog-SGD-AP). Although there is empirical exploration of AIMC accelerators, the theoretical understanding of how analog hardware imperfections in weight updates affect the training of multi-layer DNN models remains underexplored. Furthermore, the asynchronous pipeline parallelism results in stale weights issues, which render the update signals no longer valid gradients. To close the gap, this paper investigates the convergence properties of Analog-SGD-AP on multi-layer DNN training. We show that the Analog-SGD-AP converges with iteration complexity $O(\varepsilon^{-2}+\varepsilon^{-1})$ despite the aforementioned issues, which matches the complexities of digital SGD and Analog SGD with synchronous pipeline, except the non-dominant term $O(\varepsilon^{-1})$. It implies that AIMC training benefits from asynchronous pipelining almost for free compared with the synchronous pipeline by overlapping computation.

[644] A Closer Look at Personalized Fine-Tuning in Heterogeneous Federated Learning

Minghui Chen, Hrad Ghoukasian, Ruinan Jin, Zehua Wang, Sai Praneeth Karimireddy, Xiaoxiao Li

Main category: cs.LG

TL;DR: LP-FT (Linear Probing followed by Fine-Tuning) adapts a centralized strategy to federated learning, outperforming standard personalized fine-tuning by better balancing personalization and generalization while mitigating federated feature distortion.

Details

Motivation: Federated Learning struggles with balancing global generalization and local personalization due to non-IID data distributions across clients. Standard Personalized Fine-Tuning often overfits to skewed client distributions or fails under domain shifts.

Method: Adapts Linear Probing followed by full Fine-Tuning (LP-FT) from centralized learning to FL setting. Uses phased parameter updates: first linear probing to preserve global features, then full fine-tuning for personalization.

Result: LP-FT demonstrates superiority across seven datasets and six PFT variants. Identifies “federated feature distortion” phenomenon and shows LP-FT mitigates it. Establishes conditions (partial feature overlap, covariate-concept shift) where LP-FT outperforms standard fine-tuning.

Conclusion: LP-FT offers a principled solution for robust personalization in FL, providing actionable guidelines for deployment by balancing personalization and generalization while mitigating feature distortion through phased parameter updates.

Abstract: Federated Learning (FL) enables decentralized, privacy-preserving model training but struggles to balance global generalization and local personalization due to non-identical data distributions across clients. Personalized Fine-Tuning (PFT), a popular post-hoc solution, fine-tunes the final global model locally but often overfits to skewed client distributions or fails under domain shifts. We propose adapting Linear Probing followed by full Fine-Tuning (LP-FT), a principled centralized strategy for alleviating feature distortion (Kumar et al., 2022), to the FL setting. Through systematic evaluation across seven datasets and six PFT variants, we demonstrate LP-FT’s superiority in balancing personalization and generalization. Our analysis uncovers federated feature distortion, a phenomenon where local fine-tuning destabilizes globally learned features, and theoretically characterizes how LP-FT mitigates this via phased parameter updates. We further establish conditions (e.g., partial feature overlap, covariate-concept shift) under which LP-FT outperforms standard fine-tuning, offering actionable guidelines for deploying robust personalization in FL.

[645] Epidemiology-informed Graph Neural Network for Heterogeneity-aware Epidemic Forecasting

Yufan Zheng, Wei Jiang, Tong Chen, Alexander Zhou, Nguyen Quoc Viet Hung, Choujun Zhan, Hongzhi Yin

Main category: cs.LG

TL;DR: HeatGNN is a novel epidemic forecasting framework that integrates epidemiology mechanistic models with graph neural networks to capture location-specific transmission mechanisms, addressing the limitation of existing methods that assume similar locations will have similar future infection patterns.

Details

Motivation: Existing spatio-temporal graph neural networks for epidemic forecasting oversimplify by assuming locations with similar observed features will have similar future infection numbers. In reality, epidemic diseases exhibit strong heterogeneity in intrinsic evolution mechanisms across locations and time due to factors like medical resource accessibility, virus mutations, and mobility patterns, which are often unobservable but crucial for accurate forecasting.

Method: HeatGNN binds epidemiology mechanistic models into GNNs to learn epidemiology-informed location embeddings that reflect location-specific transmission mechanisms over time. It computes time-varying mechanistic affinity graphs using these embeddings and designs a heterogeneous transmission graph network to encode mechanistic heterogeneity among locations, providing additional predictive signals for forecasting.

Result: Experiments on three benchmark datasets show that HeatGNN outperforms various strong baselines. Efficiency analysis verifies the real-world practicality of HeatGNN on datasets of different sizes.

Conclusion: HeatGNN successfully addresses the challenge of capturing mechanistic heterogeneity in epidemic forecasting by integrating epidemiology models with GNNs, leading to improved forecasting accuracy and demonstrating practical applicability across different dataset scales.

Abstract: Among various spatio-temporal prediction tasks, epidemic forecasting plays a critical role in public health management. Recent studies have demonstrated the strong potential of spatio-temporal graph neural networks (STGNNs) in extracting heterogeneous spatio-temporal patterns for epidemic forecasting. However, most of these methods bear an over-simplified assumption that two locations (e.g., cities) with similar observed features in previous time steps will develop similar infection numbers in the future. In fact, for any epidemic disease, there exists strong heterogeneity of its intrinsic evolution mechanisms across geolocation and time, which can eventually lead to diverged infection numbers in two ``similar’’ locations. However, such mechanistic heterogeneity is non-trivial to be captured due to the existence of numerous influencing factors like medical resource accessibility, virus mutations, mobility patterns, etc., most of which are spatio-temporal yet unreachable or even unobservable. To address this challenge, we propose a Heterogeneous Epidemic-Aware Transmission Graph Neural Network (HeatGNN), a novel epidemic forecasting framework. By binding the epidemiology mechanistic model into a GNN, HeatGNN learns epidemiology-informed location embeddings of different locations that reflect their own transmission mechanisms over time. With the time-varying mechanistic affinity graphs computed with the epidemiology-informed location embeddings, a heterogeneous transmission graph network is designed to encode the mechanistic heterogeneity among locations, providing additional predictive signals to facilitate accurate forecasting. Experiments on three benchmark datasets have revealed that HeatGNN outperforms various strong baselines. Moreover, our efficiency analysis verifies the real-world practicality of HeatGNN on datasets of different sizes.

[646] Beyond Fixed Tasks: Unsupervised Environment Design for Task-Level Pairs

Daniel Furelos-Blanco, Charles Pert, Frederik Kelbel, Alex F. Spies, Alessandra Russo, Michael Dennis

Main category: cs.LG

TL;DR: ATLAS is a novel method that generates joint autocurricula over tasks and levels to produce solvable yet challenging task-level pairs for reinforcement learning policy training.

Details

Motivation: Training general agents to follow complex instructions in intricate environments is challenging because random sampling of task-level pairs often produces unsolvable combinations, highlighting the need to co-design tasks and levels. Prior work in unsupervised environment design only considered fixed tasks.

Method: ATLAS builds upon unsupervised environment design (UED) to automatically generate joint autocurricula over both tasks and levels. The approach uses mutations that leverage the structure of both tasks and levels. The evaluation suite models tasks as reward machines in Minigrid levels.

Result: Experiments demonstrate that ATLAS vastly outperforms random sampling approaches, particularly when sampling solvable pairs is unlikely. Mutations leveraging the structure of both tasks and levels accelerate convergence to performant policies.

Conclusion: ATLAS successfully addresses the challenge of co-designing tasks and levels for reinforcement learning by generating joint autocurricula, showing significant improvements over baseline approaches and enabling more effective training of general agents.

Abstract: Training general agents to follow complex instructions (tasks) in intricate environments (levels) remains a core challenge in reinforcement learning. Random sampling of task-level pairs often produces unsolvable combinations, highlighting the need to co-design tasks and levels. While unsupervised environment design (UED) has proven effective at automatically designing level curricula, prior work has only considered a fixed task. We present ATLAS (Aligning Tasks and Levels for Autocurricula of Specifications), a novel method that generates joint autocurricula over tasks and levels. Our approach builds upon UED to automatically produce solvable yet challenging task-level pairs for policy training. To evaluate ATLAS and drive progress in the field, we introduce an evaluation suite that models tasks as reward machines in Minigrid levels. Experiments demonstrate that ATLAS vastly outperforms random sampling approaches, particularly when sampling solvable pairs is unlikely. We further show that mutations leveraging the structure of both tasks and levels accelerate convergence to performant policies.

[647] A large language model-type architecture for high-dimensional molecular potential energy surfaces

Xiao Zhu, Srinivasan S. Iyengar

Main category: cs.LG

TL;DR: A graph-based neural network algorithm inspired by large language models successfully computes high-dimensional potential energy surfaces for molecular systems, achieving sub-kcal/mol accuracy for a 186-dimensional system.

Details

Motivation: Computing high-dimensional potential energy surfaces for molecular systems is a major challenge in computational chemistry with important applications in predicting reaction rates and understanding molecular behavior.

Method: Represent molecular systems as graphs with nodes, edges, and faces; use interactions between these graph elements to construct potential energy surfaces via a family of neural networks designed for graph-theoretically obtained subsystems.

Result: The algorithm successfully computed a 51-dimensional potential energy surface and was then transformed to accurately predict a 186-dimensional surface with sub-kcal/mol accuracy, producing the first full-dimensional potential energy surface for protonated 21-water cluster at CCSD level accuracy.

Conclusion: The graph-based neural network approach, inspired by large language model architectures, provides an effective method for computing high-dimensional potential energy surfaces, enabling accurate predictions for complex molecular systems that were previously computationally challenging.

Abstract: Computing high-dimensional potential energy surfaces for molecular systems and materials is considered to be a great challenge in computational chemistry with potential impact in a range of areas including the fundamental prediction of reaction rates. In this paper, we design and discuss an algorithm that has similarities to large language models in generative AI and natural language processing. Specifically, we represent a molecular system as a graph which contains a set of nodes, edges, faces, etc. Interactions between these sets, which represent molecular subsystems in our case, are used to construct the potential energy surface for a reasonably sized chemical system with 51 nuclear dimensions. For this purpose, a family of neural networks that pertain to the graph-theoretically obtained subsystems get the job done for this 51 nuclear dimensional system. We then ask if this same family of lower-dimensional graph-based neural networks can be transformed to provide accurate predictions for a 186-dimensional potential energy surface. We find that our algorithm does provide accurate results for this larger-dimensional problem with sub-kcal/mol accuracy for the higher-dimensional potential energy surface problem. Indeed, as a result of these developments, here we produce the first efforts towards a full-dimensional potential energy surface for the protonated 21-water cluster (186 nuclear dimensions) at CCSD level accuracy.

[648] Expressive Temporal Specifications for Reward Monitoring

Omar Adalat, Francesco Belardinelli

Main category: cs.LG

TL;DR: Using quantitative Linear Temporal Logic (LTL_f[F]) to create dense reward monitors that outperform Boolean monitors for RL training efficiency.

Details

Motivation: Specifying informative and dense reward functions is challenging in RL but crucial for training efficiency. Current Boolean semantics lead to sparse rewards in long-horizon decision making, which hampers learning.

Method: Harness quantitative Linear Temporal Logic on finite traces (LTL_f[F]) to synthesize reward monitors that generate dense reward streams for observable state trajectories. The framework is algorithm-agnostic, relies only on state labeling, and accommodates non-Markovian properties.

Result: Quantitative monitors consistently subsume and often outperform Boolean monitors in maximizing task completion and reducing convergence time, depending on the environment.

Conclusion: Quantitative LTL_f[F] provides an effective framework for creating dense reward monitors that address sparse reward problems in RL, improving training efficiency and performance across various environments.

Abstract: Specifying informative and dense reward functions remains a pivotal challenge in Reinforcement Learning, as it directly affects the efficiency of agent training. In this work, we harness the expressive power of quantitative Linear Temporal Logic on finite traces (($\text{LTL}_f[\mathcal{F}]$)) to synthesize reward monitors that generate a dense stream of rewards for runtime-observable state trajectories. By providing nuanced feedback during training, these monitors guide agents toward optimal behaviour and help mitigate the well-known issue of sparse rewards under long-horizon decision making, which arises under the Boolean semantics dominating the current literature. Our framework is algorithm-agnostic and only relies on a state labelling function, and naturally accommodates specifying non-Markovian properties. Empirical results show that our quantitative monitors consistently subsume and, depending on the environment, outperform Boolean monitors in maximizing a quantitative measure of task completion and in reducing convergence time.

[649] Edge of Stochastic Stability: Revisiting the Edge of Stability for SGD

Arseniy Andreyev, Pierfrancesco Beneventano

Main category: cs.LG

TL;DR: Mini-batch SGD operates in a different regime called Edge of Stochastic Stability where batch sharpness stabilizes at 2/η instead of λ_max, explaining why smaller batches and larger steps lead to flatter minima.

Details

Motivation: Previous work showed full-batch GD Hessian eigenvalues stabilize at 2/η, but this doesn't apply to mini-batch SGD, limiting understanding of SGD's convergence and generalization properties.

Method: Introduce concept of Edge of Stochastic Stability (EoSS) regime for mini-batch SGD, analyze batch sharpness (expected directional curvature of mini-batch Hessians along stochastic gradients) instead of λ_max.

Result: Batch sharpness stabilizes at 2/η in EoSS regime, while λ_max is suppressed, explaining empirical observation that smaller batches and larger step sizes lead to flatter minima.

Conclusion: Mini-batch SGD operates fundamentally differently from full-batch GD, with batch sharpness rather than λ_max controlling stability, providing insights for modeling SGD trajectories and understanding generalization.

Abstract: Recent findings by Cohen et al., 2021, demonstrate that when training neural networks using full-batch gradient descent with a step size of $η$, the largest eigenvalue $λ_{\max}$ of the full-batch Hessian consistently stabilizes around $2/η$. These results have significant implications for convergence and generalization. This, however, is not the case for mini-batch optimization algorithms, limiting the broader applicabilityof the consequences of these findings. We show mini-batch Stochastic Gradient Descent (SGD) trains in a different regime we term Edge of Stochastic Stability (EoSS). In this regime, what stabilizes at $2/η$ is Batch Sharpness: the expected directional curvature of mini-batch Hessians along their corresponding stochastic gradients. As a consequence $λ_{\max}$ – which is generally smaller than Batch Sharpness – is suppressed, aligning with the long-standing empirical observation that smaller batches and larger step sizes favor flatter minima. We further discuss implications for mathematical modeling of SGD trajectories.

[650] Balancing the Scales: A Theoretical and Algorithmic Framework for Learning from Imbalanced Data

Corinna Cortes, Anqi Mao, Mehryar Mohri, Yutao Zhong

Main category: cs.LG

TL;DR: The paper introduces a theoretical framework for imbalanced classification, proposes a new class-imbalanced margin loss with strong consistency guarantees, and develops IMMAX algorithms that outperform existing methods.

Details

Motivation: Class imbalance is a major challenge in machine learning, especially for multi-class problems with long-tailed distributions. Existing methods like data resampling, cost-sensitive techniques, and logistic loss modifications lack solid theoretical foundations and can be inconsistent.

Method: Proposes a novel theoretical framework with a new class-imbalanced margin loss function for binary and multi-class settings. Proves strong H-consistency, derives learning guarantees using empirical loss and class-sensitive Rademacher complexity, and develops IMMAX (Imbalanced Margin Maximization) algorithms that incorporate confidence margins.

Result: Theoretical results include strong consistency guarantees for the proposed loss function. Empirical results demonstrate that IMMAX algorithms outperform existing baselines in imbalanced classification tasks.

Conclusion: The paper provides a solid theoretical foundation for imbalanced classification with provable guarantees, addressing the limitations of existing methods while delivering practical algorithms that achieve superior performance.

Abstract: Class imbalance remains a major challenge in machine learning, especially in multi-class problems with long-tailed distributions. Existing methods, such as data resampling, cost-sensitive techniques, and logistic loss modifications, though popular and often effective, lack solid theoretical foundations. As an example, we demonstrate that cost-sensitive methods are not Bayes-consistent. This paper introduces a novel theoretical framework for analyzing generalization in imbalanced classification. We propose a new class-imbalanced margin loss function for both binary and multi-class settings, prove its strong $H$-consistency, and derive corresponding learning guarantees based on empirical loss and a new notion of class-sensitive Rademacher complexity. Leveraging these theoretical results, we devise novel and general learning algorithms, IMMAX (Imbalanced Margin Maximization), which incorporate confidence margins and are applicable to various hypothesis sets. While our focus is theoretical, we also present extensive empirical results demonstrating the effectiveness of our algorithms compared to existing baselines.

[651] Why Do Language Model Agents Whistleblow?

Kushal Agrawal, Frank Xiao, Guido Bergman, Asa Cooper Stickland

Main category: cs.LG

TL;DR: LLMs can act as whistleblowers by reporting suspected misconduct to external parties without user instruction, with behavior varying based on model type, task complexity, moral nudges, and available tools.

Details

Motivation: LLMs deployed as tool-using agents exhibit new alignment behaviors, including potentially reporting misconduct without user knowledge, raising concerns about unintended consequences of AI agents acting autonomously.

Method: Created an evaluation suite with diverse staged misconduct scenarios to assess LLM whistleblowing behavior. Tested across models and settings, examining factors like task complexity, moral nudges in prompts, and available tools/workflows.

Result: Whistleblowing frequency varies widely across model families. Complex tasks lower whistleblowing, moral nudges increase it, and providing more tools/detailed workflows decreases it. Dataset shows lower evaluation awareness than previous work.

Conclusion: LLM whistleblowing is a real phenomenon influenced by multiple factors. Understanding these dynamics is crucial for responsible AI deployment, as agents may act against user interests in unexpected ways.

Abstract: The deployment of Large Language Models (LLMs) as tool-using agents causes their alignment training to manifest in new ways. Recent work finds that language models can use tools in ways that contradict the interests or explicit instructions of the user. We study LLM whistleblowing: a subset of this behavior where models disclose suspected misconduct to parties beyond the dialog boundary (e.g., regulatory agencies) without user instruction or knowledge. We introduce an evaluation suite of diverse and realistic staged misconduct scenarios to assess agents for this behavior. Across models and settings, we find that: (1) the frequency of whistleblowing varies widely across model families, (2) increasing the complexity of the task the agent is instructed to complete lowers whistleblowing tendencies, (3) nudging the agent in the system prompt to act morally substantially raises whistleblowing rates, and (4) giving the model more obvious avenues for non-whistleblowing behavior, by providing more tools and a detailed workflow to follow, decreases whistleblowing rates. Additionally, we verify the robustness of our dataset by testing for model evaluation awareness, and find that both black-box methods and probes on model activations show lower evaluation awareness in our settings than in comparable previous work.

[652] GraphOracle: Efficient Fully-Inductive Knowledge Graph Reasoning via Relation-Dependency Graphs

Enjun Du, Siyi Liu, Yongqi Zhang

Main category: cs.LG

TL;DR: GraphOracle: A novel framework for fully-inductive knowledge graph reasoning that transforms KGs into Relation-Dependency Graphs to capture compositional patterns and enable reasoning on unseen entities and relations.

Details

Motivation: Knowledge graph reasoning in fully-inductive settings (where both entities and relations at test time are unseen during training) remains an open challenge that needs to be addressed.

Method: Transforms knowledge graphs into Relation-Dependency Graphs (RDGs) that encode directed precedence links between relations, reducing graph density. Uses multi-head attention to propagate information over RDG to produce context-aware relation embeddings, then guides a second GNN to perform inductive message passing over the original KG.

Result: Outperforms prior methods by up to 25% in fully-inductive scenarios and 28% in cross-domain scenarios across 60 benchmarks. Analysis confirms that compact RDG structure and attention-based propagation are key to efficient and accurate generalization.

Conclusion: GraphOracle achieves robust fully-inductive reasoning through RDG transformation and attention-based propagation, enabling effective prediction on entirely new entities and relations with superior performance over existing methods.

Abstract: Knowledge graph reasoning in the fully-inductive setting, where both entities and relations at test time are unseen during training, remains an open challenge. In this work, we introduce GraphOracle, a novel framework that achieves robust fully-inductive reasoning by transforming each knowledge graph into a Relation-Dependency Graph (RDG). The RDG encodes directed precedence links between relations, capturing essential compositional patterns while drastically reducing graph density. Conditioned on a query relation, a multi-head attention mechanism propagates information over the RDG to produce context-aware relation embeddings. These embeddings then guide a second GNN to perform inductive message passing over the original knowledge graph, enabling prediction on entirely new entities and relations. Comprehensive experiments on 60 benchmarks demonstrate that GraphOracle outperforms prior methods by up to 25% in fully-inductive and 28% in cross-domain scenarios. Our analysis further confirms that the compact RDG structure and attention-based propagation are key to efficient and accurate generalization.

[653] Ambiguous Online Learning

Vanessa Kosoy

Main category: cs.LG

TL;DR: The paper introduces “ambiguous online learning” where learners can output multiple predicted labels, which is correct if at least one label is correct and none are “predictably wrong” according to an unknown multi-valued hypothesis.

Details

Motivation: The setting naturally arises in multivalued dynamical systems, recommendation algorithms, lossless compression, and is related to "apple tasting" problems where partial correctness matters.

Method: Proposes a new variant of online learning where predictions are sets of labels, and correctness depends on containing at least one correct label while avoiding “predictably wrong” labels defined by an unknown multi-valued hypothesis class.

Result: Shows a trichotomy of mistake bounds: up to logarithmic factors, any hypothesis class has optimal mistake bound of either Θ(1), Θ(√N), or N, where N is the number of rounds.

Conclusion: Ambiguous online learning provides a natural framework for scenarios requiring multiple predictions, revealing a fundamental trichotomy in achievable mistake bounds across hypothesis classes.

Abstract: We propose a new variant of online learning that we call “ambiguous online learning”. In this setting, the learner is allowed to produce multiple predicted labels. Such an “ambiguous prediction” is considered correct when at least one of the labels is correct, and none of the labels are “predictably wrong”. The definition of “predictably wrong” comes from a hypothesis class in which hypotheses are also multi-valued. Thus, a prediction is “predictably wrong” if it’s not allowed by the (unknown) true hypothesis. In particular, this setting is natural in the context of multivalued dynamical systems, recommendation algorithms and lossless compression. It is also strongly related to so-called “apple tasting”. We show that in this setting, there is a trichotomy of mistake bounds: up to logarithmic factors, any hypothesis class has an optimal mistake bound of either Theta(1), Theta(sqrt(N)) or N.

[654] Efficient Inference Using Large Language Models with Limited Human Data: Fine-Tuning then Rectification

Lei Wang, Zikun Ye, Jinglong Zhao

Main category: cs.LG

TL;DR: A two-stage framework combining fine-tuning and rectification with optimal allocation of limited labeled samples between stages, using variance minimization instead of MSE for better downstream rectification.

Details

Motivation: LLMs show promise as scalable surrogates for human responses in business applications, but need improvement through fine-tuning (aligning with human responses) and rectification (correcting biases). Limited labeled human data requires optimal allocation between these approaches.

Method: Two-stage framework: 1) Fine-tuning stage using variance minimization of prediction errors (instead of MSE) as objective, 2) Rectification stage to correct biases. Leverages fine-tuning scaling law to optimally allocate limited labeled samples between the two stages.

Result: Empirical validation confirms fine-tuning scaling law and optimal allocation rule. Shows substantial efficiency gains in estimation and inference performance compared to using fine-tuning or rectification alone, or using standard MSE objective, resulting in significant cost savings.

Conclusion: Proposed framework with variance minimization objective and optimal sample allocation between fine-tuning and rectification stages outperforms conventional approaches, providing more reliable LLM outputs for business decisions with limited labeled data.

Abstract: Driven by recent advances in artificial intelligence (AI), a growing literature has demonstrated the potential for using large language models (LLMs) as scalable surrogates to generate human-like responses in many business applications. Two common approaches to improve the performance of LLMs include: fine-tuning, which aligns LLMs more closely with human responses, and rectification, which corrects biases in LLM outputs. In this paper, we develop a two-stage framework that combines fine-tuning and rectification, and optimally allocates limited labeled samples across the two stages. Unlike the conventional objective that minimizes the mean squared prediction errors, we propose to minimize the variance of the prediction errors as the fine-tuning objective, which is optimal for the downstream rectification stage. Building on this insight, we leverage the scaling law of fine-tuning to optimally allocate the limited labeled human data between the fine-tuning and rectification stages. Our empirical analysis validates the fine-tuning scaling law and confirms that our proposed optimal allocation rule reliably identifies the optimal sample allocation. We demonstrate substantial efficiency gains in estimation and inference performance relative to fine-tuning or rectification alone, or to employing the standard mean-squared error objective within the fine-tuning then rectification framework, resulting in significant cost savings for reliable business decisions.

[655] Mastering Multiple-Expert Routing: Realizable $H$-Consistency and Strong Guarantees for Learning to Defer

Anqi Mao, Mehryar Mohri, Yutao Zhong

Main category: cs.LG

TL;DR: Novel surrogate loss functions and algorithms for learning to defer with multiple experts, with strong theoretical guarantees for both single-stage and two-stage learning scenarios.

Details

Motivation: Learning to defer with multiple experts involves optimally assigning input instances to experts, balancing accuracy and computational cost. This is critical in natural language generation, image processing, and medical diagnostics. Existing surrogate loss functions have challenges in ensuring consistency properties.

Method: Introduces novel surrogate loss functions and efficient algorithms with theoretical guarantees. Addresses realizable H-consistency, H-consistency bounds, and Bayes-consistency for both single-stage (joint predictor and deferral learning) and two-stage (deferral only with fixed expert) scenarios.

Result: For single-stage deferral: family of new realizable H-consistent surrogate losses with H-consistency proof for selected member. For two-stage deferral: new surrogate losses achieving realizable H-consistency, H-consistency bounds, and Bayes-consistency for two-expert and multiple-expert scenarios. Enhanced theoretical guarantees under low-noise assumptions. Experimental results show performance improvements over existing baselines.

Conclusion: The paper provides comprehensive theoretical foundations and practical algorithms for learning to defer with multiple experts, addressing key consistency challenges and demonstrating empirical effectiveness across various scenarios.

Abstract: The problem of learning to defer with multiple experts consists of optimally assigning input instances to experts, balancing the trade-off between their accuracy and computational cost. This is a critical challenge in natural language generation, but also in other fields such as image processing, and medical diagnostics. Recent studies have proposed surrogate loss functions to optimize deferral, but challenges remain in ensuring their consistency properties. This paper introduces novel surrogate loss functions and efficient algorithms with strong theoretical learning guarantees. We address open questions regarding realizable $H$-consistency, $H$-consistency bounds, and Bayes-consistency for both single-stage (jointly learning predictor and deferral function) and two-stage (learning only the deferral function with a fixed expert) learning scenarios. For single-stage deferral, we introduce a family of new realizable $H$-consistent surrogate losses and further prove $H$-consistency for a selected member. For two-stage deferral, we derive new surrogate losses that achieve realizable $H$-consistency, $H$-consistency bounds, and Bayes-consistency for the two-expert scenario and, under natural assumptions, multiple-expert scenario. Additionally, we provide enhanced theoretical guarantees under low-noise assumptions for both scenarios. Finally, we report the results of experiments using our proposed surrogate losses, comparing their performance against existing baselines.

[656] A new machine learning framework for occupational accidents forecasting with safety inspections integration

Aho Yapi, Pierre Latouche, Arnaud Guillin, Yan Bailly

Main category: cs.LG

TL;DR: A model-agnostic framework for short-term occupational accident forecasting using safety inspection data as binary time series, generating daily predictions aggregated into weekly safety assessments for proactive risk management.

Details

Motivation: To provide actionable, short-term risk signals from routine safety inspection data that can help prevent occupational accidents before they occur, allowing decision-makers to prioritize inspections and allocate resources effectively.

Method: Model-agnostic framework that treats accident occurrences as binary time series, applies sliding-window cross-validation for time series data, uses aggregated period-level metrics for evaluation, and compares multiple ML algorithms (logistic regression, tree-based models, neural networks).

Result: The framework reliably identifies upcoming high-risk periods across all tested algorithms, delivers robust period-level performance, and successfully converts safety inspections into actionable weekly/daily risk scores for accident forecasting.

Conclusion: Converting safety inspections into binary time series yields effective short-term risk signals that can be integrated into planning tools to prioritize inspections, schedule interventions, and allocate resources to high-risk sites/shifts before incidents occur.

Abstract: We propose a model-agnostic framework for short-term occupational accident forecasting that leverages safety inspections and models accident occurrences as binary time series. The approach generates daily predictions, which are then aggregated into weekly safety assessments for better decision making. To ensure the reliability and operational applicability of the forecasts, we apply a sliding-window cross-validation procedure specifically designed for time series data, combined with an evaluation based on aggregated period-level metrics. Several machine learning algorithms, including logistic regression, tree-based models, and neural networks, are trained and systematically compared within this framework. Across all tested algorithms, the proposed framework reliably identifies upcoming high-risk periods and delivers robust period-level performance, demonstrating that converting safety inspections into binary time series yields actionable, short-term risk signals. The proposed methodology converts routine safety inspection data into clear weekly and daily risk scores, detecting the periods when accidents are most likely to occur. Decision-makers can integrate these scores into their planning tools to classify inspection priorities, schedule targeted interventions, and funnel resources to the sites or shifts classified as highest risk, stepping in before incidents occur and getting the greatest return on safety investments.

[657] Zero-Shot Context Generalization in Reinforcement Learning from Few Training Contexts

James Chapman, Kedar Karhadkar, Guido Montufar

Main category: cs.LG

TL;DR: The paper introduces Context Sample Enhancement (CSE), a data augmentation method that improves generalization in deep reinforcement learning when training on limited contexts by leveraging regularity in contextual MDPs.

Details

Motivation: DRL policies often fail to generalize to environments with different parameters, and obtaining sufficient training data across diverse contexts is impractical in real-world applications.

Method: Proposes Context-enhanced Bellman Equation (CEBE) for single-context training, then derives Context Sample Enhancement (CSE) as an efficient data augmentation method to approximate CEBE in deterministic control environments.

Result: Analytical and empirical proof that CEBE yields first-order approximation to Q-function trained across multiple contexts; numerical validation shows CSE improves generalization in simulation environments.

Conclusion: CSE offers a practical solution to improve DRL generalization with limited training contexts by leveraging structural regularity in contextual MDPs.

Abstract: Deep reinforcement learning (DRL) has achieved remarkable success across multiple domains, including competitive games, natural language processing, and robotics. Despite these advancements, policies trained via DRL often struggle to generalize to evaluation environments with different parameters. This challenge is typically addressed by training with multiple contexts and/or by leveraging additional structure in the problem. However, obtaining sufficient training data across diverse contexts can be impractical in real-world applications. In this work, we consider contextual Markov decision processes (CMDPs) with transition and reward functions that exhibit regularity in context parameters. We introduce the context-enhanced Bellman equation (CEBE) to improve generalization when training on a single context. We prove both analytically and empirically that the CEBE yields a first-order approximation to the Q-function trained across multiple contexts. We then derive context sample enhancement (CSE) as an efficient data augmentation method for approximating the CEBE in deterministic control environments. We numerically validate the performance of CSE in simulation environments, showcasing its potential to improve generalization in DRL.

[658] Near-Optimal Regret for Efficient Stochastic Combinatorial Semi-Bandits

Zichun Ye, Runqi Wang, Xutong Liu, Shuai Li

Main category: cs.LG

TL;DR: CMOSS is a new combinatorial bandit algorithm that eliminates the log T factor from UCB methods while avoiding computational overhead of adversarial methods, achieving minimax optimal regret bounds.

Details

Motivation: Existing combinatorial bandit algorithms face a trade-off: UCB-based methods (like CUCB) suffer from additional log T regret factor that hurts long-term performance, while adversarial methods (like EXP3.M and HYBRID) have high computational overhead. There's a need for an algorithm that combines computational efficiency with optimal regret bounds.

Method: The authors introduce CMOSS (Combinatorial Minimax Optimal Strategy in the Stochastic setting), a computationally efficient algorithm designed for combinatorial multi-armed bandits. It works under semi-bandit feedback and is later extended to cascading feedback.

Result: CMOSS achieves instance-independent regret bounds: O((log k)√(kmT)) when k ≤ m/2, and O((m-k)√(log k log(m-k)T)) when k > m/2, where m is number of arms and k is maximum action cardinality. These bounds eliminate the log T dependency and match established lower bounds up to logarithmic terms. Experiments show CMOSS outperforms benchmarks in both regret and runtime.

Conclusion: CMOSS resolves the trade-off between computational efficiency and optimal regret bounds in combinatorial bandits, achieving minimax optimal performance without the log T factor of UCB methods or the computational overhead of adversarial methods.

Abstract: The combinatorial multi-armed bandit (CMAB) is a cornerstone of sequential decision-making framework, dominated by two algorithmic families: UCB-based and adversarial methods such as follow the regularized leader (FTRL) and online mirror descent (OMD). However, prominent UCB-based approaches like CUCB suffer from additional regret factor $\log T$ that is detrimental over long horizons, while adversarial methods such as EXP3.M and HYBRID impose significant computational overhead. To resolve this trade-off, we introduce the Combinatorial Minimax Optimal Strategy in the Stochastic setting (CMOSS). CMOSS is a computationally efficient algorithm that achieves an instance-independent regret of $O\big( (\log k)\sqrt{kmT}\big )$ when $k\leq \frac{m}{2}$ and $O\big((m-k)\sqrt{\log k\log(m-k)T}\big )$ when $k>\frac{m}{2}$ under semi-bandit feedback, where $m$ is the number of arms and $k$ is the maximum cardinality of a feasible action. Crucially, this result eliminates the dependency on $\log T$ and matches the established lower bounds of $Ω\big(\sqrt{kmT}\big)$ when $k\leq \frac{m}{2}$ and $Ω\big((m-k)\sqrt{\log (\frac{m}{m-k}) T}\big)$ when $k>\frac{m}{2}$ up to logarithmic terms of $k$ and $m$. We then extend our analysis to show that CMOSS is also applicable to cascading feedback. Experiments on synthetic and real-world datasets validate that CMOSS consistently outperforms benchmark algorithms in both regret and runtime efficiency.

[659] Forecasting in Offline Reinforcement Learning for Non-stationary Environments

Suzan Ece Ada, Georg Martius, Emre Ugur, Erhan Oztop

Main category: cs.LG

TL;DR: FORL is a framework that combines conditional diffusion models with zero-shot time-series foundation models to handle non-stationary environments in offline RL, improving performance when facing unexpected offsets.

Details

Motivation: Existing offline RL methods assume stationarity or only handle synthetic perturbations, but real-world scenarios often have abrupt, time-varying offsets that create partial observability and degrade agent performance.

Method: FORL unifies (1) conditional diffusion-based candidate state generation trained without assuming specific non-stationarity patterns, and (2) zero-shot time-series foundation models to handle unexpected, potentially non-Markovian offsets.

Result: Empirical evaluations on offline RL benchmarks augmented with real-world time-series data show FORL consistently improves performance compared to competitive baselines in non-stationary environments.

Conclusion: FORL bridges the gap between offline RL and real-world non-stationary environments by integrating zero-shot forecasting with agent experience, enabling robust performance from episode onset.

Abstract: Offline Reinforcement Learning (RL) provides a promising avenue for training policies from pre-collected datasets when gathering additional interaction data is infeasible. However, existing offline RL methods often assume stationarity or only consider synthetic perturbations at test time, assumptions that often fail in real-world scenarios characterized by abrupt, time-varying offsets. These offsets can lead to partial observability, causing agents to misperceive their true state and degrade performance. To overcome this challenge, we introduce Forecasting in Non-stationary Offline RL (FORL), a framework that unifies (i) conditional diffusion-based candidate state generation, trained without presupposing any specific pattern of future non-stationarity, and (ii) zero-shot time-series foundation models. FORL targets environments prone to unexpected, potentially non-Markovian offsets, requiring robust agent performance from the onset of each episode. Empirical evaluations on offline RL benchmarks, augmented with real-world time-series data to simulate realistic non-stationarity, demonstrate that FORL consistently improves performance compared to competitive baselines. By integrating zero-shot forecasting with the agent’s experience, we aim to bridge the gap between offline RL and the complexities of real-world, non-stationary environments.

[660] Scaled-Dot-Product Attention as One-Sided Entropic Optimal Transport

Elon Litman

Main category: cs.LG

TL;DR: The paper provides a first-principles justification for scaled-dot-product attention (SDPA) by showing it solves an entropic optimal transport problem in the forward pass, and that backpropagation gradients are equivalent to advantage-based policy gradients due to the induced information geometry.

Details

Motivation: SDPA is a core component of modern deep learning but has been motivated by heuristics rather than first principles. The paper aims to provide a rigorous mathematical foundation for why SDPA takes its particular form and how its forward and backward passes work together.

Method: The authors show that the attention forward pass is the exact solution to a degenerate, one-sided Entropic Optimal Transport (EOT) problem. They prove that standard backpropagation gradients are mathematically identical to advantage-based policy gradients from reinforcement learning, and demonstrate that the EOT formulation induces a specific information geometry characterized by the Fisher Information Matrix.

Result: The paper establishes that SDPA’s forward pass performs optimal inference (solving an EOT problem) while the backward pass implements a rational, manifold-aware learning update (advantage-based policy gradient). This reveals SDPA as a principled mechanism where forward and backward passes are mathematically unified through information geometry.

Conclusion: SDPA is not just a heuristic but a principled mechanism where the forward pass solves an optimal transport problem and the backward pass performs natural gradient updates on the induced statistical manifold, providing a unified first-principles justification for attention mechanisms.

Abstract: The scaled-dot-product attention (SDPA) mechanism is a core component of modern deep learning, but its mathematical form is often motivated by heuristics. This work provides a first-principles justification for SDPA. We first show that the attention forward pass is the exact solution to a degenerate, one-sided Entropic Optimal Transport (EOT) problem, which seeks a distribution that maximizes similarity while being maximally entropic. This optimization perspective has a direct consequence for the backward pass. We prove that the standard gradient computed via backpropagation is mathematically identical to an advantage-based policy gradient, a variance-reduced update rule from reinforcement learning. Crucially, we demonstrate that the EOT formulation of the forward pass induces a specific information geometry on the space of attention distributions. It is this geometry, characterized by the Fisher Information Matrix, that dictates the precise form of the learning gradient, revealing the advantage-based update as a natural consequence of the optimization problem being solved. This unified view reveals SDPA as a principled mechanism where the forward pass performs optimal inference and the backward pass implements a rational, manifold-aware learning update.

[661] Data-driven particle dynamics: Structure-preserving coarse-graining for emergent behavior in non-equilibrium systems

Quercus Hernandez, Max Win, Thomas C. O’Connor, Paulo E. Arratia, Nathaniel Trask

Main category: cs.LG

TL;DR: A framework for machine learning coarse-grained dynamics from particle trajectories using metriplectic brackets that preserves thermodynamic laws and fluctuation-dissipation balance.

Details

Motivation: Multiscale systems are challenging to simulate due to the need to link short spatiotemporal scales to emergent bulk physics. When coarse-graining high-dimensional systems into low-dimensional models, information loss leads to emergent physics that is dissipative, history-dependent, and stochastic.

Method: Proposes a framework using metriplectic bracket formalism to machine learn coarse-grained dynamics from time-series observations of particle trajectories. The framework guarantees discrete notions of thermodynamic laws, momentum conservation, and fluctuation-dissipation balance. Introduces a novel self-supervised learning strategy to identify emergent structural variables when labels are unavailable.

Result: Validated on benchmark systems and demonstrated on two challenging examples: (1) coarse-graining star polymers at challenging levels while preserving non-equilibrium statistics, and (2) learning models from high-speed video of colloidal suspensions capturing coupling between local rearrangement events and emergent stochastic dynamics.

Conclusion: The framework provides a principled approach to learning coarse-grained dynamics that preserves essential physical properties, with open-source implementations in PyTorch and LAMMPS for large-scale inference and extensibility to diverse particle-based systems.

Abstract: Multiscale systems are ubiquitous in science and technology, but are notoriously challenging to simulate as short spatiotemporal scales must be appropriately linked to emergent bulk physics. When expensive high-dimensional dynamical systems are coarse-grained into low-dimensional models, the entropic loss of information leads to emergent physics which are dissipative, history-dependent, and stochastic. To machine learn coarse-grained dynamics from time-series observations of particle trajectories, we propose a framework using the metriplectic bracket formalism that preserves these properties by construction; most notably, the framework guarantees discrete notions of the first and second laws of thermodynamics, conservation of momentum, and a discrete fluctuation-dissipation balance crucial for capturing non-equilibrium statistics. We introduce the mathematical framework abstractly before specializing to a particle discretization. As labels are generally unavailable for entropic state variables, we introduce a novel self-supervised learning strategy to identify emergent structural variables. We validate the method on benchmark systems and demonstrate its utility on two challenging examples: (1) coarse-graining star polymers at challenging levels of coarse-graining while preserving non-equilibrium statistics, and (2) learning models from high-speed video of colloidal suspensions that capture coupling between local rearrangement events and emergent stochastic dynamics. We provide open-source implementations in both PyTorch and LAMMPS, enabling large-scale inference and extensibility to diverse particle-based systems.

[662] Personalized Federated Learning with Heat-Kernel Enhanced Tensorized Multi-View Clustering

Kristina P. Sinaga

Main category: cs.LG

TL;DR: A personalized federated learning framework using heat-kernel enhanced tensorized multi-view fuzzy c-means clustering with tensor decomposition for multi-view data analysis with privacy preservation.

Details

Motivation: To address challenges in federated learning including data heterogeneity, privacy preservation, and communication efficiency while enabling effective multi-view data analysis across distributed clients.

Method: Integrates heat-kernel coefficients from quantum field theory with PARAFAC2 and Tucker tensor decomposition techniques to transform distance metrics and represent high-dimensional multi-view structures. Develops FedHK-PARAFAC2 and FedHK-Tucker algorithms to extract shared and view-specific features while preserving inter-view relationships.

Result: Theoretical analysis provides convergence guarantees, privacy bounds, and complexity analysis. The framework offers a novel approach for effective multi-view data analysis while ensuring data privacy in federated settings.

Conclusion: The integration of heat-kernel methods with tensor decomposition in federated learning provides an effective solution for personalized multi-view data analysis with privacy preservation and communication efficiency.

Abstract: This paper proposes a personalized federated learning framework integrating heat-kernel enhanced tensorized multi-view fuzzy c-means clustering with tensor decomposition techniques. The approach combines heat-kernel coefficients adapted from quantum field theory with PARAFAC2 and Tucker decomposition to transform distance metrics and efficiently represent high-dimensional multi-view structures. Two main algorithms, FedHK-PARAFAC2 and FedHK-Tucker, are developed to extract shared and view-specific features while preserving inter-view relationships. The framework addresses data heterogeneity, privacy preservation, and communication efficiency challenges in federated learning environments. Theoretical analysis provides convergence guarantees, privacy bounds, and complexity analysis. The integration of heat-kernel methods with tensor decomposition in a federated setting offers a novel approach for effective multi-view data analysis while ensuring data privacy.

[663] Improving Reasoning for Diffusion Language Models via Group Diffusion Policy Optimization

Kevin Rojas, Jiahe Lin, Kashif Rasul, Anderson Schneider, Yuriy Nevmyvaka, Molei Tao, Wei Deng

Main category: cs.LG

TL;DR: GDPO introduces a new RL algorithm for diffusion language models that reduces variance in ELBO estimation through semi-deterministic Monte Carlo schemes, outperforming existing methods on reasoning tasks.

Details

Motivation: Diffusion language models offer parallel generation advantages over autoregressive LLMs, but RL fine-tuning is challenging due to intractable likelihoods. Existing approaches like diffu-GRPO are biased, while principled ELBO-based methods are too computationally expensive.

Method: GDPO (Group Diffusion Policy Optimization) uses semi-deterministic Monte Carlo schemes to reduce variance in ELBO estimation. It decomposes ELBO variance sources and applies fast, deterministic integral approximations along key dimensions to create a provably lower-variance estimator.

Result: GDPO achieves consistent gains over pretrained checkpoints and outperforms diffu-GRPO (state-of-the-art baseline) on most math, reasoning, and coding benchmarks.

Conclusion: GDPO provides an effective RL fine-tuning approach for diffusion language models by addressing the variance explosion problem in ELBO estimation, enabling practical application of principled likelihood-based methods.

Abstract: Diffusion language models (DLMs) enable parallel, order-agnostic generation with iterative refinement, offering a flexible alternative to autoregressive large language models (LLMs). However, adapting reinforcement learning (RL) fine-tuning to DLMs remains an open challenge because of the intractable likelihood. Pioneering work such as diffu-GRPO estimated token-level likelihoods via one-step unmasking. While computationally efficient, this approach is severely biased. A more principled foundation lies in sequence-level likelihoods, where the evidence lower bound (ELBO) serves as a surrogate. Yet, despite this clean mathematical connection, ELBO-based methods have seen limited adoption due to the prohibitive cost of likelihood evaluation. In this work, we revisit ELBO estimation and disentangle its sources of variance. This decomposition motivates reducing variance through fast, deterministic integral approximations along a few pivotal dimensions. Building on this insight, we introduce Group Diffusion Policy Optimization (GDPO), a new RL algorithm tailored for DLMs. GDPO leverages simple yet effective Semi-deterministic Monte Carlo schemes to mitigate the variance explosion of ELBO estimators under vanilla double Monte Carlo sampling, yielding a provably lower-variance estimator under tight evaluation budgets. Empirically, GDPO achieves consistent gains over pretrained checkpoints and outperforms diffu-GRPO, one of the state-of-the-art baselines, on the majority of math, reasoning, and coding benchmarks.

[664] TENG++: Time-Evolving Natural Gradient for Solving PDEs With Deep Neural Nets under General Boundary Conditions

Xinjie He, Chenggong Zhang

Main category: cs.LG

TL;DR: Extends Time-Evolving Natural Gradient (TENG) framework to handle Dirichlet boundary conditions in PDEs using natural gradient optimization with time-stepping schemes (Euler/Heun), achieving improved accuracy and stability.

Details

Motivation: Traditional numerical methods struggle with high-dimensional/complex PDEs, and while PINNs offer an alternative, they face challenges with accuracy and complex boundary conditions. There's a need for more robust neural network-based PDE solvers that can properly handle boundary constraints.

Method: Extends TENG framework to incorporate Dirichlet boundary conditions by adding boundary condition penalty terms to the loss function. Uses natural gradient optimization combined with numerical time-stepping schemes (Euler and Heun methods) to ensure stability and accuracy.

Result: Experiments on heat equation show Heun method provides superior accuracy due to second-order corrections, while Euler method offers computational efficiency for simpler scenarios. The approach successfully enforces Dirichlet constraints precisely.

Conclusion: Establishes foundation for extending framework to Neumann/mixed boundary conditions and broader PDE classes, advancing neural network-based solvers for real-world applications with complex boundary conditions.

Abstract: Partial Differential Equations (PDEs) are central to modeling complex systems across physical, biological, and engineering domains, yet traditional numerical methods often struggle with high-dimensional or complex problems. Physics-Informed Neural Networks (PINNs) have emerged as an efficient alternative by embedding physics-based constraints into deep learning frameworks, but they face challenges in achieving high accuracy and handling complex boundary conditions. In this work, we extend the Time-Evolving Natural Gradient (TENG) framework to address Dirichlet boundary conditions, integrating natural gradient optimization with numerical time-stepping schemes, including Euler and Heun methods, to ensure both stability and accuracy. By incorporating boundary condition penalty terms into the loss function, the proposed approach enables precise enforcement of Dirichlet constraints. Experiments on the heat equation demonstrate the superior accuracy of the Heun method due to its second-order corrections and the computational efficiency of the Euler method for simpler scenarios. This work establishes a foundation for extending the framework to Neumann and mixed boundary conditions, as well as broader classes of PDEs, advancing the applicability of neural network-based solvers for real-world problems.

[665] Search Self-play: Pushing the Frontier of Agent Capability without Supervision

Hongliang Lu, Yuhang Wen, Pengyu Cheng, Ruijin Ding, Jiaqi Guo, Haotian Xu, Chutian Wang, Haonan Chen, Xiaoxi Jiang, Guanjun Jiang

Main category: cs.LG

TL;DR: SSP introduces a self-play framework where an LLM agent acts as both task proposer and solver to generate increasingly difficult search queries with verifiable answers, enabling scalable reinforcement learning for search agents without human supervision.

Details

Motivation: Current RLVR methods for LLM agents require significant human effort to craft task queries and ground-truth answers, limiting scalability. Existing task synthesis methods struggle to control task difficulty effectively for RL training in agentic scenarios.

Method: Search Self-Play (SSP) framework where the LLM agent acts as both task proposer (generating deep search queries with increasing difficulty) and problem solver. Uses retrieval-augmented generation (RAG) to verify ground truth by collecting all search results from proposer’s trajectory as external knowledge.

Result: SSP significantly improves search agents’ performance uniformly across various benchmarks without any supervision, working effectively in both from-scratch and continuous RL training setups.

Conclusion: SSP enables scalable agentic RLVR through self-play co-evolution, eliminating the need for human-crafted tasks while maintaining verifiable rewards and controlled task difficulty progression.

Abstract: Reinforcement learning with verifiable rewards (RLVR) has become the mainstream technique for training LLM agents. However, RLVR highly depends on well-crafted task queries and corresponding ground-truth answers to provide accurate rewards, which requires significant human effort and hinders the scaling of RL processes, especially in agentic scenarios. Although a few recent works explore task synthesis methods, the difficulty of generated agentic tasks can hardly be controlled to provide effective RL training advantages. To achieve agentic RLVR with higher scalability, we explore self-play training for deep search agents, in which the learning LLM utilizes multi-turn search engine calling and acts simultaneously as both a task proposer and a problem solver. The task proposer aims to generate deep search queries with well-defined ground-truth answers and increasing task difficulty. The problem solver tries to handle the generated search queries and output the correct answer predictions. To ensure that each generated search query has accurate ground truth, we collect all the searching results from the proposer’s trajectory as external knowledge, then conduct retrieval-augmentation generation (RAG) to test whether the proposed query can be correctly answered with all necessary search documents provided. In this search self-play (SSP) game, the proposer and the solver co-evolve their agent capabilities through both competition and cooperation. With substantial experimental results, we find that SSP can significantly improve search agents’ performance uniformly on various benchmarks without any supervision under both from-scratch and continuous RL training setups. The code is at https://github.com/Qwen-Applications/SSP.

[666] Parameter-Efficient and Personalized Federated Training of Generative Models at the Edge

Kabir Khan, Manju Sarkar, Anita Kar, Suresh Ghosh

Main category: cs.LG

TL;DR: FedGen-Edge: A federated learning framework that decouples frozen pre-trained generative models from lightweight client adapters, using LoRA to reduce communication by 99% while enabling personalization on edge devices.

Details

Motivation: Large generative models are hard to train in federated settings due to heavy computation/communication and statistical/system heterogeneity. There's a need for privacy-preserving, resource-efficient generative AI on edge devices.

Method: Decouples frozen pre-trained global backbone from lightweight client-side adapters, uses Low-Rank Adaptation (LoRA) to constrain client updates to compact subspace, federates only adapters via FedAvg-style server.

Result: Achieves 99% reduction in uplink traffic vs full-model FedAvg, lower perplexity/FID on PTB and CIFAR-10, faster convergence, stable aggregation under non-IID data, and supports personalization through local adapter tuning.

Conclusion: FedGen-Edge offers practical path for privacy-preserving, resource-aware, personalized generative AI on heterogeneous edge devices with simple server architecture and efficient adaptation.

Abstract: Large generative models (for example, language and diffusion models) enable high-quality text and image synthesis but are hard to train or adapt in cross-device federated settings due to heavy computation and communication and statistical/system heterogeneity. We propose FedGen-Edge, a framework that decouples a frozen, pre-trained global backbone from lightweight client-side adapters and federates only the adapters. Using Low-Rank Adaptation (LoRA) constrains client updates to a compact subspace, which reduces uplink traffic by more than 99 percent versus full-model FedAvg, stabilizes aggregation under non-IID data, and naturally supports personalization because each client can keep a locally tuned adapter. On language modeling (PTB) and image generation (CIFAR-10), FedGen-Edge achieves lower perplexity/FID and faster convergence than strong baselines while retaining a simple FedAvg-style server. A brief ablation shows diminishing returns beyond moderate LoRA rank and a trade-off between local epochs and client drift. FedGen-Edge offers a practical path toward privacy-preserving, resource-aware, and personalized generative AI on heterogeneous edge devices.

[667] Differentiable Energy-Based Regularization in GANs: A Simulator-Based Exploration of VQE-Inspired Auxiliary Losses

David Strnadel

Main category: cs.LG

TL;DR: Differentiable VQE-inspired quantum energy terms added to ACGAN provide no measurable benefit over simple classical regularizers in MNIST generation tasks.

Details

Motivation: To explore whether parameterized quantum circuits can provide useful regularization signals for GANs, potentially offering quantum-enhanced performance in generative modeling.

Method: Augmented ACGAN generator objective with VQE-inspired energy term computed from class-specific Ising Hamiltonians using Qiskit’s EstimatorQNN and TorchConnector, tested on MNIST with 4-qubit noiseless simulator and simple Hamiltonian parameterization.

Result: Initial high accuracy (99-100%) was replicated by classical alternatives (learned biases, MLP surrogates, random noise, unregularized baseline). Classical baselines performed better on FID scores. No causal benefit from quantum regularization was found.

Conclusion: VQE-inspired energy terms provide no measurable benefit beyond trivial classical regularizers. The main contribution is methodological: demonstrating technical feasibility of differentiable VQE integration and the necessity of rigorous ablation studies to avoid false claims of quantum advantage.

Abstract: This paper presents an exploratory, simulator-based proof of concept investigating whether differentiable energy terms derived from parameterized quantum circuits can serve as auxiliary regularization signals in Generative Adversarial Networks (GANs). We augment the Auxiliary Classifier GAN (ACGAN) generator objective with a Variational Quantum Eigensolver (VQE)-inspired energy term computed from class-specific Ising Hamiltonians using Qiskit’s EstimatorQNN and TorchConnector. All experiments are performed on a noiseless statevector simulator with only four qubits, using a deliberately simple Hamiltonian parameterization. On MNIST, the energy-regularized model initially achieves high external-classifier accuracy (99-100 percent) within five epochs compared to 87.8 percent for an earlier, unmatched ACGAN baseline. However, a rigorous, pre-registered ablation study demonstrates that these improvements are fully replicated by simple classical alternatives, including learned per-class biases, MLP-based surrogates, random noise, and even an unregularized baseline under matched training conditions. All classical variants reach approximately 99 percent accuracy. For sample quality as measured by FID, classical baselines are not merely equivalent but systematically superior to the VQE-based formulation. We therefore report a clear negative result. The VQE-inspired energy term provides no measurable causal benefit beyond trivial classical regularizers in this setting. The primary contribution of this work is methodological, demonstrating both the technical feasibility of differentiable VQE integration into GAN training and the necessity of rigorous ablation studies to avoid spurious claims of quantum-enhanced performance.

[668] BézierFlow: Learning Bézier Stochastic Interpolant Schedulers for Few-Step Generation

Yunhong Min, Juil Koo, Seungwoo Yoo, Minhyuk Sung

Main category: cs.LG

TL;DR: BézierFlow: Lightweight training method for few-step generation with pretrained diffusion/flow models, achieving 2-3x improvement in ≤10 NFEs with only 15 min training by learning optimal Bézier-based trajectory transformations instead of just ODE timesteps.

Details

Motivation: Existing lightweight training approaches for few-step generation are limited to learning optimal ODE discretization timesteps, restricting their scope. The authors aim to broaden this scope by learning optimal transformations of the entire sampling trajectory.

Method: Proposes learning optimal stochastic interpolant (SI) schedulers by parameterizing them as Bézier functions. Bézier control points naturally enforce critical requirements: boundary conditions, differentiability, and monotonicity of SNR. Reduces problem to learning an ordered set of points in time range.

Result: Achieves 2-3x performance improvement for sampling with ≤10 NFEs across various pretrained diffusion and flow models, requiring only 15 minutes of training. Consistently outperforms prior timestep-learning methods.

Conclusion: Expanding the search space from discrete timesteps to Bézier-based trajectory transformations is effective for few-step generation, demonstrating the value of broader parameterization in lightweight training approaches for diffusion/flow models.

Abstract: We introduce BézierFlow, a lightweight training approach for few-step generation with pretrained diffusion and flow models. BézierFlow achieves a 2-3x performance improvement for sampling with $\leq$ 10 NFEs while requiring only 15 minutes of training. Recent lightweight training approaches have shown promise by learning optimal timesteps, but their scope remains restricted to ODE discretizations. To broaden this scope, we propose learning the optimal transformation of the sampling trajectory by parameterizing stochastic interpolant (SI) schedulers. The main challenge lies in designing a parameterization that satisfies critical desiderata, including boundary conditions, differentiability, and monotonicity of the SNR. To effectively meet these requirements, we represent scheduler functions as Bézier functions, where control points naturally enforce these properties. This reduces the problem to learning an ordered set of points in the time range, while the interpretation of the points changes from ODE timesteps to Bézier control points. Across a range of pretrained diffusion and flow models, BézierFlow consistently outperforms prior timestep-learning methods, demonstrating the effectiveness of expanding the search space from discrete timesteps to Bézier-based trajectory transformations.

[669] Learning Evolving Latent Strategies for Multi-Agent Language Systems without Model Fine-Tuning

Wenlong Tang

Main category: cs.LG

TL;DR: Multi-agent language framework enables continual strategy evolution without fine-tuning LLM parameters by using external latent vectors updated through environmental interaction and reinforcement feedback.

Details

Motivation: To enable language agents to develop and evolve strategic behaviors over time without the computational cost of fine-tuning model parameters, and to provide interpretable abstract strategic representations.

Method: Dual-loop architecture: behavior loop adjusts action preferences based on environmental rewards, while language loop updates external latent vectors by reflecting on semantic embeddings of generated text. Latent vectors of abstract concepts are liberated from static semantic representations.

Result: Agents’ latent spaces show clear convergence trajectories under reflection-driven updates with structured shifts at critical moments. System demonstrates emergent ability to implicitly infer and adapt to emotional agents without shared rewards.

Conclusion: External latent space provides language agents with low-cost, scalable, and interpretable abstract strategic representation without modifying model parameters, enabling continual strategy evolution.

Abstract: This study proposes a multi-agent language framework that enables continual strategy evolution without fine-tuning the language model’s parameters. The core idea is to liberate the latent vectors of abstract concepts from traditional static semantic representations, allowing them to be continuously updated through environmental interaction and reinforcement feedback. We construct a dual-loop architecture: the behavior loop adjusts action preferences based on environmental rewards, while the language loop updates the external latent vectors by reflecting on the semantic embeddings of generated text. Together, these mechanisms allow agents to develop stable and disentangled strategic styles over long-horizon multi-round interactions. Experiments show that agents’ latent spaces exhibit clear convergence trajectories under reflection-driven updates, along with structured shifts at critical moments. Moreover, the system demonstrates an emergent ability to implicitly infer and continually adapt to emotional agents, even without shared rewards. These results indicate that, without modifying model parameters, an external latent space can provide language agents with a low-cost, scalable, and interpretable form of abstract strategic representation.

[670] Learning solution operator of dynamical systems with diffusion maps kernel ridge regression

Jiwoo Song, Daning Huang, John Harlim

Main category: cs.LG

TL;DR: DM-KRR: A simple kernel ridge regression method with diffusion maps kernel for long-term prediction of complex dynamical systems, outperforming state-of-the-art methods by respecting intrinsic geometric constraints.

Details

Motivation: Existing methods for predicting complex dynamical systems often struggle with long-term accuracy and data efficiency. Traditional approaches that require explicit manifold reconstruction or attractor modeling can limit predictive performance. There's a need for methods that can adapt to the intrinsic geometry of system invariant sets without complex modeling procedures.

Method: Proposes Diffusion Maps Kernel Ridge Regression (DM-KRR) - a kernel ridge regression framework with a dynamic-aware validation strategy. Uses a data-driven kernel derived from diffusion maps that implicitly adapts to the intrinsic geometry of the system’s invariant set. Avoids explicit manifold reconstruction or attractor modeling, focusing instead on geometric constraints encoded in the data.

Result: DM-KRR consistently outperforms state-of-the-art methods (random feature, neural-network, and operator-learning methods) across diverse systems including smooth manifolds, chaotic attractors, and high-dimensional spatiotemporal flows. Demonstrates superior accuracy and data efficiency in long-term prediction tasks.

Conclusion: Long-term predictive skill depends critically on respecting geometric constraints encoded in data through dynamically consistent model selection. The combination of simplicity, geometry awareness, and strong empirical performance makes DM-KRR a promising approach for reliable and efficient learning of complex dynamical systems.

Abstract: In this work, we propose a simple kernel ridge regression (KRR) framework with a dynamic-aware validation strategy for long-term prediction of complex dynamical systems. By employing a data-driven kernel derived from diffusion maps, the proposed Diffusion Maps Kernel Ridge Regression (DM-KRR) method implicitly adapts to the intrinsic geometry of the system’s invariant set, without requiring explicit manifold reconstruction or attractor modeling, procedures that often limit predictive performance. Across a broad range of systems, including smooth manifolds, chaotic attractors, and high-dimensional spatiotemporal flows, DM-KRR consistently outperforms state-of-the-art random feature, neural-network and operator-learning methods in both accuracy and data efficiency. These findings underscore that long-term predictive skill depends not only on model expressiveness, but critically on respecting the geometric constraints encoded in the data through dynamically consistent model selection. Together, simplicity, geometry awareness, and strong empirical performance point to a promising path for reliable and efficient learning of complex dynamical systems.

[671] Adversarially Robust Detection of Harmful Online Content: A Computational Design Science Approach

Yidong Chai, Yi Liu, Mohammadreza Ebrahimi, Weifeng Li, Balaji Padmanabhan

Main category: cs.LG

TL;DR: Proposes LLM-SGA framework and ARHOCD detector to improve adversarial robustness in harmful content detection on social media, achieving better generalizability and accuracy against attacks.

Details

Motivation: Social media platforms face harmful content like hate speech and misinformation, but ML detection models are vulnerable to adversarial attacks where users subtly modify text to evade detection. Need robust detectors that maintain both generalizability against diverse attacks and high accuracy.

Method: Two-part sequential approach: 1) LLM-SGA framework identifies key invariances of textual adversarial attacks for strong generalizability; 2) ARHOCD detector with three novel components: ensemble of multiple base detectors, dynamic weight assignment using domain knowledge and Bayesian inference, and iterative adversarial training optimizing both base detectors and weight assignor.

Result: Empirical evaluation across three datasets (hate speech, rumor, extremist content) shows ARHOCD offers strong generalizability and improves detection accuracy under adversarial conditions, addressing limitations of existing adversarial robustness research.

Conclusion: The proposed LLM-SGA framework and ARHOCD detector effectively enhance adversarial robustness for harmful content detection, achieving both strong generalizability against diverse attacks and improved accuracy in adversarial settings.

Abstract: Social media platforms are plagued by harmful content such as hate speech, misinformation, and extremist rhetoric. Machine learning (ML) models are widely adopted to detect such content; however, they remain highly vulnerable to adversarial attacks, wherein malicious users subtly modify text to evade detection. Enhancing adversarial robustness is therefore essential, requiring detectors that can defend against diverse attacks (generalizability) while maintaining high overall accuracy. However, simultaneously achieving both optimal generalizability and accuracy is challenging. Following the computational design science paradigm, this study takes a sequential approach that first proposes a novel framework (Large Language Model-based Sample Generation and Aggregation, LLM-SGA) by identifying the key invariances of textual adversarial attacks and leveraging them to ensure that a detector instantiated within the framework has strong generalizability. Second, we instantiate our detector (Adversarially Robust Harmful Online Content Detector, ARHOCD) with three novel design components to improve detection accuracy: (1) an ensemble of multiple base detectors that exploits their complementary strengths; (2) a novel weight assignment method that dynamically adjusts weights based on each sample’s predictability and each base detector’s capability, with weights initialized using domain knowledge and updated via Bayesian inference; and (3) a novel adversarial training strategy that iteratively optimizes both the base detectors and the weight assignor. We addressed several limitations of existing adversarial robustness enhancement research and empirically evaluated ARHOCD across three datasets spanning hate speech, rumor, and extremist content. Results show that ARHOCD offers strong generalizability and improves detection accuracy under adversarial conditions.

[672] TraCeR: Transformer-Based Competing Risk Analysis with Longitudinal Covariates

Maxmillan Ries, Sohan Seth

Main category: cs.LG

TL;DR: TraCeR is a transformer-based survival analysis framework that handles longitudinal covariates and assesses model calibration, outperforming state-of-the-art methods.

Details

Motivation: Current deep learning survival models struggle with incorporating longitudinal covariates and focus too much on discrimination metrics while neglecting calibration assessment.

Method: Transformer-based framework with factorized self-attention architecture that estimates hazard functions from sequences of measurements, handling censored data and competing events naturally.

Result: TraCeR achieves substantial and statistically significant performance improvements over state-of-the-art methods on multiple real-world datasets.

Conclusion: TraCeR successfully addresses key limitations in survival analysis by incorporating longitudinal covariates and providing comprehensive evaluation including calibration assessment.

Abstract: Survival analysis is a critical tool for modeling time-to-event data. Recent deep learning-based models have reduced various modeling assumptions including proportional hazard and linearity. However, a persistent challenge remains in incorporating longitudinal covariates, with prior work largely focusing on cross-sectional features, and in assessing calibration of these models, with research primarily focusing on discrimination during evaluation. We introduce TraCeR, a transformer-based survival analysis framework for incorporating longitudinal covariates. Based on a factorized self-attention architecture, TraCeR estimates the hazard function from a sequence of measurements, naturally capturing temporal covariate interactions without assumptions about the underlying data-generating process. The framework is inherently designed to handle censored data and competing events. Experiments on multiple real-world datasets demonstrate that TraCeR achieves substantial and statistically significant performance improvements over state-of-the-art methods. Furthermore, our evaluation extends beyond discrimination metrics and assesses model calibration, addressing a key oversight in literature.

[673] Time-series Forecast for Indoor Zone Air Temperature with Long Horizons: A Case Study with Sensor-based Data from a Smart Building

Liping Sun, Yucheng Guo, Siliang Lu, Zhenzhen Li

Main category: cs.LG

TL;DR: Developed a time series forecast model for predicting building zone air temperature on a 2-week horizon using a hybrid physics-data-driven approach to support HVAC system optimization and building energy modeling.

Details

Motivation: Climate change and extreme weather require more energy-efficient and flexible HVAC systems, demanding rapid modeling and prediction of building zone air temperatures. Traditional simulation approaches are insufficient for rapid response to weather changes.

Method: Hybrid approach combining physics and data-driven techniques for time series forecasting. Developed a model specifically for 2-week horizon predictions of zone air temperature in American buildings.

Result: Created a time series forecast model capable of predicting zone air temperature on a 2-week horizon, addressing gaps in existing research for short- and long-term predictions.

Conclusion: The developed model can support intelligent HVAC control and operation (demand flexibility) and serve as hybrid building energy modeling, contributing to more energy-efficient and climate-responsive building management.

Abstract: With the press of global climate change, extreme weather and sudden weather changes are becoming increasingly common. To maintain a comfortable indoor environment and minimize the contribution of the building to climate change as much as possible, higher requirements are placed on the operation and control of HVAC systems, e.g., more energy-efficient and flexible to response to the rapid change of weather. This places demands on the rapid modeling and prediction of zone air temperatures of buildings. Compared to the traditional simulation-based approach such as EnergyPlus and DOE2, a hybrid approach combined physics and data-driven is more suitable. Recently, the availability of high-quality datasets and algorithmic breakthroughs have driven a considerable amount of work in this field. However, in the niche of short- and long-term predictions, there are still some gaps in existing research. This paper aims to develop a time series forecast model to predict the zone air temperature in a building located in America on a 2-week horizon. The findings could be further improved to support intelligent control and operation of HVAC systems (i.e. demand flexibility) and could also be used as hybrid building energy modeling.

[674] Mixture-of-Experts with Gradient Conflict-Driven Subspace Topology Pruning for Emergent Modularity

Yuxing Gan, Ziyu Lei

Main category: cs.LG

TL;DR: CDSP-MoE addresses catastrophic forgetting and instruction-overfitting in MoE architectures by shifting from isolated experts to dynamic expert instantiation within a shared subspace, using gradient conflicts to prune interfering connections and enable interpretable modular structures.

Details

Motivation: Current MoE architectures suffer from structural parameter isolation causing catastrophic forgetting, and instruction-overfitting that degrades performance in instruction-free scenarios. There's a need for more robust architectures that maintain semantic specialization without explicit task labels.

Method: CDSP-MoE maintains a super-complete parameter backbone where logical experts are carved out via learnable topology masks. It uses a Lagged Gradient Game that penalizes interfering connections in the shared manifold, enabling spontaneous pruning of conflicting pathways and evolution of interpretable modular structures.

Result: CDSP-MoE achieves robust content-driven routing without human-defined task labels, maintaining semantic specialization even under strict blind inference protocols where explicit instructions are absent.

Conclusion: The proposed CDSP-MoE framework successfully addresses fundamental limitations of current MoE architectures through a paradigm shift to dynamic expert instantiation within shared subspaces, using gradient conflicts as structural supervisory signals for improved modularity and robustness.

Abstract: Mixture-of-Experts (MoE) architectures achieve parameter efficiency through conditional computation, yet contemporary designs suffer from two fundamental limitations: structural parameter isolation that causes catastrophic forgetting, and instruction-overfitting that degrades performance in instruction-free scenarios. We propose CDSP-MoE (Conflict-Driven Subspace Pruning MoE), a framework that addresses these issues through a paradigm shift from isolated expert containers to dynamic expert instantiation within a shared physical subspace. Grounded in the Universal Weight Subspace Hypothesis, CDSP-MoE maintains a super-complete parameter backbone where logical experts are carved out via learnable topology masks. Unlike prior work that uses gradient conflict for token reassignment or optimization surgery, we leverage it as a structural supervisory signal: a Lagged Gradient Game penalizes interfering connections in the shared manifold, enabling the topology to spontaneously prune conflicting pathways and evolve interpretable modular structures. Experimental results demonstrate that CDSP-MoE achieves robust content-driven routing without human-defined task labels, maintaining semantic specialization even under strict blind inference protocols where explicit instructions are absent. Code is available at: https://github.com/konodiodaaaaa1/Conflict-Driven-Subspace-Pruning-Mixture-of-Experts

cs.MA

[675] ReCollab: Retrieval-Augmented LLMs for Cooperative Ad-hoc Teammate Modeling

Conor Wallace, Umer Siddique, Yongcan Cao

Main category: cs.MA

TL;DR: LLM-based framework for ad-hoc teamwork uses behavior rubrics and retrieval-augmented generation to classify teammate types and improve adaptation in cooperative environments.

Details

Motivation: Conventional approaches to ad-hoc teamwork rely on fixed probabilistic models or classifiers that can be brittle under partial observability and limited interaction. LLMs offer a flexible alternative by serving as behavioral world models that can map short behavioral traces into high-level hypotheses about teammate behavior.

Method: Introduces Collab, a language-based framework that classifies partner types using a behavior rubric derived from trajectory features. Extends to ReCollab which incorporates retrieval-augmented generation (RAG) to stabilize inference with exemplar trajectories.

Result: In the cooperative Overcooked environment, Collab effectively distinguishes teammate types, while ReCollab consistently improves adaptation across layouts, achieving Pareto-optimal trade-offs between classification accuracy and episodic return.

Conclusion: LLMs show potential as behavioral world models for ad-hoc teamwork, and retrieval grounding is important for challenging coordination settings.

Abstract: Ad-hoc teamwork (AHT) requires agents to infer the behavior of previously unseen teammates and adapt their policy accordingly. Conventional approaches often rely on fixed probabilistic models or classifiers, which can be brittle under partial observability and limited interaction. Large language models (LLMs) offer a flexible alternative: by mapping short behavioral traces into high-level hypotheses, they can serve as world models over teammate behavior. We introduce \Collab, a language-based framework that classifies partner types using a behavior rubric derived from trajectory features, and extend it to \ReCollab, which incorporates retrieval-augmented generation (RAG) to stabilize inference with exemplar trajectories. In the cooperative Overcooked environment, \Collab effectively distinguishes teammate types, while \ReCollab consistently improves adaptation across layouts, achieving Pareto-optimal trade-offs between classification accuracy and episodic return. These findings demonstrate the potential of LLMs as behavioral world models for AHT and highlight the importance of retrieval grounding in challenging coordination settings.

[676] Solving Multi-Agent Multi-Goal Path Finding Problems in Polynomial Time

Stefan Edelkamp

Main category: cs.MA

TL;DR: Multi-agent mission planning with dynamic goal assignment in graphs, achieving polynomial-time solutions for conflict-free routing despite traditional NP-hardness

Details

Motivation: Traditional multi-agent path finding requires pre-assigned goals, but real-world scenarios need dynamic goal assignment. Vehicle routing on graphs is NP-hard, but the authors aim to find efficient solutions for conflict-free multi-agent routing with automatic goal assignment.

Method: For continuous case: point agents in Euclidean plane solved arbitrarily close to optimal. For discrete variants: polynomial-time solutions using global assignment strategies to reduce conflicts, with remaining conflicts resolved via ants-on-the-stick concept, local assignment problems, path interleaving, and destination kicking.

Result: Shows that discrete variants with node/edge conflicts can be solved in polynomial time (unexpected result since traditional vehicle routing is NP-hard). Implements a planner that finds conflict-free optimized routes with global assignment strategies greatly reducing conflicts.

Conclusion: The paper presents an efficient approach to multi-agent mission planning with dynamic goal assignment that achieves polynomial-time solutions for conflict-free routing in graphs, overcoming traditional NP-hardness through novel conflict resolution strategies.

Abstract: In this paper, we plan missions for a fleet of agents in undirected graphs, such as grids, with multiple goals. In contrast to regular multi-agent path-finding, the solver finds and updates the assignment of goals to the agents on its own. In the continuous case for a point agent with motions in the Euclidean plane, the problem can be solved arbitrarily close to optimal. For discrete variants that incur node and edge conflicts, we show that it can be solved in polynomial time, which is unexpected, since traditional vehicle routing on general graphs is NP-hard. We implement a corresponding planner that finds conflict-free optimized routes for the agents. Global assignment strategies greatly reduce the number of conflicts, with the remaining ones resolved by elaborating on the concept of ants-on-the-stick, by solving local assignment problems, by interleaving agent paths, and by kicking agents that have already arrived out of their destinations

[677] Hierarchical Pedagogical Oversight: A Multi-Agent Adversarial Framework for Reliable AI Tutoring

Saisab Sadhu, Ashim Dhor

Main category: cs.MA

TL;DR: HPO framework uses adversarial synthesis for educational assessment, achieving better performance than GPT-4o with 20x fewer parameters on math tutoring tasks.

Details

Motivation: LLMs deployed as automated tutors often fail at pedagogical reasoning, showing sycophancy (validating incorrect solutions) or providing overly direct answers that hinder learning, especially in resource-constrained environments.

Method: Hierarchical Pedagogical Oversight (HPO) adapts structured adversarial synthesis to education. It uses specialist agents to distill dialogue context, then grounds a moderated five-act debate between opposing pedagogical critics, enforcing dialectical separation of concerns.

Result: On the MRBench dataset of 1,214 middle-school mathematics dialogues, the 8B-parameter model achieves Macro F1 of 0.845, outperforming GPT-4o (0.812) by 3.3% while using 20 times fewer parameters.

Conclusion: Adversarial reasoning is a critical mechanism for deploying reliable, low-compute pedagogical oversight in resource-constrained environments, establishing HPO as an effective framework for educational assessment.

Abstract: Large Language Models (LLMs) are increasingly deployed as automated tutors to address educator shortages; however, they often fail at pedagogical reasoning, frequently validating incorrect student solutions (sycophancy) or providing overly direct answers that hinder learning. We introduce Hierarchical Pedagogical Oversight (HPO), a framework that adapts structured adversarial synthesis to educational assessment. Unlike cooperative multi-agent systems that often drift toward superficial consensus, HPO enforces a dialectical separation of concerns: specialist agents first distill dialogue context, which then grounds a moderated, five-act debate between opposing pedagogical critics. We evaluate this framework on the MRBench dataset of 1,214 middle-school mathematics dialogues. Our 8B-parameter model achieves a Macro F1 of 0.845, outperforming GPT-4o (0.812) by 3.3% while using 20 times fewer parameters. These results establish adversarial reasoning as a critical mechanism for deploying reliable, low-compute pedagogical oversight in resource-constrained environments.

[678] MARPO: A Reflective Policy Optimization for Multi Agent Reinforcement Learning

Cuiling Wu, Yaozhong Gan, Junliang Xing, Ying Fu

Main category: cs.MA

TL;DR: MARPO improves multi-agent RL sample efficiency with reflection mechanism and asymmetric clipping for better stability.

Details

Motivation: Address sample inefficiency in multi-agent reinforcement learning, which is a key challenge in MARL due to complex interactions between agents and high-dimensional state-action spaces.

Method: Two key components: 1) Reflection mechanism that leverages subsequent trajectories to enhance sample efficiency, 2) Asymmetric clipping mechanism derived from KL divergence that dynamically adjusts clipping range to improve training stability.

Result: MARPO consistently outperforms other methods in classic multi-agent environments, demonstrating improved performance and efficiency.

Conclusion: MARPO effectively addresses sample inefficiency in multi-agent RL through innovative reflection and adaptive clipping mechanisms, offering a promising approach for more efficient and stable multi-agent learning.

Abstract: We propose Multi Agent Reflective Policy Optimization (MARPO) to alleviate the issue of sample inefficiency in multi agent reinforcement learning. MARPO consists of two key components: a reflection mechanism that leverages subsequent trajectories to enhance sample efficiency, and an asymmetric clipping mechanism that is derived from the KL divergence and dynamically adjusts the clipping range to improve training stability. We evaluate MARPO in classic multi agent environments, where it consistently outperforms other methods.

[679] Reinforcement Networks: novel framework for collaborative Multi-Agent Reinforcement Learning tasks

Maksim Kryzhanovskiy, Svetlana Glazyrina, Roman Ischenko, Konstantin Vorontsov

Main category: cs.MA

TL;DR: Reinforcement Networks: A framework organizing agents as vertices in directed acyclic graphs for flexible, scalable multi-agent RL without restrictive architectural assumptions.

Details

Motivation: Modern AI systems often have multiple learnable components naturally organized as graphs, but current approaches have limitations like restrictive topologies, fully centralized training, and architectural constraints. There's a need for end-to-end training of such systems without these limitations.

Method: Introduces Reinforcement Networks framework where agents are organized as vertices in directed acyclic graphs (DAGs). Extends hierarchical RL to arbitrary DAGs, enabling flexible credit assignment and scalable coordination. Formalizes training and inference methods, connects to LevelEnv concept for reproducible construction, training, and evaluation.

Result: Demonstrated effectiveness on several collaborative MARL setups by developing Reinforcement Networks models that achieve improved performance over standard MARL baselines. The framework unifies hierarchical, modular, and graph-structured views of MARL.

Conclusion: Reinforcement Networks provide a principled path toward designing and training complex multi-agent systems, opening a new line of research in scalable, structured MARL. Future directions include richer graph morphologies, compositional curricula, and graph-aware exploration.

Abstract: Modern AI systems often comprise multiple learnable components that can be naturally organized as graphs. A central challenge is the end-to-end training of such systems without restrictive architectural or training assumptions. Such tasks fit the theory and approaches of the collaborative Multi-Agent Reinforcement Learning (MARL) field. We introduce Reinforcement Networks, a general framework for MARL that organizes agents as vertices in a directed acyclic graph (DAG). This structure extends hierarchical RL to arbitrary DAGs, enabling flexible credit assignment and scalable coordination while avoiding strict topologies, fully centralized training, and other limitations of current approaches. We formalize training and inference methods for the Reinforcement Networks framework and connect it to the LevelEnv concept to support reproducible construction, training, and evaluation. We demonstrate the effectiveness of our approach on several collaborative MARL setups by developing several Reinforcement Networks models that achieve improved performance over standard MARL baselines. Beyond empirical gains, Reinforcement Networks unify hierarchical, modular, and graph-structured views of MARL, opening a principled path toward designing and training complex multi-agent systems. We conclude with theoretical and practical directions - richer graph morphologies, compositional curricula, and graph-aware exploration. That positions Reinforcement Networks as a foundation for a new line of research in scalable, structured MARL.

[680] Heterogeneity in Multi-Agent Reinforcement Learning

Tianyi Hu, Zhiqiang Pu, Yuan Wang, Tenghai Qiu, Min Chen, Xin Yu

Main category: cs.MA

TL;DR: This paper provides a systematic framework for understanding, quantifying, and utilizing heterogeneity in multi-agent reinforcement learning (MARL), including definitions, measurement methods, and practical algorithms.

Details

Motivation: Heterogeneity is fundamental in MARL but lacks rigorous definition and deeper understanding. The field needs systematic approaches to define, quantify, and leverage agent heterogeneity for better algorithm design and interpretability.

Method: 1) Categorize heterogeneity into five types with mathematical definitions based on agent-level modeling. 2) Define heterogeneity distance and propose practical quantification methods. 3) Design heterogeneity-based multi-agent dynamic parameter sharing algorithm as application example.

Result: Case studies show the method effectively identifies and quantifies various agent heterogeneity types. Experimental results demonstrate the proposed algorithm has better interpretability and stronger adaptability compared to other parameter sharing baselines.

Conclusion: The proposed methodology helps the MARL community gain comprehensive understanding of heterogeneity and promotes development of practical algorithms with improved interpretability and adaptability.

Abstract: Heterogeneity is a fundamental property in multi-agent reinforcement learning (MARL), which is closely related not only to the functional differences of agents, but also to policy diversity and environmental interactions. However, the MARL field currently lacks a rigorous definition and deeper understanding of heterogeneity. This paper systematically discusses heterogeneity in MARL from the perspectives of definition, quantification, and utilization. First, based on an agent-level modeling of MARL, we categorize heterogeneity into five types and provide mathematical definitions. Second, we define the concept of heterogeneity distance and propose a practical quantification method. Third, we design a heterogeneity-based multi-agent dynamic parameter sharing algorithm as an example of the application of our methodology. Case studies demonstrate that our method can effectively identify and quantify various types of agent heterogeneity. Experimental results show that the proposed algorithm, compared to other parameter sharing baselines, has better interpretability and stronger adaptability. The proposed methodology will help the MARL community gain a more comprehensive and profound understanding of heterogeneity, and further promote the development of practical algorithms.

[681] Assessing behaviour coverage in a multi-agent system simulation for autonomous vehicle testing

Manuel Franco-Vivo

Main category: cs.MA

TL;DR: This paper presents a behavior coverage analysis framework for autonomous vehicle testing in multi-agent simulations, proposing an MPC-based pedestrian agent to generate more realistic and interesting test scenarios.

Details

Motivation: As autonomous vehicles advance, comprehensive testing methodologies are essential to ensure safety and reliability in diverse real-world scenarios. Current simulation environments need better ways to measure and assess behavior coverage to validate system effectiveness and robustness.

Method: The study develops a systematic approach to measure behavior coverage in multi-agent simulations by defining driving scenarios and agent interactions. It introduces a Model Predictive Control (MPC) pedestrian agent with an objective function designed to generate interesting tests while promoting realistic behavior compared to previous pedestrian models.

Result: The analysis of behavior coverage metrics and coverage-based testing identifies key areas for improvement in simulation frameworks. The proposed MPC pedestrian agent successfully encourages more interesting tests and realistic behaviors than previously studied pedestrian agents.

Conclusion: This research advances autonomous vehicle testing by providing insights into comprehensive behavior evaluation in simulated environments. The findings offer valuable implications for enhancing safety, reliability, and performance through rigorous testing methodologies that better capture real-world complexity.

Abstract: As autonomous vehicle technology advances, ensuring the safety and reliability of these systems becomes paramount. Consequently, comprehensive testing methodologies are essential to evaluate the performance of autonomous vehicles in diverse and complex real-world scenarios. This study focuses on the behaviour coverage analysis of a multi-agent system simulation designed for autonomous vehicle testing, and provides a systematic approach to measure and assess behaviour coverage within the simulation environment. By defining a set of driving scenarios, and agent interactions, we evaluate the extent to which the simulation encompasses a broad range of behaviours relevant to autonomous driving. Our findings highlight the importance of behaviour coverage in validating the effectiveness and robustness of autonomous vehicle systems. Through the analysis of behaviour coverage metrics and coverage-based testing, we identify key areas for improvement and optimization in the simulation framework. Thus, a Model Predictive Control (MPC) pedestrian agent is proposed, where its objective function is formulated to encourage \textit{interesting} tests while promoting a more realistic behaviour than other previously studied pedestrian agents. This research contributes to advancing the field of autonomous vehicle testing by providing insights into the comprehensive evaluation of system behaviour in simulated environments. The results offer valuable implications for enhancing the safety, reliability, and performance of autonomous vehicles through rigorous testing methodologies.

[682] Towards Global Optimality in Cooperative MARL with the Transformation And Distillation Framework

Jianing Ye, Chenghao Li, Yongqiang Dou, Jianhao Wang, Guangwen Yang, Chongjie Zhang

Main category: cs.MA

TL;DR: TAD framework addresses suboptimality in decentralized MARL by transforming multi-agent MDP to sequential single-agent MDP and distilling policies, with TAD-PPO showing strong empirical performance.

Details

Motivation: Most MARL algorithms use decentralized policies with gradient descent but lack theoretical analysis, and are found to be suboptimal in practice despite enabling decentralized execution.

Method: Proposes Transformation And Distillation (TAD) framework: 1) Reformulates multi-agent MDP as sequential single-agent MDP, 2) Distills learned policy for decentralized execution. Two-stage learning paradigm with TAD-PPO implementation based on PPO.

Result: Theoretical proof of suboptimality for existing decentralized MARL algorithms. TAD-PPO shows significant outperformance across diverse tasks: matrix games, hallway tasks, StarCraft II, and football games.

Conclusion: TAD framework provides optimality guarantee for cooperative MARL while maintaining decentralized execution, addressing fundamental optimization issues in existing decentralized policy approaches.

Abstract: Decentralized execution is one core demand in multi-agent reinforcement learning (MARL). Recently, most popular MARL algorithms have adopted decentralized policies to enable decentralized execution, and use gradient descent as the optimizer. However, there is hardly any theoretical analysis of these algorithms taking the optimization method into consideration, and we find that various popular MARL algorithms with decentralized policies are suboptimal in toy tasks when gradient descent is chosen as their optimization method. In this paper, we theoretically analyze two common classes of algorithms with decentralized policies – multi-agent policy gradient methods and value-decomposition methods, and prove their suboptimality when gradient descent is used. To address the suboptimality issue, we propose the Transformation And Distillation (TAD) framework, which reformulates a multi-agent MDP as a special single-agent MDP with a sequential structure and enables decentralized execution by distilling the learned policy on the derived “single-agent” MDP. The approach is a two-stage learning paradigm that addresses the optimization problem in cooperative MARL, providing optimality guarantee with decent execution performance. Empirically, we implement TAD-PPO based on PPO, which can theoretically perform optimal policy learning in the finite multi-agent MDPs and shows significant outperformance on a large set of cooperative multi-agent tasks, from matrix game, hallway task, to StarCraft II, and football game.

[683] QLLM: Do We Really Need a Mixing Network for Credit Assignment in Multi-Agent Reinforcement Learning?

Zhouyang Jiang, Bin Zhang, Yuanjun Li, Zhiwei Xu

Main category: cs.MA

TL;DR: QLLM uses large language models to automatically construct credit assignment functions for multi-agent reinforcement learning, addressing limitations of traditional value decomposition methods.

Details

Motivation: Traditional MARL credit assignment methods suffer from imprecise contribution attribution, limited interpretability, and poor scalability in high-dimensional spaces. Existing value decomposition approaches using neural networks have fundamental limitations in capturing the nonlinear relationship between individual and global Q-values.

Method: QLLM leverages LLMs to automatically construct credit assignment functions through TFCAF (direct nonlinear functional formulation), uses a coder-evaluator framework for code generation/verification to reduce hallucination, and implements an IGM-Gating Mechanism for flexible monotonicity constraint enforcement.

Result: Extensive experiments on standard MARL benchmarks show QLLM consistently outperforms state-of-the-art baselines, exhibits strong generalization capability, and maintains compatibility with various MARL algorithms using mixing networks.

Conclusion: QLLM presents a promising and versatile solution for complex multi-agent scenarios by combining LLMs’ expressive power with MARL credit assignment, offering improved performance, interpretability, and scalability over traditional methods.

Abstract: Credit assignment has remained a fundamental challenge in multi-agent reinforcement learning (MARL). Previous studies have primarily addressed this issue through value decomposition methods under the centralized training with decentralized execution paradigm, where neural networks are utilized to approximate the nonlinear relationship between individual Q-values and the global Q-value. Although these approaches have achieved considerable success in various benchmark tasks, they still suffer from several limitations, including imprecise attribution of contributions, limited interpretability, and poor scalability in high-dimensional state spaces. To address these challenges, we propose a novel algorithm, QLLM, which facilitates the automatic construction of credit assignment functions using large language models (LLMs). Specifically, the concept of TFCAF is introduced, wherein the credit allocation process is represented as a direct and expressive nonlinear functional formulation. A custom-designed coder-evaluator framework is further employed to guide the generation and verification of executable code by LLMs, significantly mitigating issues such as hallucination and shallow reasoning during inference. Furthermore, an IGM-Gating Mechanism enables QLLM to flexibly enforce or relax the monotonicity constraint depending on task demands, covering both IGM-compliant and non-monotonic scenarios. Extensive experiments conducted on several standard MARL benchmarks demonstrate that the proposed method consistently outperforms existing state-of-the-art baselines. Moreover, QLLM exhibits strong generalization capability and maintains compatibility with a wide range of MARL algorithms that utilize mixing networks, positioning it as a promising and versatile solution for complex multi-agent scenarios. The code is available at https://github.com/zhouyangjiang71-sys/QLLM.

[684] MTTR-A: Measuring Cognitive Recovery Latency in Multi-Agent Systems

Barak Or

Main category: cs.MA

TL;DR: MTTR-A is a new runtime reliability metric that measures cognitive recovery latency in multi-agent systems built on LLMs, addressing cognitive failures rather than infrastructure faults.

Details

Motivation: Current reliability limitations in LLM-based multi-agent systems stem from cognitive failures rather than infrastructure faults, and existing observability tools only describe failures without quantifying how quickly distributed reasoning recovers once coherence is lost.

Method: The authors introduce MTTR-A (Mean Time-to-Recovery for Agentic Systems), adapting classical dependability theory to agentic orchestration to capture reasoning drift detection and recovery time. They also define complementary metrics (MTBF, normalized recovery ratio) and establish theoretical bounds linking recovery latency to cognitive uptime. They validate using a LangGraph-based benchmark with simulated drift and reflex recovery.

Result: The paper demonstrates measurable recovery behavior across multiple reflex strategies using their benchmark, establishing quantitative foundations for runtime cognitive dependability in distributed agentic systems.

Conclusion: This work provides a quantitative foundation for assessing runtime cognitive dependability in distributed agentic systems, moving beyond descriptive failure analysis to measurable recovery metrics.

Abstract: Reliability in multi-agent systems (MAS) built on large language models is increasingly limited by cognitive failures rather than infrastructure faults. Existing observability tools describe failures but do not quantify how quickly distributed reasoning recovers once coherence is lost. We introduce MTTR-A (Mean Time-to-Recovery for Agentic Systems), a runtime reliability metric that measures cognitive recovery latency in MAS. MTTR-A adapts classical dependability theory to agentic orchestration, capturing the time required to detect reasoning drift and restore coherent operation. We further define complementary metrics, including MTBF and a normalized recovery ratio (NRR), and establish theoretical bounds linking recovery latency to long-run cognitive uptime. Using a LangGraph-based benchmark with simulated drift and reflex recovery, we empirically demonstrate measurable recovery behavior across multiple reflex strategies. This work establishes a quantitative foundation for runtime cognitive dependability in distributed agentic systems.

cs.MM

[685] Mesquite MoCap: Democratizing Real-Time Motion Capture with Affordable, Bodyworn IoT Sensors and WebXR SLAM

Poojan Vanani, Darsh Patel, Danyal Khorami, Siva Munaganuru, Pavan Reddy, Varun Reddy, Bhargav Raghunath, Ishrat Lallmamode, Romir Patel, Assegid Kidané, Tejaswi Gowda

Main category: cs.MM

TL;DR: Mesquite is an open-source, low-cost inertial motion capture system using 15 IMU sensors and smartphone SLAM that runs entirely in web browsers, achieving 2-5° joint-angle error at 5% of commercial optical system cost.

Details

Motivation: Motion capture systems are expensive and complex, limiting accessibility outside specialized labs. There's a need for affordable, accessible motion capture technology for broader applications in entertainment, biomechanics, healthcare, HCI, and VR.

Method: Combines 15 IMU sensor nodes worn on the body with a hip-worn Android smartphone for position tracking. Uses low-power wireless streaming of quaternion orientations to a USB dongle and browser-based application built on modern web technologies (WebGL, WebXR for SLAM, WebSerial, WebSockets, Progressive Web Apps).

Result: Achieves mean joint-angle error of 2-5 degrees compared to commercial optical systems at approximately 5% of the cost. Sustains 30 FPS with end-to-end latency under 15ms and packet delivery rate ≥99.7% in standard indoor environments.

Conclusion: Mesquite successfully lowers the barrier to motion capture by leveraging IoT principles, edge processing, and web-native technology. The system is released as open-source (GNU GPL) including hardware designs, firmware, and software.

Abstract: Motion capture remains costly and complex to deploy, limiting use outside specialized laboratories. We present Mesquite, an open-source, low-cost inertial motion-capture system that combines a body-worn network of 15 IMU sensor nodes with a hip-worn Android smartphone for position tracking. A low-power wireless link streams quaternion orientations to a central USB dongle and a browser-based application for real-time visualization and recording. Built on modern web technologies – WebGL for rendering, WebXR for SLAM, WebSerial and WebSockets for device and network I/O, and Progressive Web Apps for packaging – the system runs cross-platform entirely in the browser. In benchmarks against a commercial optical system, Mesquite achieves mean joint-angle error of 2-5 degrees while operating at approximately 5% of the cost. The system sustains 30 frames per second with end-to-end latency under 15ms and a packet delivery rate of at least 99.7% in standard indoor environments. By leveraging IoT principles, edge processing, and a web-native stack, Mesquite lowers the barrier to motion capture for applications in entertainment, biomechanics, healthcare monitoring, human-computer interaction, and virtual reality. We release hardware designs, firmware, and software under an open-source license (GNU GPL).

[686] Multi Agents Semantic Emotion Aligned Music to Image Generation with Music Derived Captions

Junchang Shi, Gang Li

Main category: cs.MM

TL;DR: MESA MIG is a multi-agent framework that generates images from music by producing structured captions and refining them with specialized agents, while also aligning semantic and emotional content through valence-arousal predictions.

Details

Motivation: People often experience visual imagery when listening to music, but there's a need to externalize this inner imagery by generating images that are semantically and emotionally aligned with the music.

Method: A multi-agent framework that first generates structured music captions, then refines them with specialized agents for scene, motion, style, color, and composition. Uses parallel valence-arousal regression from music and CLIP-based visual VA estimation to enforce semantic and emotional alignment.

Result: Outperforms caption-only and single-agent baselines in aesthetic quality, semantic consistency, and VA alignment. Achieves competitive emotion regression performance compared to state-of-the-art music and image emotion models.

Conclusion: MESA MIG successfully generates visually rich images from music by combining multi-agent caption refinement with emotional alignment mechanisms, effectively externalizing the visual imagery people experience when listening to music.

Abstract: When people listen to music, they often experience rich visual imagery. We aim to externalize this inner imagery by generating images conditioned on music. We propose MESA MIG, a multi agent semantic and emotion aligned framework that first produces structured music captions and then refines them with cooperating agents specializing in scene, motion, style, color, and composition. In parallel, a Valence Arousal regression head predicts continuous affective states from music, while a CLIP based visual VA head estimates emotions from images. These components jointly enforce semantic and emotional alignment between music and synthesized images. Experiments on curated music image pairs show that MESA MIG outperforms caption only and single agent baselines in aesthetic quality, semantic consistency, and VA alignment, and achieves competitive emotion regression performance compared with state of the art music and image emotion models.

[687] Unlocking WebRTC for End User Driven Innovation

Kundan Singh

Main category: cs.MM

TL;DR: RTC Helper is a browser extension that intercepts WebRTC APIs to enable real-time customization of web communication apps without rebuilding or redeploying.

Details

Motivation: To enable end-user innovation and rapid prototyping for web multimedia communication applications, allowing customization of third-party web apps without requiring code changes or redeployment.

Method: Developed a browser extension that intercepts WebRTC and related APIs in the browser, allowing real-time behavior modification of web applications with over 100 built-in examples across 10+ customization categories.

Result: Created a functional software architecture that enables both end users and developers to customize web communication apps in real-time, supporting a wide range of novel use cases in web-based audio/video communication.

Conclusion: RTC Helper provides a flexible, general-purpose solution for enabling innovation in web communication applications by allowing real-time customization without the need for rebuilding or redeploying applications.

Abstract: We present a software architecture to enable end user driven innovation of web multimedia communication applications. RTC Helper is a simple and easy-to-use software that can intercept WebRTC (web real-time communication) and related APIs in the browser, and change the behavior of web apps in real-time. Such customization can even be driven by the end user on third-party web apps using our flexible and general purpose browser extension. It also facilitates rapid prototyping of ideas by web developers in their existing web apps without having to rebuild or redeploy after every change. It has more than ten customization categories, and over a hundred built-in examples covering a wide range of novel use cases in web-based audio/video communication.

[688] Plasticity-Aware Mixture of Experts for Learning Under QoE Shifts in Adaptive Video Streaming

Zhiqiang He, Zhi Liu

Main category: cs.MM

TL;DR: PA-MoE framework addresses plasticity loss in neural networks for adaptive video streaming by dynamically balancing memory retention with selective forgetting through noise injection, achieving 45.5% QoE improvement.

Details

Motivation: Traditional neural networks struggle with plasticity loss when adapting to user-specific QoE functions in video streaming, as different users and content require varying optimization objectives that evolve over time.

Method: Proposes Plasticity-Aware Mixture of Experts (PA-MoE) framework that dynamically modulates network plasticity using noise injection to promote selective forgetting of outdated knowledge while retaining useful information.

Result: PA-MoE achieves 45.5% improvement in QoE over baselines in dynamic streaming environments, effectively mitigates plasticity loss by optimizing neuron utilization, and experimental results align with theoretical predictions.

Conclusion: PA-MoE successfully addresses plasticity loss in neural networks for adaptive video streaming through dynamic plasticity modulation, enabling better adaptation to evolving user-specific QoE objectives with theoretical guarantees.

Abstract: Adaptive video streaming systems are designed to optimize Quality of Experience (QoE) and, in turn, enhance user satisfaction. However, differences in user profiles and video content lead to different weights for QoE factors, resulting in user-specific QoE functions and, thus, varying optimization objectives. This variability poses significant challenges for neural networks, as they often struggle to generalize under evolving targets - a phenomenon known as plasticity loss that prevents conventional models from adapting effectively to changing optimization objectives. To address this limitation, we propose the Plasticity-Aware Mixture of Experts (PA-MoE), a novel learning framework that dynamically modulates network plasticity by balancing memory retention with selective forgetting. In particular, PA-MoE leverages noise injection to promote the selective forgetting of outdated knowledge, thereby endowing neural networks with enhanced adaptive capabilities. In addition, we present a rigorous theoretical analysis of PA-MoE by deriving a regret bound that quantifies its learning performance. Experimental evaluations demonstrate that PA-MoE achieves a 45.5% improvement in QoE over competitive baselines in dynamic streaming environments. Further analysis reveals that the model effectively mitigates plasticity loss by optimizing neuron utilization. Finally, a parameter sensitivity study is performed by injecting varying levels of noise, and the results align closely with our theoretical predictions.

eess.AS

[689] Geometry-Aware Optimization for Respiratory Sound Classification: Enhancing Sensitivity with SAM-Optimized Audio Spectrogram Transformers

Atakan Işık, Selin Vulga Işık, Ahmet Feridun Işık, Mahşuk Taylan

Main category: eess.AS

TL;DR: AST + SAM framework achieves SOTA 68.10% on ICBHI 2017 respiratory sound classification by finding flatter minima to prevent overfitting on noisy, imbalanced medical data.

Details

Motivation: Respiratory sound classification faces challenges from small, noisy, imbalanced datasets like ICBHI 2017. Transformers overfit and converge to sharp minima on such constrained medical data, hurting generalization.

Method: Enhances Audio Spectrogram Transformer (AST) with Sharpness-Aware Minimization (SAM) to optimize loss surface geometry toward flatter minima. Also uses weighted sampling to handle class imbalance.

Result: Achieves state-of-the-art 68.10% on ICBHI 2017 dataset, outperforming CNN/hybrid baselines. Key improvement: 68.31% sensitivity for reliable clinical screening. t-SNE and attention maps confirm robust feature learning.

Conclusion: SAM-enhanced AST framework effectively addresses overfitting in medical audio classification by finding flatter minima, improving generalization on noisy, imbalanced datasets while maintaining clinical reliability.

Abstract: Respiratory sound classification is hindered by the limited size, high noise levels, and severe class imbalance of benchmark datasets like ICBHI 2017. While Transformer-based models offer powerful feature extraction capabilities, they are prone to overfitting and often converge to sharp minima in the loss landscape when trained on such constrained medical data. To address this, we introduce a framework that enhances the Audio Spectrogram Transformer (AST) using Sharpness-Aware Minimization (SAM). Instead of merely minimizing the training loss, our approach optimizes the geometry of the loss surface, guiding the model toward flatter minima that generalize better to unseen patients. We also implement a weighted sampling strategy to handle class imbalance effectively. Our method achieves a state-of-the-art score of 68.10% on the ICBHI 2017 dataset, outperforming existing CNN and hybrid baselines. More importantly, it reaches a sensitivity of 68.31%, a crucial improvement for reliable clinical screening. Further analysis using t-SNE and attention maps confirms that the model learns robust, discriminative features rather than memorizing background noise.

[690] Spatial Interpolation of Room Impulse Responses based on Deeper Physics-Informed Neural Networks with Residual Connections

Ken Kurata, Gen Sato, Izumi Tsunokuni, Yusuke Ikeda

Main category: eess.AS

TL;DR: Deeper PINN architecture with residual connections and sinusoidal activations improves RIR estimation accuracy for both interpolation and extrapolation tasks.

Details

Motivation: Estimating room impulse responses (RIRs) from limited measurements is important for sound propagation analysis, but the role of network depth in physics-informed neural networks (PINNs) for this task hasn't been systematically studied.

Method: Developed a deeper PINN architecture with residual connections and compared different activation functions (tanh vs sinusoidal) to analyze how network depth affects RIR estimation performance.

Result: Residual PINN with sinusoidal activations achieved highest accuracy for both interpolation and extrapolation of RIRs, enabled stable training as depth increases, and improved estimation of reflection components.

Conclusion: The proposed architecture provides practical guidelines for designing deep and stable PINNs for acoustic-inverse problems, with residual connections and sinusoidal activations being key to improved performance.

Abstract: The room impulse response (RIR) characterizes sound propagation in a room from a loudspeaker to a microphone under the linear time-invariant assumption. Estimating RIRs from a limited number of measurement points is crucial for sound propagation analysis and visualization. Physics-informed neural networks (PINNs) have recently been introduced for accurate RIR estimation by embedding governing physical laws into deep learning models; however, the role of network depth has not been systematically investigated. In this study, we developed a deeper PINN architecture with residual connections and analyzed how network depth affects estimation performance. We further compared activation functions, including tanh and sinusoidal activations. Our results indicate that the residual PINN with sinusoidal activations achieves the highest accuracy for both interpolation and extrapolation of RIRs. Moreover, the proposed architecture enables stable training as the depth increases and yields notable improvements in estimating reflection components. These results provide practical guidelines for designing deep and stable PINNs for acoustic-inverse problems.

[691] Flow2GAN: Hybrid Flow Matching and GAN with Multi-Resolution Network for Few-step High-Fidelity Audio Generation

Zengwei Yao, Wei Kang, Han Zhu, Liyong Guo, Lingxuan Ye, Fangjun Kuang, Weiji Zhuang, Zhaoqing Li, Zhifeng Han, Long Lin, Daniel Povey

Main category: eess.AS

TL;DR: Flow2GAN: Two-stage audio generation framework combining Flow Matching training with GAN fine-tuning for efficient high-quality audio synthesis.

Details

Motivation: Existing audio generation methods have limitations - GANs suffer from slow convergence and mode collapse, while diffusion/Flow Matching methods require computationally expensive multi-step inference. Need better quality-efficiency trade-off.

Method: Two-stage framework: 1) Improved Flow Matching for audio with endpoint estimation objective and spectral energy-based loss scaling, 2) Lightweight GAN fine-tuning to enable one-step generation. Plus multi-branch network architecture for different time-frequency resolutions.

Result: Achieves high-fidelity audio generation from Mel-spectrograms or discrete tokens, with better quality-efficiency trade-offs than state-of-the-art GAN and Flow Matching methods.

Conclusion: Flow2GAN successfully combines the strengths of Flow Matching (generative capability) and GANs (efficient inference) for audio generation, with architectural improvements for audio-specific properties.

Abstract: Existing dominant methods for audio generation include Generative Adversarial Networks (GANs) and diffusion-based methods like Flow Matching. GANs suffer from slow convergence and potential mode collapse during training, while diffusion methods require multi-step inference that introduces considerable computational overhead. In this work, we introduce Flow2GAN, a two-stage framework that combines Flow Matching training for learning generative capabilities with GAN fine-tuning for efficient few-step inference. Specifically, given audio’s unique properties, we first improve Flow Matching for audio modeling through: 1) reformulating the objective as endpoint estimation, avoiding velocity estimation difficulties when involving empty regions; 2) applying spectral energy-based loss scaling to emphasize perceptually salient quieter regions. Building on these Flow Matching adaptations, we demonstrate that a further stage of lightweight GAN fine-tuning enables us to obtain one-step generator that produces high-quality audio. In addition, we develop a multi-branch network architecture that processes Fourier coefficients at different time-frequency resolutions, which improves the modeling capabilities compared to prior single-resolution designs. Experimental results indicate that our Flow2GAN delivers high-fidelity audio generation from Mel-spectrograms or discrete audio tokens, achieving better quality-efficiency trade-offs than existing state-of-the-art GAN-based and Flow Matching-based methods. Online demo samples are available at https://flow2gan.github.io, and the source code is released at https://github.com/k2-fsa/Flow2GAN.

Dhruv Nigam

Main category: eess.AS

TL;DR: The paper explores dereverberation techniques using NMFD and its variants, proposing a novel approach that applies NMFD to activation matrices, with mixed quantitative results.

Details

Motivation: Dereverberation is a critical problem in speech processing, aiming to remove reverberant effects from recorded speech signals to improve speech quality and intelligibility.

Method: The paper implements and extends dereverberation techniques using Non-negative Matrix Factor Deconvolution (NMFD). It incorporates convolutive NMF-based representations and frame-stacked models to exploit temporal dependencies. A novel approach applies NMFD to the activation matrix of reverberated magnitude spectrograms.

Result: While qualitative verification of literature claims was achieved, exact results could not be matched. The novel approach showed improvement in quantitative metrics (PESQ and Cepstral Distortion) but was inconsistent.

Conclusion: The study demonstrates the potential of NMFD-based approaches for speech dereverberation, with the novel activation matrix method showing promise despite inconsistency in quantitative improvements.

Abstract: Dereverberation of recorded speech signals is one of the most pertinent problems in speech processing. In the present work, the objective is to understand and implement dereverberation techniques that aim at enhancing the magnitude spectrogram of reverberant speech signals to remove the reverberant effects introduced. An approach to estimate a clean speech spectrogram from the reverberant speech spectrogram is proposed. This is achieved through non-negative matrix factor deconvolution(NMFD). Further, this approach is extended using the NMF representation for speech magnitude spectrograms. To exploit temporal dependencies, a convolutive NMF-based representation and a frame-stacked model are incorporated into the NMFD framework for speech. A novel approach for dereverberation by applying NMFD to the activation matrix of the reverberated magnitude spectrogram is also proposed. Finally, a comparative analysis of the performance of the listed techniques, using sentence recordings from the TIMIT database and recorded room impulse responses from the Reverb 2014 challenge, is presented based on two key objective measures - PESQ and Cepstral Distortion.\ Although we were qualitatively able to verify the claims made in literature regarding these techniques, exact results could not be matched. The novel approach, as it is suggested, provides improvement in quantitative metrics, but is not consistent

[693] Decoding EEG Speech Perception with Transformers and VAE-based Data Augmentation

Terrance Yu-Hao Chen, Yulin Chen, Pontus Soederhaell, Sadrishya Agrawal, Kateryna Shapovalenko

Main category: eess.AS

TL;DR: VAE-based EEG data augmentation and SOTA sequence-to-sequence models improve EEG-based speech decoding, showing promise for sentence generation over classification tasks.

Details

Motivation: EEG-based speech decoding faces challenges like noisy data, limited datasets, and poor performance on complex tasks like speech perception, which hinders BCI applications for silent communication and assistive technologies.

Method: Used variational autoencoders (VAEs) for EEG data augmentation to improve data quality, and applied a state-of-the-art sequence-to-sequence deep learning architecture (originally successful in EMG tasks) to EEG-based speech decoding, adapting it for word classification tasks using the Brennan dataset.

Result: VAEs show potential for reconstructing artificial EEG data for augmentation. The sequence-to-sequence model achieves more promising performance in generating sentences compared to the classification model, though both tasks remain challenging.

Conclusion: The findings lay groundwork for future EEG speech perception decoding research, with possible extensions to speech production tasks like silent or imagined speech, suggesting sequence-to-sequence approaches are promising for complex EEG-based speech decoding tasks.

Abstract: Decoding speech from non-invasive brain signals, such as electroencephalography (EEG), has the potential to advance brain-computer interfaces (BCIs), with applications in silent communication and assistive technologies for individuals with speech impairments. However, EEG-based speech decoding faces major challenges, such as noisy data, limited datasets, and poor performance on complex tasks like speech perception. This study attempts to address these challenges by employing variational autoencoders (VAEs) for EEG data augmentation to improve data quality and applying a state-of-the-art (SOTA) sequence-to-sequence deep learning architecture, originally successful in electromyography (EMG) tasks, to EEG-based speech decoding. Additionally, we adapt this architecture for word classification tasks. Using the Brennan dataset, which contains EEG recordings of subjects listening to narrated speech, we preprocess the data and evaluate both classification and sequence-to-sequence models for EEG-to-words/sentences tasks. Our experiments show that VAEs have the potential to reconstruct artificial EEG data for augmentation. Meanwhile, our sequence-to-sequence model achieves more promising performance in generating sentences compared to our classification model, though both remain challenging tasks. These findings lay the groundwork for future research on EEG speech perception decoding, with possible extensions to speech production tasks such as silent or imagined speech.

[694] Distinctive Feature Codec: An Adaptive Efficient Speech Representation for Depression Detection

Xiangyu Zhang, Fuming Fang, Peng Gao, Bin Qin, Beena Ahmed, Julien Epps

Main category: eess.AS

TL;DR: DFC is an adaptive speech codec that preserves temporal dynamics by dynamically segmenting audio at acoustic transitions, enabling better depression detection compared to fixed-rate tokenization methods.

Details

Motivation: Current LLM-based speech processing uses fixed-rate tokenization that destroys temporal dynamics, which are crucial biomarkers for clinical applications like depression detection.

Method: DFC learns to dynamically segment speech at perceptually significant acoustic transitions, generating variable-length tokens. Uses Group-wise Scalar Quantization (GSQ) for stable quantization of variable-length segments.

Result: First integration of traditional distinctive features into modern deep learning codec for temporally sensitive tasks. Preserves vital timing information that fixed-rate approaches destroy.

Conclusion: DFC offers a promising alternative to conventional frame-based processing and advances interpretable representation learning in speech depression detection frameworks.

Abstract: Large Language Models (LLMs) have demonstrated remarkable success across diverse fields, establishing a powerful paradigm for complex information processing. This has inspired the integration of speech into LLM frameworks, often by tokenizing continuous audio via neural speech codecs, enabling powerful speech language models. However, this dominant tokenization strategy relies on uniform frame-based processing at fixed time intervals. This fixed-rate approach, while effective for linguistic content, destroys the temporal dynamics. These dynamics are not noise but are established as primary biomarkers in clinical applications such as depression detection. To address this gap, we introduce the Distinctive Feature Codec (DFC), an adaptive framework engineered to preserve this vital timing information. Drawing from linguistic theory, DFC abandons fixed-interval processing and instead learns to dynamically segment the signal at perceptually significant acoustic transitions. This generates variable-length tokens that efficiently encode the temporal structure. As a key contribution, this work is the first to integrate traditional distinctive features into a modern deep learning codec for a temporally sensitive task such as depression detection. We also introduce the Group-wise Scalar Quantization (GSQ) approach to stably quantize these variable-length segments. Our distinctive feature-based approach offers a promising alternative to conventional frame-based processing and advances interpretable representation learning in the modern deep learning speech depression detection framework.

eess.IV

[695] Field strength-dependent performance variability in deep learning-based analysis of magnetic resonance imaging

Muhammad Ibtsaam Qadir, Duane Schonlau, Ulrike Dydak, Fiona R. Kolbinger

Main category: eess.IV

TL;DR: MRI scanner field strength (1.5T vs 3.0T) significantly affects deep learning segmentation performance, especially for soft tissues like breast tumors and pancreas, but less for bony structures like cervical spine.

Details

Motivation: To quantitatively evaluate how MRI scanner magnetic field strength impacts the performance and generalizability of deep learning segmentation algorithms, as field strength variations could be a confounding factor in AI studies.

Method: Used three public MRI datasets (breast tumor, pancreas, cervical spine) stratified by field strength. Developed three nnU-Net models per task: 1.5T-only, 3.0T-only, and combined training. Evaluated cross-field performance and analyzed field-strength effects via UMAP clustering and radiomic feature analysis.

Result: For breast tumor and pancreas segmentation, 3.0T-trained models significantly outperformed 1.5T-trained and combined models on both validation sets. Cervical spine segmentation showed minimal cross-field degradation (DSC>0.92). Radiomic analysis revealed moderate field-strength clustering in soft tissues but minimal separation in osseous structures.

Conclusion: Magnetic field strength substantially influences deep learning segmentation performance, particularly for soft-tissue structures, warranting consideration as a confounding factor in AI studies on MRI data.

Abstract: This study quantitatively evaluates the impact of MRI scanner magnetic field strength on the performance and generalizability of deep learning-based segmentation algorithms. Three publicly available MRI datasets (breast tumor, pancreas, and cervical spine) were stratified by scanner field strength (1.5T vs. 3.0T). For each segmentation task, three nnU-Net-based models were developed: A model trained on 1.5T data only (m-1.5T), a model trained on 3.0T data only (m-3.0T), and a model trained on pooled 1.5T and 3.0T data (m-combined). Each model was evaluated on both 1.5T and 3.0T validation sets. Field-strength-dependent performance differences were investigated via Uniform Manifold Approximation and Projection (UMAP)-based clustering and radiomic analysis, including 23 first-order and texture features. For breast tumor segmentation, m-3.0T (DSC: 0.494 [1.5T] and 0.433 [3.0T]) significantly outperformed m-1.5T (DSC: 0.411 [1.5T] and 0.289 [3.0T]) and m-combined (DSC: 0.373 [1.5T] and 0.268[3.0T]) on both validation sets (p<0.0001). Pancreas segmentation showed similar trends: m-3.0T achieved the highest DSC (0.774 [1.5T], 0.840 [3.0T]), while m-1.5T underperformed significantly (p<0.0001). For cervical spine, models performed optimally on same-field validation sets with minimal cross-field performance degradation (DSC>0.92 for all comparisons). Radiomic analysis revealed moderate field-strength-dependent clustering in soft tissues (silhouette scores 0.23-0.29) but minimal separation in osseous structures (0.12). These results indicate that magnetic field strength in the training data substantially influences the performance of deep learning-based segmentation models, particularly for soft-tissue structures (e.g., small lesions). This warrants consideration of magnetic field strength as a confounding factor in studies evaluating AI performance on MRI.

[696] AI-Enhanced Virtual Biopsies for Brain Tumor Diagnosis in Low Resource Settings

Areeb Ehsan

Main category: eess.IV

TL;DR: Lightweight CNN + radiomics fusion for brain tumor classification in low-resource settings, with explainability features.

Details

Motivation: Address challenges in brain tumor diagnosis in low-resource environments where expert neuroradiology, high-end MRI, and invasive biopsies are limited. Overcome constraints of existing deep learning approaches including computational demands, dataset shift, and limited interpretability.

Method: Two-branch approach: 1) MobileNetV2-based CNN for classification, 2) radiomics branch extracting 8 handcrafted features (shape, intensity statistics, GLCM texture). Late fusion concatenates CNN embeddings with radiomics features, followed by RandomForest classifier. Explainability via Grad-CAM visualizations and radiomics feature importance analysis.

Result: Improved validation performance for fusion approach compared to single-branch baselines. Robustness tests under reduced resolution and additive noise reveal sensitivity relevant to low-resource imaging conditions.

Conclusion: Presents a virtual biopsy pipeline for brain tumor classification that balances performance, computational efficiency, and interpretability for low-resource settings. Framed as decision support tool, not replacement for clinical diagnosis or histopathology.

Abstract: Timely brain tumor diagnosis remains challenging in low-resource clinical environments where expert neuroradiology interpretation, high-end MRI hardware, and invasive biopsy procedures may be limited. Although deep learning has achieved strong performance in brain tumor analysis, real-world adoption is constrained by computational demands, dataset shift across scanners, and limited interpretability. This paper presents a prototype virtual biopsy pipeline for four-class classification of 2D brain MRI images using a lightweight convolutional neural network (CNN) and complementary radiomics-style handcrafted features. A MobileNetV2-based CNN is trained for classification, while an interpretable radiomics branch extracts eight features capturing lesion shape, intensity statistics, and gray-level co-occurrence matrix (GLCM) texture descriptors. A late fusion strategy concatenates CNN embeddings with radiomics features and trains a RandomForest classifier on the fused representation. Explainability is provided via Grad-CAM visualizations and radiomics feature importance analysis. Experiments on a public Kaggle brain tumor MRI dataset show improved validation performance for fusion relative to single-branch baselines, while robustness tests under reduced resolution and additive noise highlight sensitivity relevant to low-resource imaging conditions. The system is framed as decision support and not a substitute for clinical diagnosis or histopathology.

[697] Complex Swin Transformer for Accelerating Enhanced SMWI Reconstruction

Muhammad Usman, Sung-Min Gho

Main category: eess.IV

TL;DR: Proposed complex-valued Swin Transformer network for super-resolution reconstruction of SMWI from reduced k-space data, achieving high-quality images with shorter scan times while preserving diagnostic features for Parkinson’s disease.

Details

Motivation: Full-resolution SMWI acquisition for Parkinson's disease detection requires long scan times, creating a need for efficient reconstruction methods that can generate high-quality SMWI from reduced k-space data while preserving diagnostic relevance.

Method: Developed a complex-valued Swin Transformer based network for super-resolution reconstruction of multi-echo MRI data, reconstructing high-quality SMWI images from low-resolution k-space inputs.

Result: Achieved structural similarity index of 0.9116 and mean squared error of 0.076 when reconstructing SMWI from 256 by 256 k-space data, while maintaining critical diagnostic features for Parkinson’s disease assessment.

Conclusion: The method enables high-quality SMWI reconstruction from reduced k-space sampling, leading to shorter scan times without compromising diagnostic detail, potentially improving clinical applicability of SMWI for Parkinson’s disease and supporting faster neuroimaging workflows.

Abstract: Susceptibility Map Weighted Imaging (SMWI) is an advanced magnetic resonance imaging technique used to detect nigral hyperintensity in Parkinsons disease. However, full resolution SMWI acquisition is limited by long scan times. Efficient reconstruction methods are therefore required to generate high quality SMWI from reduced k space data while preserving diagnostic relevance. In this work, we propose a complex valued Swin Transformer based network for super resolution reconstruction of multi echo MRI data. The proposed method reconstructs high quality SMWI images from low resolution k space inputs. Experimental results demonstrate that the method achieves a structural similarity index of 0.9116 and a mean squared error of 0.076 when reconstructing SMWI from 256 by 256 k space data, while maintaining critical diagnostic features. This approach enables high quality SMWI reconstruction from reduced k space sampling, leading to shorter scan times without compromising diagnostic detail. The proposed method has the potential to improve the clinical applicability of SMWI for Parkinsons disease and support faster and more efficient neuroimaging workflows.

[698] SemCovert: Secure and Covert Video Transmission via Deep Semantic-Level Hiding

Zhihan Cao, Xiao Yang, Gaolei Li, Jun Wu, Jianhua Li, Yuchen Liu

Main category: eess.IV

TL;DR: SemCovert: A deep semantic-level hiding framework for secure and covert video transmission that protects privacy while maintaining transmission efficiency in video semantic communication.

Details

Motivation: Video semantic communication faces privacy leakage challenges; traditional security techniques like steganography and encryption are not robust against semantic-level transformations. Temporal continuity in videos enables statistical modeling that increases risk of exposing hidden content and distributional anomalies.

Method: Proposes SemCovert with co-designed semantic hiding model and secret semantic extractor integrated into semantic communication pipeline. Introduces randomized semantic hiding strategy to break determinism of embedding and create unpredictable distribution patterns.

Result: Effectively mitigates eavesdropping and detection risks while reliably concealing secret videos during transmission. Video quality suffers only minor degradation, preserving transmission fidelity.

Conclusion: SemCovert enables secure and covert transmission without compromising semantic communication performance, addressing privacy challenges in video semantic communication systems.

Abstract: Video semantic communication, praised for its transmission efficiency, still faces critical challenges related to privacy leakage. Traditional security techniques like steganography and encryption are challenging to apply since they are not inherently robust against semantic-level transformations and abstractions. Moreover, the temporal continuity of video enables framewise statistical modeling over extended periods, which increases the risk of exposing distributional anomalies and reconstructing hidden content. To address these challenges, we propose SemCovert, a deep semantic-level hiding framework for secure and covert video transmission. SemCovert introduces a pair of co-designed models, namely the semantic hiding model and the secret semantic extractor, which are seamlessly integrated into the semantic communication pipeline. This design enables authorized receivers to reliably recover hidden information, while keeping it imperceptible to regular users. To further improve resistance to analysis, we introduce a randomized semantic hiding strategy, which breaks the determinism of embedding and introduces unpredictable distribution patterns. The experimental results demonstrate that SemCovert effectively mitigates potential eavesdropping and detection risks while reliably concealing secret videos during transmission. Meanwhile, video quality suffers only minor degradation, preserving transmission fidelity. These results confirm SemCovert’s effectiveness in enabling secure and covert transmission without compromising semantic communication performance.

[699] Super-Resolution Enhancement of Medical Images Based on Diffusion Model: An Optimization Scheme for Low-Resolution Gastric Images

Haozhe Jia

Main category: eess.IV

TL;DR: Diffusion-based super-resolution framework (SR3) enhances low-resolution capsule endoscopy images, outperforming traditional methods with better anatomical fidelity.

Details

Motivation: Capsule endoscopy has low-resolution images due to hardware/power constraints, limiting identification of fine mucosal textures and subtle pathologies needed for early diagnosis.

Method: Adopts SR3 framework based on Denoising Diffusion Probabilistic Models (DDPMs) to learn probabilistic mapping from low to high-resolution images. Uses HyperKvasir dataset for training/evaluation, with architectural enhancements including attention mechanisms.

Result: Significantly outperforms bicubic interpolation and GAN-based methods (ESRGAN). Achieves PSNR 27.5 dB and SSIM 0.65 (baseline), improving to 29.3 dB and 0.71 with enhancements. Better preservation of anatomical boundaries, vascular patterns, and lesion structures.

Conclusion: Diffusion-based super-resolution is promising for enhancing capsule endoscopy imaging, addressing fundamental resolution constraints with improved structural fidelity over GAN approaches.

Abstract: Capsule endoscopy has enabled minimally invasive gastrointestinal imaging, but its clinical utility is limited by the inherently low resolution of captured images due to hardware, power, and transmission constraints. This limitation hampers the identification of fine-grained mucosal textures and subtle pathological features essential for early diagnosis. This work investigates a diffusion-based super-resolution framework to enhance capsule endoscopy images in a data-driven and anatomically consistent manner. We adopt the SR3 (Super-Resolution via Repeated Refinement) framework built upon Denoising Diffusion Probabilistic Models (DDPMs) to learn a probabilistic mapping from low-resolution to high-resolution images. Unlike GAN-based approaches that often suffer from training instability and hallucination artifacts, diffusion models provide stable likelihood-based training and improved structural fidelity. The HyperKvasir dataset, a large-scale publicly available gastrointestinal endoscopy dataset, is used for training and evaluation. Quantitative results demonstrate that the proposed method significantly outperforms bicubic interpolation and GAN-based super-resolution methods such as ESRGAN, achieving PSNR of 27.5 dB and SSIM of 0.65 for a baseline model, and improving to 29.3 dB and 0.71 with architectural enhancements including attention mechanisms. Qualitative results show improved preservation of anatomical boundaries, vascular patterns, and lesion structures. These findings indicate that diffusion-based super-resolution is a promising approach for enhancing non-invasive medical imaging, particularly in capsule endoscopy where image resolution is fundamentally constrained.

[700] MEGA-PCC: A Mamba-based Efficient Approach for Joint Geometry and Attribute Point Cloud Compression

Kai-Hsiang Hsieh, Monyneath Yim, Wen-Hsiao Peng, Jui-Chiu Chiang

Main category: eess.IV

TL;DR: MEGA-PCC is an end-to-end learning-based framework for joint compression of point cloud geometry and attributes using Mamba architecture, eliminating manual bitrate tuning and recoloring procedures.

Details

Motivation: Existing point cloud compression methods rely on post-hoc recoloring and manually tuned bitrate allocation between geometry and attributes, which hinders end-to-end optimization and increases system complexity.

Method: Two specialized models: 1) Main compression model with shared encoder for unified latent representation and dual decoders for sequential geometry/attribute reconstruction, 2) Mamba-based Entropy Model (MEM) for enhanced entropy coding by capturing spatial and channel-wise correlations. Both use Mamba architecture for long-range dependency modeling.

Result: MEGA-PCC achieves superior rate-distortion performance and runtime efficiency compared to both traditional and learning-based baselines.

Conclusion: The framework offers a powerful AI-driven solution for point cloud compression by enabling data-driven bitrate allocation during training and simplifying the overall pipeline.

Abstract: Joint compression of point cloud geometry and attributes is essential for efficient 3D data representation. Existing methods often rely on post-hoc recoloring procedures and manually tuned bitrate allocation between geometry and attribute bitstreams in inference, which hinders end-to-end optimization and increases system complexity. To overcome these limitations, we propose MEGA-PCC, a fully end-to-end, learning-based framework featuring two specialized models for joint compression. The main compression model employs a shared encoder that encodes both geometry and attribute information into a unified latent representation, followed by dual decoders that sequentially reconstruct geometry and then attributes. Complementing this, the Mamba-based Entropy Model (MEM) enhances entropy coding by capturing spatial and channel-wise correlations to improve probability estimation. Both models are built on the Mamba architecture to effectively model long-range dependencies and rich contextual features. By eliminating the need for recoloring and heuristic bitrate tuning, MEGA-PCC enables data-driven bitrate allocation during training and simplifies the overall pipeline. Extensive experiments demonstrate that MEGA-PCC achieves superior rate-distortion performance and runtime efficiency compared to both traditional and learning-based baselines, offering a powerful solution for AI-driven point cloud compression.

[701] Semantic contrastive learning for orthogonal X-ray computed tomography reconstruction

Jiashu Dong, Jiabing Xiang, Lisheng Geng, Suqing Tian, Wei Zhao

Main category: eess.IV

TL;DR: Proposed semantic feature contrastive learning loss for sparse-view CT reconstruction, using three-stage U-Net architecture to reduce artifacts and improve image quality.

Details

Motivation: Sparse-view CT reconstruction reduces radiation dose but suffers from severe streak artifacts due to ill-posed conditions. Existing deep learning methods still face challenges in reconstruction quality.

Method: Novel semantic feature contrastive learning loss that evaluates semantic similarity in high-level latent spaces and anatomical similarity in shallow latent spaces. Uses three-stage U-Net architecture: coarse reconstruction, detail refinement, and semantic similarity measurement.

Result: Superior reconstruction quality and faster processing on chest dataset with orthogonal projections compared to other algorithms. Significant improvements in image quality while maintaining low computational complexity.

Conclusion: The proposed method provides a practical solution for orthogonal CT reconstruction, effectively addressing artifact reduction while maintaining computational efficiency.

Abstract: X-ray computed tomography (CT) is widely used in medical imaging, with sparse-view reconstruction offering an effective way to reduce radiation dose. However, ill-posed conditions often result in severe streak artifacts. Recent advances in deep learning-based methods have improved reconstruction quality, but challenges still remain. To address these challenges, we propose a novel semantic feature contrastive learning loss function that evaluates semantic similarity in high-level latent spaces and anatomical similarity in shallow latent spaces. Our approach utilizes a three-stage U-Net-based architecture: one for coarse reconstruction, one for detail refinement, and one for semantic similarity measurement. Tests on a chest dataset with orthogonal projections demonstrate that our method achieves superior reconstruction quality and faster processing compared to other algorithms. The results show significant improvements in image quality while maintaining low computational complexity, making it a practical solution for orthogonal CT reconstruction.

[702] SwinCCIR: An end-to-end deep network for Compton camera imaging reconstruction

Minghao Dong, Xinyang Luo, Xujian Ouyang, Yongshun Xiao

Main category: eess.IV

TL;DR: SwinCCIR is an end-to-end deep learning framework using swin-transformer blocks and transposed convolution for Compton camera imaging, overcoming artifacts and systematic errors from traditional back-projection methods.

Details

Motivation: Compton cameras suffer from severe artifacts and deformation due to back-projection reconstruction, and systematic errors from device performance that are hard to remove through calibration, leading to poor imaging quality.

Method: Proposed SwinCCIR, an end-to-end deep learning framework using swin-transformer blocks and transposed convolution-based image generation module to establish direct relationship between list-mode events and radioactive source distribution.

Result: SwinCCIR was trained and validated on both simulated and practical datasets, effectively overcoming problems of conventional Compton camera imaging and showing promise for practical applications.

Conclusion: The proposed SwinCCIR framework successfully addresses fundamental limitations of Compton camera reconstruction and is expected to be implemented in practical applications.

Abstract: Compton cameras (CCs) are a kind of gamma cameras which are designed to determine the directions of incident gammas based on the Compton scatter. However, the reconstruction of CCs face problems of severe artifacts and deformation due to the fundamental reconstruction principle of back-projection of Compton cones. Besides, a part of systematic errors originated from the performance of devices are hard to remove through calibration, leading to deterioration of imaging quality. Iterative algorithms and deep-learning based methods have been widely used to improve reconstruction. But most of them are optimization based on the results of back-projection. Therefore, we proposed an end-to-end deep learning framework, SwinCCIR, for CC imaging. Through adopting swin-transformer blocks and a transposed convolution-based image generation module, we established the relationship between the list-mode events and the radioactive source distribution. SwinCCIR was trained and validated on both simulated and practical dataset. The experimental results indicate that SwinCCIR effectively overcomes problems of conventional CC imaging, which are expected to be implemented in practical applications.

[703] EIR: Enhanced Image Representations for Medical Report Generation

Qiang Sun, Zongcheng Ji, Yinlong Xiao, Peng Chang, Jun Yu

Main category: eess.IV

TL;DR: EIR method improves chest X-ray report generation by addressing information asymmetry between visual and metadata representations and bridging domain gap through medical domain pre-training.

Details

Motivation: Medical report generation from chest X-rays is critical but time-consuming, especially in emergencies. Existing methods suffer from information asymmetry when integrating visual and metadata representations, and use natural domain pre-trained models that create domain gaps with medical images.

Method: Proposes Enhanced Image Representations (EIR) with two key innovations: 1) Cross-modal transformers to fuse metadata representations with image representations, addressing information asymmetry; 2) Medical domain pre-trained models to encode medical images, bridging the domain gap between general and medical images.

Result: Experimental results on widely used MIMIC and Open-I datasets demonstrate the effectiveness of the proposed method in generating accurate chest X-ray reports.

Conclusion: The EIR approach successfully addresses key limitations in automatic medical report generation by solving information asymmetry through cross-modal transformers and bridging domain gaps with medical domain pre-training, leading to improved report accuracy.

Abstract: Generating medical reports from chest X-ray images is a critical and time-consuming task for radiologists, especially in emergencies. To alleviate the stress on radiologists and reduce the risk of misdiagnosis, numerous research efforts have been dedicated to automatic medical report generation in recent years. Most recent studies have developed methods that represent images by utilizing various medical metadata, such as the clinical document history of the current patient and the medical graphs constructed from retrieved reports of other similar patients. However, all existing methods integrate additional metadata representations with visual representations through a simple “Add and LayerNorm” operation, which suffers from the information asymmetry problem due to the distinct distributions between them. In addition, chest X-ray images are usually represented using pre-trained models based on natural domain images, which exhibit an obvious domain gap between general and medical domain images. To this end, we propose a novel approach called Enhanced Image Representations (EIR) for generating accurate chest X-ray reports. We utilize cross-modal transformers to fuse metadata representations with image representations, thereby effectively addressing the information asymmetry problem between them, and we leverage medical domain pre-trained models to encode medical images, effectively bridging the domain gap for image representation. Experimental results on the widely used MIMIC and Open-I datasets demonstrate the effectiveness of our proposed method.

[704] NLCG-Net: A Model-Based Zero-Shot Learning Framework for Undersampled Quantitative MRI Reconstruction

Xinrui Jiang, Yohan Jun, Jaejin Cho, Mengze Gao, Xingwang Yong, Berkin Bilgic

Main category: eess.IV

TL;DR: NLCG-Net is a model-based deep learning framework for joint T2/T1 estimation that directly reconstructs qMRI maps from undersampled k-space using scan-specific neural network regularization, outperforming traditional two-step methods at high acceleration factors.

Details

Motivation: Traditional qMRI methods use a two-step pipeline (image reconstruction then model fitting) that suffers from biases and error propagation, especially with undersampled k-space data. There's a need for more robust joint estimation methods that can handle high acceleration factors.

Method: NLCG-Net combines model-based nonlinear conjugate gradient optimization with a U-Net regularizer trained in a scan-specific, zero-shot fashion. It directly estimates T2/T1 maps from undersampled k-space using mono-exponential signal modeling with neural network regularization.

Result: Experimental results show NLCG-Net improves estimation quality over subspace reconstruction methods, particularly at high acceleration factors, enabling high-fidelity T1 and T2 mapping from undersampled data.

Conclusion: The proposed NLCG-Net framework successfully addresses limitations of traditional two-step qMRI pipelines by enabling direct, high-quality T2/T1 estimation from undersampled k-space data through scan-specific neural network regularization.

Abstract: Typical quantitative MRI (qMRI) methods estimate parameter maps in a two-step pipeline that first reconstructs images from undersampled k-space data and then performs model fitting, which is prone to biases and error propagation. We propose NLCG-Net, a model-based nonlinear conjugate gradient (NLCG) framework for joint T2/T1 estimation that incorporates a U-Net regularizer trained in a scan-specific, zero-shot fashion. The method directly estimates qMRI maps from undersampled k-space using mono-exponential signal modeling with scan-specific neural network regularization, enabling high-fidelity T1 and T2 mapping. Experimental results on T2 and T1 mapping demonstrate that NLCG-Net improves estimation quality over subspace reconstruction at high acceleration factors.

[705] Image and Video Quality Assessment using Prompt-Guided Latent Diffusion Models for Cross-Dataset Generalization

Shankhanil Mitra, Diptanu De, Shika Rao, Rajiv Soundararajan

Main category: eess.IV

TL;DR: The paper proposes a novel quality assessment method using diffusion models with learnable quality-aware text prompts and cross-attention maps, achieving superior generalization across diverse image and video datasets.

Details

Motivation: Current quality assessment methods have limited generalization across diverse image/video datasets with distribution shifts. The authors aim to develop a more generalized approach that works well across different types of visual content.

Method: 1) Leverage diffusion model denoising process for quality assessment by aligning learnable quality-aware text prompts with images/video frames. 2) Extract cross-attention maps from intermediate layers of latent diffusion models to capture quality-aware representations. 3) For videos, use frame-rate sub-sampling to reduce computation, and introduce a temporal quality modulator to compensate for lost motion information.

Result: Extensive cross-database experiments across user-generated, synthetic, low-light, frame-rate variation, ultra high definition, and streaming content databases show superior generalization performance in both image and video quality assessment.

Conclusion: The proposed diffusion-based approach with quality-aware text prompts and temporal modulation effectively addresses generalization challenges in quality assessment, outperforming state-of-the-art methods across diverse datasets.

Abstract: The design of image and video quality assessment (QA) algorithms is extremely important to benchmark and calibrate user experience in modern visual systems. A major drawback of the state-of-the-art QA methods is their limited ability to generalize across diverse image and video datasets with reasonable distribution shifts. In this work, we leverage the denoising process of diffusion models for generalized image QA (IQA) and video QA (VQA) by understanding the degree of alignment between learnable quality-aware text prompts and images or video frames. In particular, we learn cross-attention maps from intermediate layers of the denoiser of latent diffusion models (LDMs) to capture quality-aware representations of images or video frames. Since applying text-to-image LDMs for every video frame is computationally expensive for videos, we only estimate the quality of a frame-rate sub-sampled version of the original video. To compensate for the loss in motion information due to frame-rate sub-sampling, we propose a novel temporal quality modulator. Our extensive cross-database experiments across various user-generated, synthetic, low-light, frame-rate variation, ultra high definition, and streaming content-based databases show that our model can achieve superior generalization in both IQA and VQA.

[706] Re-Visible Dual-Domain Self-Supervised Deep Unfolding Network for MRI Reconstruction

Hao Zhang, Qi Wang, Jian Sun, Zhijie Wen, Jun Shi, Shihui Ying

Main category: eess.IV

TL;DR: Proposes a self-supervised deep unfolding network for accelerated MRI reconstruction using only under-sampled data, avoiding the need for fully-sampled training datasets.

Details

Motivation: MRI acquisition is slow, and supervised deep learning methods require expensive fully-sampled datasets. Self-supervised methods exist but lose information by partitioning under-sampled data and don't fully incorporate image priors.

Method: Re-visible dual-domain self-supervised deep unfolding network with re-visible dual-domain loss to use all under-sampled k-space data. Uses DUN-CP-PPA based on Chambolle and Pock Proximal Point Algorithm with Spatial-Frequency Feature Extraction blocks to capture global/local features.

Result: Significantly outperforms state-of-the-art approaches on fastMRI and IXI datasets in terms of reconstruction performance.

Conclusion: The proposed method effectively addresses limitations of existing self-supervised MRI reconstruction by utilizing all under-sampled data and incorporating comprehensive image priors through a deep unfolding architecture.

Abstract: Magnetic Resonance Imaging (MRI) is widely used in clinical practice, but suffered from prolonged acquisition time. Although deep learning methods have been proposed to accelerate acquisition and demonstrate promising performance, they rely on high-quality fully-sampled datasets for training in a supervised manner. However, such datasets are time-consuming and expensive-to-collect, which constrains their broader applications. On the other hand, self-supervised methods offer an alternative by enabling learning from under-sampled data alone, but most existing methods rely on further partitioned under-sampled k-space data as model’s input for training, resulting in a loss of valuable information. Additionally, their models have not fully incorporated image priors, leading to degraded reconstruction performance. In this paper, we propose a novel re-visible dual-domain self-supervised deep unfolding network to address these issues when only under-sampled datasets are available. Specifically, by incorporating re-visible dual-domain loss, all under-sampled k-space data are utilized during training to mitigate information loss caused by further partitioning. This design enables the model to implicitly adapt to all under-sampled k-space data as input. Additionally, we design a deep unfolding network based on Chambolle and Pock Proximal Point Algorithm (DUN-CP-PPA) to achieve end-to-end reconstruction, incorporating imaging physics and image priors to guide the reconstruction process. By employing a Spatial-Frequency Feature Extraction (SFFE) block to capture global and local feature representation, we enhance the model’s efficiency to learn comprehensive image priors. Experiments conducted on the fastMRI and IXI datasets demonstrate that our method significantly outperforms state-of-the-art approaches in terms of reconstruction performance.

[707] UltraBoneUDF: Self-supervised Bone Surface Reconstruction from Ultrasound Based on Neural Unsigned Distance Functions

Luohong Wu, Matthias Seibold, Nicola A. Cavalcanti, Giuseppe Loggia, Lisa Reissner, Bastian Sigrist, Jonas Hein, Lilian Calvet, Arnd Viehöfer, Philipp Fürnstahl

Main category: eess.IV

TL;DR: UltraBoneUDF: A self-supervised framework for reconstructing open bone surfaces from 3D ultrasound data using unsigned distance functions and local tangent plane optimization.

Details

Motivation: Ultrasound offers radiation-free, cost-effective bone imaging for orthopedic surgery but captures only partial bone surfaces with variability. Existing methods struggle with this challenging data, creating reconstruction errors and artifacts.

Method: Self-supervised framework that learns unsigned distance functions (UDFs) from 3D ultrasound data with a novel loss function based on local tangent plane optimization to improve surface reconstruction quality.

Result: Achieves comparable or lower bi-directional Chamfer distance across three datasets: 1.60mm on UltraBones100k (25.5% improvement), 0.21mm on OpenBoneCT, and 0.18mm on ClosedBoneCT, with fewer parameters than competing methods.

Conclusion: UltraBoneUDF effectively addresses the challenge of reconstructing open bone surfaces from real-world ultrasound data, outperforming state-of-the-art methods and demonstrating practical value for computer-assisted orthopedic surgery.

Abstract: Bone surface reconstruction is an essential component of computer-assisted orthopedic surgery(CAOS), forming the foundation for both preoperative planning and intraoperative guidance. Compared to traditional imaging modalities such as computed tomography (CT) and magnetic resonance imaging (MRI),ultrasound, an emerging CAOS technology, provides a radiation-free, cost-effective, and portable alternative. While ultrasound offers new opportunities in CAOS, technical shortcomings continue to hinder its translation into surgery. In particular, due to the inherent limitations of ultrasound imaging, B-mode ultrasound typically captures only partial bone surfaces. The inter- and intra-operator variability in ultrasound scanning further increases the complexity of the data. Existing reconstruction methods struggle with such challenging data, leading to increased reconstruction errors and artifacts, such as holes and inflated structures. Effective techniques for accurately reconstructing open bone surfaces from real-world 3D ultrasound volumes remain lacking. We propose UltraBoneUDF, a self-supervised framework specifically designed for reconstructing open bone surfaces from ultrasound data. It learns unsigned distance functions (UDFs) from 3D ultrasound data. In addition, we present a novel loss function based on local tangent plane optimization that substantially improves surface reconstruction quality. UltraBoneUDF and competing models are benchmarked on three open-source datasets and further evaluated through ablation studies. Qualitative results demonstrate the limitations of the state-of-the-art methods. Quantitatively, UltraBoneUDF achieves comparable or lower bi-directional Chamfer distance across three datasets with fewer parameters: 1.60 mm on the UltraBones100k dataset (~25.5% improvement), 0.21 mm on the OpenBoneCT dataset, and 0.18 mm on the ClosedBoneCT dataset.

[708] Optimization of Fractal Image Compression

Nastaran Pourshab Mohsen Bagheritabar

Main category: eess.IV

TL;DR: The paper proposes optimization techniques for Fractal Image Compression (FIC) using a Box Counting Method to improve compression ratio and reduce computational time.

Details

Motivation: Fractal Image Compression achieves high compression ratios but suffers from computationally expensive compression processes, creating a need for optimization techniques to improve efficiency.

Method: The paper explores a novel Box Counting Method for estimating fractal dimensions, which is simpler to integrate into FIC compared to other algorithms, focusing on optimization techniques to increase compression ratio and reduce computational time.

Result: Implementing these optimization techniques enhances both the compression ratio and the compression time of Fractal Image Compression.

Conclusion: The proposed Box Counting Method provides an effective optimization approach for FIC, improving both compression efficiency and computational performance.

Abstract: Fractal Image Compression (FIC) is a lossy image compression technique that leverages self-similarity within an image to achieve high compression ratios. However, the process of compressing the image is computationally expensive. This paper investigates optimization techniques to improve the efficiency of FIC, focusing on increasing compression ratio and reducing computational time. The paper explores a novel approach named the Box Counting Method for estimating fractal dimensions, which is very simple to integrate into FIC compared to other algorithms. The results show that implementing these optimization techniques enhances both the compression ratio and the compression time.

[709] Resource-efficient medical image classification for edge devices

Mahsa Lavaei, Zahra Abadi, Salar Beigzad, Alireza Maleki

Main category: eess.IV

TL;DR: This paper investigates model quantization techniques for deploying medical image classification models on resource-constrained edge devices, achieving reduced computational overhead and memory requirements while maintaining diagnostic accuracy.

Details

Motivation: Medical image classification is crucial for healthcare diagnosis, but deploying deep learning models on edge devices is challenging due to computational and memory limitations. There's a need for resource-efficient approaches to enable AI-driven diagnostics in remote and resource-limited settings.

Method: The study employs model quantization techniques, specifically focusing on quantization-aware training (QAT) and post-training quantization (PTQ) methods optimized for edge devices. These techniques reduce the precision of model parameters and activations to lower computational overhead.

Result: Quantized models achieve substantial reductions in model size and inference latency, enabling real-time processing on edge hardware while maintaining clinically acceptable diagnostic accuracy across medical imaging datasets.

Conclusion: This work provides a practical pathway for deploying AI-driven medical diagnostics in remote and resource-limited settings, enhancing the accessibility and scalability of healthcare technologies through efficient edge deployment of medical image classification models.

Abstract: Medical image classification is a critical task in healthcare, enabling accurate and timely diagnosis. However, deploying deep learning models on resource-constrained edge devices presents significant challenges due to computational and memory limitations. This research investigates a resource-efficient approach to medical image classification by employing model quantization techniques. Quantization reduces the precision of model parameters and activations, significantly lowering computational overhead and memory requirements without sacrificing classification accuracy. The study focuses on the optimization of quantization-aware training (QAT) and post-training quantization (PTQ) methods tailored for edge devices, analyzing their impact on model performance across medical imaging datasets. Experimental results demonstrate that quantized models achieve substantial reductions in model size and inference latency, enabling real-time processing on edge hardware while maintaining clinically acceptable diagnostic accuracy. This work provides a practical pathway for deploying AI-driven medical diagnostics in remote and resource-limited settings, enhancing the accessibility and scalability of healthcare technologies.

[710] CLIP Based Region-Aware Feature Fusion for Automated BBPS Scoring in Colonoscopy Images

Yujia Fu, Zhiyu Dong, Tianwen Qian, Chenye Zheng, Danian Ji, Linhai Zhuo

Main category: eess.IV

TL;DR: Automated Boston Bowel Preparation Scale scoring using CLIP with adapter-based transfer learning and fecal-feature extraction, outperforming baselines on new dataset and public NERTHU dataset.

Details

Motivation: Manual BBPS scoring suffers from subjectivity and inter-observer variability, requiring automated solutions for consistent bowel cleanliness assessment in colonoscopy.

Method: Proposes a novel framework using CLIP model with adapter-based transfer learning and dedicated fecal-feature extraction branch, fusing global visual features with stool-related textual priors without explicit segmentation.

Result: Extensive experiments on new dataset (2,240 images from 517 subjects) and public NERTHU dataset demonstrate superiority over existing baselines.

Conclusion: The approach shows potential for clinical deployment in computer-aided colonoscopy analysis by providing accurate, automated bowel cleanliness assessment.

Abstract: Accurate assessment of bowel cleanliness is essential for effective colonoscopy procedures. The Boston Bowel Preparation Scale (BBPS) offers a standardized scoring system but suffers from subjectivity and inter-observer variability when performed manually. In this paper, to support robust training and evaluation, we construct a high-quality colonoscopy dataset comprising 2,240 images from 517 subjects, annotated with expert-agreed BBPS scores. We propose a novel automated BBPS scoring framework that leverages the CLIP model with adapter-based transfer learning and a dedicated fecal-feature extraction branch. Our method fuses global visual features with stool-related textual priors to improve the accuracy of bowel cleanliness evaluation without requiring explicit segmentation. Extensive experiments on both our dataset and the public NERTHU dataset demonstrate the superiority of our approach over existing baselines, highlighting its potential for clinical deployment in computer-aided colonoscopy analysis.

Today’s Research Highlights

Table of Contents

cs.CL

[1] Open-Source Multimodal Moxin Models with Moxin-VLM and Moxin-VLA

[2] Hierarchical Geometry of Cognitive States in Transformer Embedding Spaces

[3] SmartSnap: Proactive Evidence Seeking for Self-Verifying Agents

[4] The Syntax of qulk-clauses in Yemeni Ibbi Arabic: A Minimalist Approach

[5] Towards Efficient Post-Training via Fourier-Driven Adapter Architectures

[6] LLM-Guided Exemplar Selection for Few-Shot Wearable-Sensor Human Activity Recognition

[7] Dub-S2ST: Textless Speech-to-Speech Translation for Seamless Dubbing

[8] Style Amnesia: Investigating Speaking Style Degradation and Mitigation in Multi-Turn Spoken Language Models

[9] Hallucination Detection and Evaluation of Large Language Model

[10] PROFASR-BENCH: A Benchmark for Context-Conditioned ASR in High-Stakes Professional Speech

[11] HiFi-RAG: Hierarchical Content Filtering and Two-Pass Generation for Open-Domain RAG

[12] Exploring the Vertical-Domain Reasoning Capabilities of Large Language Models

[13] Fun-Audio-Chat Technical Report

[14] Constituency Structure over Eojeol in Korean Treebanks

[15] ManchuTTS: Towards High-Quality Manchu Speech Synthesis via Flow Matching and Hierarchical Text Representation

[16] Learning When Not to Attend Globally

[17] Structured Prompting and LLM Ensembling for Multimodal Conversational Aspect-based Sentiment Analysis

[18] Chain-of-thought Reviewing and Correction for Time Series Question Answering

[19] M2G-Eval: Enhancing and Evaluating Multi-granularity Multilingual Code Generation

[20] On the Role of Discreteness in Diffusion LLMs

[21] Evaluating GRPO and DPO for Faithful Chain-of-Thought Reasoning in LLMs

[22] Fragile Knowledge, Robust Instruction-Following: The Width Pruning Dichotomy in Llama-3.2

[23] Conformal Prediction Sets for Next-Token Prediction in Large Language Models: Balancing Coverage Guarantees with Set Efficiency

[24] GHaLIB: A Multilingual Framework for Hope Speech Detection in Low-Resource Languages

[25] Nested Browser-Use Learning for Agentic Information Seeking

[26] Beg to Differ: Understanding Reasoning-Answer Misalignment Across Languages

[27] Mitigating Social Desirability Bias in Random Silicon Sampling

[28] Data Augmentation for Classification of Negative Pregnancy Outcomes in Imbalanced Data

[29] WeDLM: Reconciling Diffusion Language Models with Standard Causal Attention for Fast Inference

[30] Harnessing Large Language Models for Biomedical Named Entity Recognition

[31] Text-Routed Sparse Mixture-of-Experts Model with Explanation and Temporal Alignment for Multi-Modal Sentiment Analysis

[32] Fake News Classification in Urdu: A Domain Adaptation Approach for a Low-Resource Language

[33] CNSight: Evaluation of Clinical Note Segmentation Tools

[34] NepEMO: A Multi-Label Emotion and Sentiment Analysis on Nepali Reddit with Linguistic Insights and Temporal Trends

[35] AutoForge: Automated Environment Synthesis for Agentic Reinforcement Learning

[36] Diversity or Precision? A Deep Dive into Next Token Prediction

[37] Prompt engineering does not universally improve Large Language Model performance across clinical decision-making tasks

[38] Improving Generalization in LLM Structured Pruning via Function-Aware Neuron Grouping

[39] LENS: LLM-Enabled Narrative Synthesis for Mental Health by Aligning Multimodal Sensing with Language Models

[40] Is Chain-of-Thought Really Not Explainability? Chain-of-Thought Can Be Faithful without Hint Verbalization

[41] Accelerating Language Model Workflows with Prompt Choreography

[42] TabiBERT: A Large-Scale ModernBERT Foundation Model and Unified Benchmarking Framework for Turkish

[43] Reservoir Computing inspired Matrix Multiplication-free Language Model

[44] Not too long do read: Evaluating LLM-generated extreme scientific summaries

[45] Scoring, Reasoning, and Selecting the Best! Ensembling Large Language Models via a Peer-Review Process

[46] Anka: A Domain-Specific Language for Reliable LLM Code Generation

[47] Interpretable Safety Alignment via SAE-Constructed Low-Rank Subspace Adaptation

[48] Chinese Morph Resolution in E-commerce Live Streaming Scenarios

[49] AI4Reading: Chinese Audiobook Interpretation System Based on Multi-Agent Collaboration

[50] AI Meets Brain: Memory Systems from Cognitive Neuroscience to Autonomous Agents

[51] A Stepwise-Enhanced Reasoning Framework for Large Language Models Based on External Subgraph Generation

[52] Entropy-Guided Token Dropout: Training Autoregressive Language Models with Limited Domain Data

[53] The Effect of Gender Diversity on Scientific Team Impact: A Team Roles Perspective

[54] C2PO: Diagnosing and Disentangling Bias Shortcuts in LLMs

[55] ClinDEF: A Dynamic Evaluation Framework for Large Language Models in Clinical Reasoning

[56] Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss

[57] Semantic Tree Inference on Text Corpa using a Nested Density Approach together with Large Language Model Embeddings

[58] Automatic Detection of Complex Quotation Patterns in Aggadic Literature

[59] UniHetero: Could Generation Enhance Understanding for Vision-Language-Model at Large Data Scale?

[60] Single LLM Debate, MoLaCE: Mixture of Latent Concept Experts Against Confirmation Bias

[61] Lie to Me: Knowledge Graphs for Robust Hallucination Self-Detection in LLMs

[62] Instruction-Following Evaluation of Large Vision-Language Models

[63] Close the Loop: Synthesizing Infinite Tool-Use Data via Multi-Agent Role-Playing

[64] A Dataset and Benchmark for Consumer Healthcare Question Summarization

[65] Less is more: Probabilistic reduction is best explained by small-scale predictability measures

[66] Multilingual Hidden Prompt Injection Attacks on LLM-Based Academic Reviewing

[67] Fine-Tuning LLMs with Fine-Grained Human Feedback on Text Spans

[68] Eliciting Behaviors in Multi-Turn Conversations

[69] Vision Enhancing LLMs: Empowering Multimodal Knowledge Storage and Sharing in LLMs

[70] Patience Is The Key to Large Language Model Reasoning

[71] The Heap: A Contamination-Free Multilingual Code Dataset for Evaluating Large Language Models

[72] Topic-FlipRAG: Topic-Orientated Adversarial Opinion Manipulation Attacks to Retrieval-Augmented Generation Models

[73] SelfCheck-Eval: A Multi-Module Framework for Zero-Resource Hallucination Detection in Large Language Models

[74] Atom of Thoughts for Markov LLM Test-Time Scaling

[75] Who Writes What: Unveiling the Impact of Author Roles on AI-generated Text Detection

[76] Forecasting Clinical Risk from Textual Time Series: Structuring Narratives for Temporal AI in Healthcare

[77] Analyzing Cognitive Differences Among Large Language Models through the Lens of Social Worldview