Daily arXiv Papers - 2025-11-20

AI-enhanced summaries of research papers from arXiv

Today’s Research Highlights

Table of Contents

cs.CL

[1] Test-time Scaling of LLMs: A Survey from A Subproblem Structure Perspective

Zhuoyi Yang, Xu Guo, Tong Zhang, Huijuan Xu, Boyang Li

Main category: cs.CL

TL;DR: Survey of test-time scaling methods that improve LLM accuracy by allocating additional compute at inference, focusing on problem decomposition and topological organization.

Motivation: To improve predictive accuracy of pretrained large language models by strategically allocating additional computational resources during inference time rather than just during training.

Method: Categorizes test-time scaling methods based on how problems are decomposed into subproblems and their topological organization (sequential, parallel, or tree-structured), unifying approaches like Chain-of-Thought, Branch-Solve-Merge, and Tree-of-Thought.

Result: Provides a unified framework for understanding diverse test-time scaling techniques and synthesizes existing analyses of their respective strengths and weaknesses.

Conclusion: Outlines promising directions for future research in test-time scaling methods for improving LLM performance.

Abstract: With this paper, we survey techniques for improving the predictive accuracy of pretrained large language models by allocating additional compute at inference time. In categorizing test-time scaling methods, we place special emphasis on how a problem is decomposed into subproblems and on the topological organization of these subproblems, whether sequential, parallel, or tree-structured. This perspective allows us to unify diverse approaches such as Chain-of-Thought, Branch-Solve-Merge, and Tree-of-Thought under a common lens. We further synthesize existing analyses of these techniques, highlighting their respective strengths and weaknesses, and conclude by outlining promising directions for future research.
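
The three subproblem topologies named in the abstract can be illustrated with a toy sketch. This is not the survey's code: the function names `run_sequential`, `run_parallel`, and `run_tree` are assumptions made for illustration, and plain Python callables stand in for LLM calls.

```python
from collections import Counter


def run_sequential(state, steps):
    """Sequential topology: each subproblem consumes the previous result."""
    for step in steps:
        state = step(state)
    return state


def run_parallel(question, solvers):
    """Parallel topology: solve independently, then merge by majority vote."""
    answers = [solve(question) for solve in solvers]
    return Counter(answers).most_common(1)[0][0]


def run_tree(root, expand, score, beam_width=2, depth=3):
    """Tree topology: expand candidates level by level, keeping the best few."""
    frontier = [root]
    for _ in range(depth):
        candidates = [child for node in frontier for child in expand(node)]
        if not candidates:
            break
        # Keep only the highest-scoring partial solutions (a simple beam).
        frontier = sorted(candidates, key=score, reverse=True)[:beam_width]
    return max(frontier, key=score)
```

Majority voting and beam pruning are only two of many possible merge and search rules; they are used here because they are short to state.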

[2] Temporal Predictors of Outcome in Reasoning Language Models

Joey David

Main category: cs.CL

TL;DR: LLMs internally commit to final answers very early in reasoning chains, with correctness predictable after just a few tokens, even for complex problems requiring longer reasoning.

Motivation: To understand when LLMs internally commit to final outcomes during chain-of-thought reasoning, and investigate the relationship between reasoning length and problem difficulty.

Method: Trains linear classifiers on hidden states after the first t reasoning tokens to predict eventual correctness, and analyzes predictive accuracy across question difficulties.

Result: Eventual correctness is highly predictable after only a few reasoning tokens. For harder questions, predictive accuracy drops, which points to a selection artifact: hard items are disproportionately represented in long reasoning chains.

Conclusion: LLMs develop internal self-assessment of success very early in reasoning, with implications for model interpretability and real-time inference control.

Abstract: The chain-of-thought (CoT) paradigm uses the elicitation of step-by-step rationales as a proxy for reasoning, gradually refining the model’s latent representation of a solution. However, it remains unclear just how early a Large Language Model (LLM) internally commits to an eventual outcome. We probe this by training linear classifiers on hidden states after the first t reasoning tokens, showing that eventual correctness is highly predictable after only a few tokens, even when longer outputs are needed to reach a definite answer. We show that, for harder questions, a drop in predictive accuracy highlights a selection artifact: hard items are disproportionately represented in long CoTs. Overall, our results imply that for reasoning models, internal self-assessment of success tends to emerge after only a few tokens, with implications for interpretability and for inference-time control.
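
A linear probe of the kind described above can be sketched as a logistic regression over fixed feature vectors. The sketch below is an illustration under stated assumptions, not the paper's setup: the "hidden states" are synthetic Gaussian vectors from the hypothetical helper `make_synthetic_states`, where one coordinate carries the correctness signal, and the probe is trained with plain gradient descent.

```python
import math
import random


def make_synthetic_states(n, seed, dim=8, shift=2.0):
    """Hypothetical stand-in for hidden states after the first t tokens."""
    rng = random.Random(seed)
    features, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)  # 1 = answer will eventually be correct
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x[0] += shift if y == 1 else -shift  # signal lives in coordinate 0
        features.append(x)
        labels.append(y)
    return features, labels


def train_linear_probe(features, labels, lr=0.5, epochs=200):
    """Fit logistic-regression weights with full-batch gradient descent."""
    dim, n = len(features[0]), len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            z = max(-30.0, min(30.0, z))  # keep exp() well-behaved
            err = 1.0 / (1.0 + math.exp(-z)) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_accuracy(w, b, features, labels):
    """Fraction of examples where the probe's sign matches the label."""
    hits = 0
    for x, y in zip(features, labels):
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        hits += int((z > 0) == (y == 1))
    return hits / len(labels)
```

In a real experiment one probe would be trained per token position t, and accuracy plotted against t to see how early correctness becomes decodable.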

[3] LiveCLKTBench: Towards Reliable Evaluation of Cross-Lingual Knowledge Transfer in Multilingual LLMs

Pei-Fu Guo, Yun-Da Tsai, Chun-Chia Hsu, Kai-Xin Chen, Ya-An Tsai, Kai-Wei Chang, Nanyun Peng, Mi-Yen Yeh, Shou-De Lin

Main category: cs.CL

TL;DR: LiveCLKTBench is an automated pipeline that isolates cross-lingual knowledge transfer in LLMs using time-sensitive knowledge entities, revealing that transfer is strongly influenced by linguistic distance and is often asymmetric across language directions.

Motivation: To address the challenge of distinguishing genuine cross-lingual knowledge transfer from prior pre-training exposure in large language models.

Method: Automated pipeline that identifies time-sensitive knowledge entities from real-world domains, filters by temporal occurrence, verifies against model knowledge, generates factual questions, and translates them across languages.

Result: Cross-lingual transfer is strongly influenced by linguistic distance, often asymmetric across language directions; larger models improve transfer but gains diminish with scale and vary across domains.

Conclusion: LiveCLKTBench provides a reliable benchmark for multilingual transfer research and yields new insights into how linguistic factors shape transfer.

Abstract: Evaluating cross-lingual knowledge transfer in large language models is challenging, as correct answers in a target language may arise either from genuine transfer or from prior exposure during pre-training. We present LiveCLKTBench, an automated generation pipeline specifically designed to isolate and measure cross-lingual knowledge transfer. Our pipeline identifies self-contained, time-sensitive knowledge entities from real-world domains, filters them based on temporal occurrence, and verifies them against the model’s knowledge. The documents of these valid entities are then used to generate factual questions, which are translated into multiple languages to evaluate transferability across linguistic boundaries. Using LiveCLKTBench, we evaluate several LLMs across five languages and observe that cross-lingual transfer is strongly influenced by linguistic distance and often asymmetric across language directions. While larger models improve transfer, the gains diminish with scale and vary across domains. These findings provide new insights into multilingual transfer and demonstrate the value of LiveCLKTBench as a reliable benchmark for future research.
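
The two filtering steps in the pipeline (temporal occurrence, then verification against the model's knowledge) can be sketched as a simple predicate. The `Entity` record, the `knowledge_cutoff` date, and the `model_already_knows` flag are hypothetical names introduced here; the abstract does not specify how either check is implemented.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Entity:
    name: str
    first_seen: date            # when the entity first appeared in the world
    model_already_knows: bool   # result of a hypothetical closed-book probe


def select_valid_entities(entities, knowledge_cutoff):
    """Keep entities that postdate the cutoff and that the model cannot answer."""
    return [
        e for e in entities
        if e.first_seen > knowledge_cutoff and not e.model_already_knows
    ]
```

Only entities passing both checks would go on to question generation and translation, so a correct answer in another language is less likely to come from pre-training exposure.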

[4] COMPASS: Context-Modulated PID Attention Steering System for Hallucination Mitigation

Snigdha Pandya, Rohan Nagale, Kenji Sahay, Anna Lin, Shikhar Shiromani, Kevin Zhu, Dev Sunishchal

Main category: cs.CL

TL;DR: COMPASS is a lightweight control framework that uses a PID controller to dynamically modulate attention heads in LLMs during decoding, reducing factual hallucinations by maintaining better alignment with contextual evidence.

Motivation: LLMs often generate factually incorrect statements despite having access to relevant evidence, due to improper attention allocation between contextual and parametric knowledge. Understanding and steering this internal behavior is crucial for trustworthy deployment and scientific interpretability.

Method: Introduces COMPASS framework with Context Reliance Score (CRS) as an online probe to quantify context reliance. Uses a PID controller to dynamically modulate attention heads during decoding without retraining or multi-pass decoding.

Result: Across multiple benchmarks (HotpotQA, XSum, HaluEval, RAGTruth), COMPASS consistently reduces contextual hallucination rates by 2.8 to 5.8 percent absolute, while revealing how distinct attention heads contribute to evidence alignment.

Conclusion: Feedback-driven interpretability provides a pathway toward scientific understanding of LLM behavior, demonstrating that lightweight control frameworks can effectively reduce factual inconsistencies in model generation.

Abstract: Large language models (LLMs) often generate fluent but factually incorrect statements despite having access to relevant evidence, a failure mode rooted in how they allocate attention between contextual and parametric knowledge. Understanding and steering this internal behavior is key both for trustworthy deployment and for scientific interpretability of model mechanisms. We introduce COMPASS (Context-Modulated PID Attention Steering System), a lightweight, interpretable control framework that embeds a model-based feedback loop directly within decoding. COMPASS quantifies context reliance via a transparent metric, the Context Reliance Score (CRS), which serves as an online probe of how attention heads ground generation in evidence. Using this interpretable signal, a PID controller dynamically modulates attention heads to maintain factual consistency without retraining or multi-pass decoding. Across benchmarks (HotpotQA, XSum, HaluEval, RAGTruth), COMPASS consistently reduces contextual hallucination rates (2.8 to 5.8 percent absolute) while revealing how distinct attention heads contribute to evidence alignment. These results highlight feedback-driven interpretability as a pathway toward scientific understanding of LLM behavior.
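
The feedback loop can be illustrated with a textbook discrete PID controller driving a toy attention model. This sketch makes several assumptions that are not from the paper: `context_mass` (the share of softmax attention on context tokens) is only a stand-in for the Context Reliance Score, the control signal is a single additive gain on context logits, and the gains `kp`, `ki`, `kd` are arbitrary illustrative values.

```python
import math


class PIDController:
    """Textbook discrete PID controller with a unit time step."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, error):
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def context_mass(context_logits, other_logits, gain):
    """Softmax attention share on context tokens after adding `gain` to them."""
    ctx = sum(math.exp(x + gain) for x in context_logits)
    oth = sum(math.exp(x) for x in other_logits)
    return ctx / (ctx + oth)


def steer(context_logits, other_logits, target, steps=200):
    """Adjust the gain each step so the measured score tracks `target`."""
    pid = PIDController(kp=0.5, ki=0.5, kd=0.1)
    gain, history = 0.0, []
    for _ in range(steps):
        score = context_mass(context_logits, other_logits, gain)
        history.append(score)
        gain = pid.update(target - score)  # positional PID output is the gain
    return gain, history
```

The integral term is what removes the steady-state gap between the measured score and the target; a proportional-only controller would settle short of it in this toy setting.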

[5] The Impact of Prosodic Segmentation on Speech Synthesis of Spontaneous Speech

Julio Cesar Galdino, Sidney Evaldo Leal, Leticia Gabriella De Souza, Rodrigo de Freitas Lima, Antonio Nelson Fornari Mendes Moreira, Arnaldo Candido Junior, Miguel Oliveira, Edresson Casanova, Sandra M. Aluísio

Main category: cs.CL

TL;DR: Evaluates the effects of manual vs. automatic prosodic segmentation on Brazilian Portuguese speech synthesis using FastSpeech 2, finding that training with prosodic segmentation yields slightly more intelligible and acoustically natural speech.

Motivation: Spontaneous speech synthesis faces challenges in capturing natural conversational elements like turn-taking and disfluencies, with limited exploration of explicit prosodic segmentation datasets.

Method: Used FastSpeech 2 non-autoregressive model trained with manual and automatic prosodic segmentation annotations on Brazilian Portuguese data.

Result: Prosodic segmentation produced slightly more intelligible and acoustically natural speech; manual segmentation introduced greater variability for more natural prosody.

Conclusion: Prosodic segmentation slightly improves spontaneous speech synthesis quality, and manual annotations add prosodic variability that contributes to more natural prosody. All datasets, source code, and trained models are publicly available.

Abstract: Spontaneous speech presents several challenges for speech synthesis, particularly in capturing the natural flow of conversation, including turn-taking, pauses, and disfluencies. Although speech synthesis systems have made significant progress in generating natural and intelligible speech, primarily through architectures that implicitly model prosodic features such as pitch, intensity, and duration, the construction of datasets with explicit prosodic segmentation and their impact on spontaneous speech synthesis remains largely unexplored. This paper evaluates the effects of manual and automatic prosodic segmentation annotations in Brazilian Portuguese on the quality of speech synthesized by a non-autoregressive model, FastSpeech 2. Experimental results show that training with prosodic segmentation produced slightly more intelligible and acoustically natural speech. While automatic segmentation tends to create more regular segments, manual prosodic segmentation introduces greater variability, which contributes to more natural prosody. Analysis of neutral declarative utterances showed that both training approaches reproduced the expected nuclear accent pattern, but the prosodic model aligned more closely with natural pre-nuclear contours. To support reproducibility and future research, all datasets, source codes, and trained models are publicly available under the CC BY-NC-ND 4.0 license.

[6] Human or LLM as Standardized Patients? A Comparative Study for Medical Education

Bingquan Zhang, Xiaoxiao Liu, Yuchi Wang, Lei Zhou, Qianqian Xie, Benyou Wang

Main category: cs.CL

TL;DR: EasyMED is a multi-agent AI framework that simulates Standardized Patients, matching human SP learning outcomes while offering better scalability, cost efficiency, and psychological safety.

Motivation: Standardized Patients are expensive, inflexible, and difficult to scale, while existing LLM-based simulators show inconsistent behavior and lack rigorous comparison with human SP.

Method: Multi-agent framework with Patient Agent for realistic dialogue, Auxiliary Agent for factual consistency, and Evaluation Agent for actionable feedback. SPBench benchmark with real SP-doctor interactions across 14 specialties and 8 evaluation criteria.

Result: EasyMED matches human SP learning outcomes, produces greater skill gains for lower-baseline students, and offers improved flexibility, psychological safety, and cost efficiency.

Conclusion: EasyMED provides a scalable, cost-effective alternative to human Standardized Patients while maintaining educational effectiveness, particularly benefiting lower-performing students.

Abstract: Standardized Patients (SP) are indispensable for clinical skills training but remain expensive, inflexible, and difficult to scale. Existing large-language-model (LLM)-based SP simulators promise lower cost yet show inconsistent behavior and lack rigorous comparison with human SP. We present EasyMED, a multi-agent framework combining a Patient Agent for realistic dialogue, an Auxiliary Agent for factual consistency, and an Evaluation Agent that delivers actionable feedback. To support systematic assessment, we introduce SPBench, a benchmark of real SP-doctor interactions spanning 14 specialties and eight expert-defined evaluation criteria. Experiments demonstrate that EasyMED matches human SP learning outcomes while producing greater skill gains for lower-baseline students and offering improved flexibility, psychological safety, and cost efficiency.

[7] Opinion Mining and Analysis Using Hybrid Deep Neural Networks

Adel Hidri, Suleiman Ali Alsaif, Muteeb Alahmari, Eman AlShehri, Minyar Sassi Hidri

Main category: cs.CL

TL;DR: Proposes a hybrid deep neural network (HBGRU-LSTM) combining bidirectional GRU and LSTM layers for sentiment analysis, achieving 95% accuracy and improved recall for negative sentiments.

Motivation: Existing sentiment analysis methods (lexicon-based, traditional ML) are insufficient for handling contextual nuances and scalability. Deep learning shows promise but needs improvement in capturing semantic relationships and addressing class imbalance.

Method: Hybrid deep neural network model combining bidirectional gated recurrent unit (BGRU) and long short-term memory (LSTM) layers to enhance sentiment analysis capabilities.

Result: Achieved 95% testing accuracy, outperforming traditional DL frameworks (LSTM: 93.06%, CNN+LSTM: 93.31%, GRU+LSTM: 92.20%). Improved recall for negative sentiments from 86% to 96% and reduced misclassification loss from 20.24% to 13.3% when using balanced datasets.

Conclusion: The HBGRU-LSTM model effectively addresses challenges in sentiment analysis including contextual nuance, scalability, and class imbalance, demonstrating superior performance and enhanced generalization capabilities compared to existing approaches.

Abstract: Understanding customer attitudes has become a critical component of decision-making due to the growing influence of social media and e-commerce. Text-based opinions are the most structured, hence playing an important role in sentiment analysis. Most of the existing methods, which include lexicon-based approaches and traditional machine learning techniques, are insufficient for handling contextual nuances and scalability. While the latter has limitations in model performance and generalization, deep learning (DL) has achieved improvement, especially on semantic relationship capturing with recurrent neural networks (RNNs) and convolutional neural networks (CNNs). The aim of the study is to enhance opinion mining by introducing a hybrid deep neural network model that combines a bidirectional gated recurrent unit (BGRU) and long short-term memory (LSTM) layers to improve sentiment analysis, particularly addressing challenges such as contextual nuance, scalability, and class imbalance. To substantiate the efficacy of the proposed model, we conducted comprehensive experiments utilizing benchmark datasets, encompassing IMDB movie critiques and Amazon product evaluations. The introduced hybrid BGRU-LSTM (HBGRU-LSTM) architecture attained a testing accuracy of 95%, exceeding the performance of traditional DL frameworks such as LSTM (93.06%), CNN+LSTM (93.31%), and GRU+LSTM (92.20%). Moreover, our model exhibited a noteworthy enhancement in recall for negative sentiments, escalating from 86% (unbalanced dataset) to 96% (balanced dataset), thereby ensuring a more equitable and just sentiment classification. Furthermore, the model diminished misclassification loss from 20.24% for unbalanced to 13.3% for balanced dataset, signifying enhanced generalization and resilience.

[8] NAMeGEn: Creative Name Generation via A Novel Agent-based Multiple Personalized Goal Enhancement Framework

Shanlin Zhou, Xinpeng Wang, Jianxun Lian, Zhenghao Liu, Laks V. S. Lakshmanan, Xiaoyuan Yi, Yongtao Hao

Main category: cs.CL

TL;DR: NAMeGEn is a multi-agent optimization framework for Chinese baby naming that addresses challenges in creative text generation by iteratively extracting objectives, generating names, and evaluating them to meet diverse user requirements while providing meaningful explanations.

Motivation: Creative Natural Language Generation (CNLG) faces challenges with multi-objective flexibility (personalized user requirements) and interpretive complexity (understanding implicit meanings), especially in short-form text generation like baby naming.

Method: Proposes NAMeGEn, a multi-agent optimization framework that alternates between objective extraction, name generation, and evaluation. Uses a classical Chinese poetry corpus (17k+ poems) to enhance aesthetics and introduces the CBNames benchmark with tailored metrics.

Result: NAMeGEn effectively generates creative names meeting diverse personalized requirements while providing meaningful explanations, outperforming six baseline methods across various LLM backbones without any training.

Conclusion: The framework successfully addresses CNLG challenges in short-form text generation, demonstrating effectiveness in generating creative content that satisfies multiple user constraints and provides interpretive value.

Abstract: Trained on diverse human-authored texts, Large Language Models (LLMs) unlocked the potential for Creative Natural Language Generation (CNLG), benefiting various applications like advertising and storytelling. Nevertheless, CNLG still remains difficult due to two main challenges. (1) Multi-objective flexibility: user requirements are often personalized, fine-grained, and pluralistic, which LLMs struggle to satisfy simultaneously; (2) Interpretive complexity: beyond generation, creativity also involves understanding and interpreting implicit meaning to enhance users’ perception. These challenges significantly limit current methods, especially in short-form text generation, in generating creative and insightful content. To address this, we focus on Chinese baby naming, a representative short-form CNLG task requiring adherence to explicit user constraints (e.g., length, semantics, anthroponymy) while offering meaningful aesthetic explanations. We propose NAMeGEn, a novel multi-agent optimization framework that iteratively alternates between objective extraction, name generation, and evaluation to meet diverse requirements and generate accurate explanations. To support this task, we further construct a classical Chinese poetry corpus with 17k+ poems to enhance aesthetics, and introduce CBNames, a new benchmark with tailored metrics. Extensive experiments demonstrate that NAMeGEn effectively generates creative names that meet diverse, personalized requirements while providing meaningful explanations, outperforming six baseline methods spanning various LLM backbones without any training.

[9] Hierarchical Token Prepending: Enhancing Information Flow in Decoder-based LLM Embeddings

Xueying Ding, Xingyue Huang, Mingxuan Ju, Liam Collins, Yozen Liu, Leman Akoglu, Neil Shah, Tong Zhao

Main category: cs.CL

TL;DR: HTP improves text embeddings by prepending hierarchical summary tokens to enable backward information flow in attention mechanisms, achieving better performance on long documents.

Motivation: Standard LLM embeddings suffer from causal attention that restricts backward information flow, and existing single-token prepending methods over-compress information in long documents.

Method: Partitions input into blocks and prepends block-level summary tokens to subsequent blocks, plus uses mean-pooling instead of last-token pooling to address over-squashing.

Result: Consistent performance gains across 11 retrieval datasets and 30 general embedding benchmarks, with especially strong improvements in long-context settings.

Conclusion: HTP provides a simple, architecture-agnostic method that enhances both zero-shot and finetuned models for superior long-document embeddings.

Abstract: Large language models produce powerful text embeddings, but their causal attention mechanism restricts the flow of information from later to earlier tokens, degrading representation quality. While recent methods attempt to solve this by prepending a single summary token, they over-compress information, hence harming performance on long documents. We propose Hierarchical Token Prepending (HTP), a method that resolves two critical bottlenecks. To mitigate attention-level compression, HTP partitions the input into blocks and prepends block-level summary tokens to subsequent blocks, creating multiple pathways for backward information flow. To address readout-level over-squashing, we replace last-token pooling with mean-pooling, a choice supported by theoretical analysis. HTP achieves consistent performance gains across 11 retrieval datasets and 30 general embedding benchmarks, especially in long-context settings. As a simple, architecture-agnostic method, HTP enhances both zero-shot and finetuned models, offering a scalable route to superior long-document embeddings.
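
The two ingredients described above, block partitioning with prepended summary tokens and mean-pooling in place of last-token pooling, can be sketched on plain Python lists. This shows only the sequence layout and the pooling arithmetic. The placeholder token names `<SUM_j>` and the choice to prefix block k with summaries of all earlier blocks are assumptions, since the abstract does not give these details.

```python
def partition_blocks(tokens, block_size):
    """Split a token list into consecutive blocks of at most `block_size`."""
    return [tokens[i:i + block_size] for i in range(0, len(tokens), block_size)]


def prepend_block_summaries(blocks):
    """Prefix block k with placeholder summary tokens for blocks 0..k-1."""
    laid_out = []
    for k, block in enumerate(blocks):
        summaries = [f"<SUM_{j}>" for j in range(k)]  # hypothetical token names
        laid_out.append(summaries + block)
    return laid_out


def last_token_pool(vectors):
    """Readout used by many decoder-based embedders: the final token's state."""
    return vectors[-1]


def mean_pool(vectors):
    """Average the per-token states coordinate by coordinate."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]
```

With last-token pooling the whole document must be squeezed through one position, whereas mean-pooling lets every token state contribute equally to the embedding.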

[10] Retrieval Augmented Generation based context discovery for ASR

Dimitrios Siskos, Stavros Papadopoulos, Pablo Peso Parada, Jisi Zhang, Karthikeyan Saravanan, Anastasios Drosou

Main category: cs.CL

TL;DR: Proposes embedding-based retrieval for automatic context discovery in ASR to improve transcription of rare terms, achieving up to 17% WER reduction.

Motivation: Improve ASR accuracy for rare/out-of-vocabulary terms by automatically discovering relevant context, addressing the challenge of identifying appropriate context.

Method: Embedding-based retrieval approach for automatic context discovery, compared with two LLM alternatives: LLM-based context generation via prompting and post-recognition transcript correction.

Result: Proposed approach reduces WER by up to 17% relative to no-context baseline, while oracle context achieves up to 24.1% reduction on TED-LIUMv3, Earnings21 and SPGISpeech datasets.

Conclusion: Embedding-based retrieval is an effective strategy for automatic context discovery in ASR, significantly improving transcription accuracy for rare terms.

Abstract: This work investigates retrieval augmented generation as an efficient strategy for automatic context discovery in context-aware Automatic Speech Recognition (ASR) systems, in order to improve transcription accuracy in the presence of rare or out-of-vocabulary terms. However, identifying the right context automatically remains an open challenge. This work proposes an efficient embedding-based retrieval approach for automatic context discovery in ASR. To contextualize its effectiveness, two alternatives based on large language models (LLMs) are also evaluated: (1) LLM-based context generation via prompting, and (2) post-recognition transcript correction using LLMs. Experiments on the TED-LIUMv3, Earnings21, and SPGISpeech datasets demonstrate that the proposed approach reduces WER by up to 17% (percentage difference) relative to using no context, while the oracle context results in a reduction of up to 24.1%.
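
Embedding-based context retrieval and the WER arithmetic can be sketched in a few lines. Everything here is illustrative: the toy vectors stand in for real embeddings, `retrieve_context` is a hypothetical name, and the relative reduction is assumed to be (baseline - improved) / baseline, which the abstract does not spell out.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_context(query_vec, index, k=2):
    """Return the k terms whose vectors are most similar to the query."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [term for term, _ in ranked[:k]]


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by reference length (non-empty reference)."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            # deletion, insertion, substitution (or match)
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1] / len(ref)


def relative_reduction(baseline, improved):
    """Relative improvement of `improved` over `baseline` (assumed definition)."""
    return (baseline - improved) / baseline
```

The retrieved terms would then be passed to a context-aware recognizer as biasing context; that step depends on the ASR system and is not shown.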

[11] Step-Audio-EditX Technical Report

Chao Yan, Boyong Wu, Peng Yang, Pengfei Tan, Guoqiang Hu, Li Xie, Yuxin Zhang, Xiangyu, Zhang, Fei Tian, Xuerui Yang, Xiangyu Zhang, Daxin Jiang, Shuchang Zhou, Gang Yu

Main category: cs.CL

TL;DR: Step-Audio-EditX is the first open-source LLM-based audio model that excels at expressive audio editing (emotion, speaking style, paralinguistics) and zero-shot TTS, using only large-margin synthetic data without embedding priors.

Motivation: To create an expressive audio editing model that can handle fine-grained control over emotion, speaking style, and paralinguistics while avoiding the limitations of conventional representation-level disentanglement approaches.

Method: Leverages large-margin synthetic data exclusively, eliminating the need for embedding-based priors or auxiliary modules. This enables both iterative control and high expressivity across different voices.

Result: Outperforms MiniMax-2.6-hd and Doubao-Seed-TTS-2.0 in emotion editing and other fine-grained control tasks.

Conclusion: The large-margin learning approach represents a fundamental shift from conventional representation-level disentanglement and enables superior expressive audio editing capabilities.

Abstract: We present Step-Audio-EditX, the first open-source LLM-based audio model excelling at expressive and iterative audio editing encompassing emotion, speaking style, and paralinguistics alongside robust zero-shot text-to-speech (TTS) capabilities. Our core innovation lies in leveraging only large-margin synthetic data, which circumvents the need for embedding-based priors or auxiliary modules. This large-margin learning approach enables both iterative control and high expressivity across voices, and represents a fundamental pivot from the conventional focus on representation-level disentanglement. Evaluation results demonstrate that Step-Audio-EditX surpasses both MiniMax-2.6-hd and Doubao-Seed-TTS-2.0 in emotion editing and other fine-grained control tasks.

[12] Mathematical Analysis of Hallucination Dynamics in Large Language Models: Uncertainty Quantification, Advanced Decoding, and Principled Mitigation

Moses Kiprono

Main category: cs.CL

TL;DR: A mathematical framework to understand, measure, and mitigate hallucinations in Large Language Models using probabilistic modeling, information theory, and Bayesian methods.

Motivation: LLMs are powerful but susceptible to hallucinations - plausible-sounding outputs that are factually incorrect or unsupported, which undermines their reliability.

Method: Uses probabilistic modeling, information theory, trigonometric signal analysis, and Bayesian uncertainty estimation to analyze autoregressive error compounding, proposes refined uncertainty metrics (semantic and phase-aware), and develops mitigation strategies including contrastive decoding, retrieval-augmented grounding, factual alignment, and abstention.

Result: Developed a unified mathematical framework that connects recent advances in calibration, retrieval, and alignment to address LLM hallucinations.

Conclusion: The framework provides principled approaches to support safer and more reliable LLMs by systematically understanding and mitigating hallucinations through mathematical foundations.

Abstract: Large Language Models (LLMs) are powerful linguistic engines but remain susceptible to hallucinations: plausible-sounding outputs that are factually incorrect or unsupported. In this work, we present a mathematically grounded framework to understand, measure, and mitigate these hallucinations. Drawing on probabilistic modeling, information theory, trigonometric signal analysis, and Bayesian uncertainty estimation, we analyze how errors compound autoregressively, propose refined uncertainty metrics, including semantic and phase-aware variants, and develop principled mitigation strategies such as contrastive decoding, retrieval-augmented grounding, factual alignment, and abstention. This unified lens connects recent advances in calibration, retrieval, and alignment to support safer and more reliable LLMs.
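
The autoregressive error compounding mentioned in the abstract can be illustrated with a standard simplification, which is an assumption here and not necessarily the paper's model: suppose each generated token is erroneous independently with probability \epsilon. The probability that an n-token output contains no error then decays geometrically:

```latex
P(\text{no error in } n \text{ tokens}) = (1-\epsilon)^{n} \approx e^{-\epsilon n}
\quad \text{for small } \epsilon .
```

For example, with \epsilon = 0.01 and n = 100, (1 - 0.01)^{100} \approx 0.366, so under this assumption roughly two in three such outputs would contain at least one error.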

[13] Teaching According to Students’ Aptitude: Personalized Mathematics Tutoring via Persona-, Memory-, and Forgetting-Aware LLMs

Yang Wu, Rujing Yao, Tong Zhang, Yufei Shi, Zhuoren Jiang, Zhushan Li, Xiaozhong Liu

Main category: cs.CL

TL;DR: TASA is a student-aware tutoring framework that integrates persona, memory, and forgetting dynamics for personalized mathematics learning using LLMs.

Motivation: Existing LLM-based tutoring systems fail to capture how students' knowledge evolves dynamically across proficiencies, conceptual gaps, and forgetting patterns, particularly in mathematics where fine-grained scaffolding is crucial.

Method: TASA maintains structured student persona (proficiency profiles) and event memory (prior interactions), incorporates continuous forgetting curve with knowledge tracing to dynamically update mastery states, and generates difficulty-calibrated questions/explanations.

Result: Empirical results show TASA achieves superior learning outcomes and more adaptive tutoring behavior compared to representative baselines.

Conclusion: Modeling temporal forgetting and learner profiles is crucial for effective LLM-based tutoring systems in mathematics education.

Abstract: Large Language Models (LLMs) are increasingly integrated into intelligent tutoring systems to provide human-like and adaptive instruction. However, most existing approaches fail to capture how students’ knowledge evolves dynamically across their proficiencies, conceptual gaps, and forgetting patterns. This challenge is particularly acute in mathematics tutoring, where effective instruction requires fine-grained scaffolding precisely calibrated to each student’s mastery level and cognitive retention. To address this issue, we propose TASA (Teaching According to Students’ Aptitude), a student-aware tutoring framework that integrates persona, memory, and forgetting dynamics for personalized mathematics learning. Specifically, TASA maintains a structured student persona capturing proficiency profiles and an event memory recording prior learning interactions. By incorporating a continuous forgetting curve with knowledge tracing, TASA dynamically updates each student’s mastery state and generates contextually appropriate, difficulty-calibrated questions and explanations. Empirical results demonstrate that TASA achieves superior learning outcomes and more adaptive tutoring behavior compared to representative baselines, underscoring the importance of modeling temporal forgetting and learner profiles in LLM-based tutoring systems.
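
How a forgetting curve can feed difficulty calibration is sketched below. The abstract does not give TASA's formula, so this sketch assumes an Ebbinghaus-style exponential decay. The names `retained_mastery` and `choose_difficulty`, the `stability_days` parameter, and the 0.4 / 0.7 thresholds are all hypothetical.

```python
import math


def retained_mastery(mastery, elapsed_days, stability_days):
    """Decay a mastery estimate exponentially with time since last practice."""
    return mastery * math.exp(-elapsed_days / stability_days)


def choose_difficulty(mastery):
    """Map a mastery estimate in [0, 1] to a question difficulty band."""
    if mastery < 0.4:
        return "easy"
    if mastery < 0.7:
        return "medium"
    return "hard"
```

A student who had mastery 0.8 on a skill but has not practiced for one stability period drops to about 0.29 under this assumption, so the tutor would fall back to easier review questions.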

[14] HinTel-AlignBench: A Framework and Benchmark for Hindi-Telugu with English-Aligned Samples

Rishikant Chigrupaatii, Ponnada Sai Tulasi Kanishka, Lalit Chandra Routhu, Martin Patel Sama Supratheek Reddy, Divyam Gupta, Dasari Srikar, Krishna Teja Kuchimanchi, Rajiv Misra, Rohun Tripathi

Main category: cs.CL

TL;DR: A framework for evaluating Vision-Language Models in Indian languages, addressing limitations in current multilingual VLM evaluations through semi-automated dataset creation and comprehensive benchmarking.

Motivation: To address gaps in multilingual VLM evaluations including reliance on auto-translations, narrow task coverage, limited sample sizes, and lack of culturally relevant content for Indian languages.

Method: Semi-automated dataset creation framework using back-translation, filtering, and human verification to create HinTel-AlignBench with diverse sources in Hindi and Telugu, including adapted English datasets and native Indic datasets.

Result: Found performance regression in Indian languages vs English for 4 out of 5 tasks across all models, with average regression of 8.3 points in Hindi and 5.5 points in Telugu. Identified common failure modes in multilingual multimodal understanding.

Conclusion: The framework enables robust evaluation of VLMs for Indian languages, revealing significant performance gaps and highlighting areas for improvement in multilingual multimodal AI systems.

Abstract: With nearly 1.5 billion people and more than 120 major languages, India represents one of the most diverse regions in the world. As multilingual Vision-Language Models (VLMs) gain prominence, robust evaluation methodologies are essential to drive progress toward equitable AI for low-resource languages. Current multilingual VLM evaluations suffer from four major limitations: reliance on unverified auto-translations, narrow task/domain coverage, limited sample sizes, and lack of cultural and natively sourced Question-Answering (QA). To address these gaps, we present a scalable framework to evaluate VLMs in Indian languages and compare their performance with that in English. Using the framework, we generate HinTel-AlignBench, a benchmark that draws from diverse sources in Hindi and Telugu with English-aligned samples. Our contributions are threefold: (1) a semi-automated dataset creation framework combining back-translation, filtering, and human verification; (2) the most comprehensive vision-language benchmark for Hindi and Telugu, including adapted English datasets (VQAv2, RealWorldQA, CLEVR-Math) and native novel Indic datasets (JEE for STEM, VAANI for cultural grounding) with approximately 4,000 QA pairs per language; and (3) a detailed performance analysis of various State-of-the-Art (SOTA) open-weight and closed-source VLMs. We find a regression in performance for tasks in Indian languages versus English for 4 out of 5 tasks across all the models, with an average regression of 8.3 points in Hindi and 5.5 points in Telugu. We categorize common failure modes to highlight concrete areas of improvement in multilingual multimodal understanding.

[15] Unveiling Intrinsic Dimension of Texts: from Academic Abstract to Creative Story

Vladislav Pedashenko, Laida Kushnareva, Yana Khassan Nibal, Eduard Tulchinskii, Kristian Kuznetsov, Vladislav Zharchinskii, Yury Maximov, Irina Piontkovskaya

Main category: cs.CL

TL;DR: Intrinsic dimension (ID) in LLMs is determined by text genre and linguistic features, with scientific text having low ID and creative/opinion writing having high ID, revealing different representational complexities.

DetailsMotivation: To understand the textual determinants of intrinsic dimension in LLMs, which is important for analyzing training dynamics, scaling behavior, and dataset structure but remains underexplored.

Method: Cross-encoder analysis, linguistic feature analysis, and sparse autoencoders (SAEs) to identify causal features affecting ID, plus steering experiments to confirm causality.

Result: ID is uncorrelated with entropy after controlling for length; scientific prose has low ID (~8), encyclopedic content medium ID (~9), and creative/opinion writing high ID (~10.5); SAE analysis shows that scientific signals reduce ID while humanized signals increase it.

Conclusion: Scientific writing appears “easy” for contemporary models while fiction, opinion, and affect add representational degrees of freedom, providing practical guidance for proper use and interpretation of ID-based results.

Abstract: Intrinsic dimension (ID) is an important tool in modern LLM analysis, informing studies of training dynamics, scaling behavior, and dataset structure, yet its textual determinants remain underexplored. We provide the first comprehensive study grounding ID in interpretable text properties through cross-encoder analysis, linguistic features, and sparse autoencoders (SAEs). In this work, we establish three key findings. First, ID is complementary to entropy-based metrics: after controlling for length, the two are uncorrelated, with ID capturing geometric complexity orthogonal to prediction quality. Second, ID exhibits robust genre stratification: scientific prose shows low ID (~8), encyclopedic content medium ID (~9), and creative/opinion writing high ID (~10.5) across all models tested. This reveals that contemporary LLMs find scientific text “representationally simple” while fiction requires additional degrees of freedom. Third, using SAEs, we identify causal features: scientific signals (formal tone, report templates, statistics) reduce ID; humanized signals (personalization, emotion, narrative) increase it. Steering experiments confirm these effects are causal. Thus, for contemporary models, scientific writing appears comparatively “easy”, whereas fiction, opinion, and affect add representational degrees of freedom. Our multi-faceted analysis provides practical guidance for the proper use of ID and the sound interpretation of ID-based results.

[16] OEMA: Ontology-Enhanced Multi-Agent Collaboration Framework for Zero-Shot Clinical Named Entity Recognition

Xinli Tao, Xin Dong, Xuezhong Zhou

Main category: cs.CL

TL;DR: OEMA is a zero-shot clinical NER framework using multi-agent collaboration that achieves near-supervised performance without requiring annotated data.

DetailsMotivation: Supervised clinical NER models require costly annotated data, while existing zero-shot approaches struggle with example selection granularity and integrating prompts with self-improvement.

Method: Three-component framework: self-annotator generates examples, discriminator filters them via SNOMED CT, and predictor uses entity descriptions for accurate inference.

Result: Achieves state-of-the-art exact-match performance on MTSamples and VAERS datasets, matches supervised BioClinicalBERT under related-match, and surpasses CRF.

Conclusion: OEMA addresses key zero-shot NER challenges through ontology-guided reasoning and multi-agent collaboration, showing promise for clinical NLP applications.

Abstract: Clinical named entity recognition (NER) is crucial for extracting information from electronic health records (EHRs), but supervised models like CRF and BioClinicalBERT require costly annotated data. While zero-shot NER with large language models (LLMs) reduces this dependency, it struggles with example selection granularity and integrating prompts with self-improvement. To address this, we propose OEMA, a zero-shot clinical NER framework using multi-agent collaboration. OEMA’s three components are: a self-annotator generating examples, a discriminator filtering them via SNOMED CT, and a predictor using entity descriptions for accurate inference. On MTSamples and VAERS datasets, OEMA achieves state-of-the-art exact-match performance. Under related-match, it matches supervised BioClinicalBERT and surpasses CRF. OEMA addresses key zero-shot NER challenges through ontology-guided reasoning and multi-agent collaboration, achieving near-supervised performance and showing promise for clinical NLP applications.

[17] Context Cascade Compression: Exploring the Upper Limits of Text Compression

Fanfan Liu, Haibo Qiu

Main category: cs.CL

TL;DR: C3 (Context Cascade Compression) uses two LLMs in cascade to compress long contexts into latent tokens with high compression ratios (20x-40x), achieving 93-98% decoding accuracy while outperforming optical character compression methods.

DetailsMotivation: Address computational and memory challenges of million-level token inputs in long-context tasks for LLMs, inspired by DeepSeek-OCR's preliminary optical compression research.

Method: Cascades two LLMs: small LLM compresses long context into latent tokens (32-64 length), large LLM decodes from compressed context. Uses pure-text pipeline ignoring visual factors like layout and color.

Result: Achieves 98% decoding accuracy at 20x compression ratio (vs ~60% for DeepSeek-OCR), maintains ~93% accuracy at 40x compression ratio.

Conclusion: C3 demonstrates superior performance over optical character compression, suggests upper bound for compression ratios in OCR and related fields, and shows feasibility of high-ratio text compression.

Abstract: Million-level token inputs in long-context tasks pose significant computational and memory challenges for Large Language Models (LLMs). Recently, DeepSeek-OCR conducted research into the feasibility of Contexts Optical Compression and achieved preliminary results. Inspired by this, we introduce Context Cascade Compression (C3) to explore the upper limits of text compression. Our method cascades two LLMs of different sizes to handle the compression and decoding tasks. Specifically, a small LLM, acting as the first stage, performs text compression by condensing a long context into a set of latent tokens (e.g., 32 or 64 in length), achieving a high ratio of text tokens to latent tokens. A large LLM, as the second stage, then executes the decoding task on this compressed context. Experiments show that at a 20x compression ratio (where the number of text tokens is 20 times the number of latent tokens), our model achieves 98% decoding accuracy, compared to approximately 60% for DeepSeek-OCR. When we further increase the compression ratio to 40x, the accuracy is maintained at around 93%. This indicates that in the domain of context compression, C3 Compression demonstrates superior performance and feasibility over optical character compression. C3 uses a simpler, pure-text pipeline that ignores factors like layout, color, and information loss from a visual encoder. This also suggests a potential upper bound for compression ratios in future work on optical character compression, OCR, and related fields. Code and model weights are publicly accessible at https://github.com/liufanfanlff/C3-Context-Cascade-Compression
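The compression ratio defined in the abstract (text tokens divided by latent tokens) implies concrete context sizes. The sketch below only does that arithmetic. The helper name `text_tokens_at_ratio` is hypothetical, and pairing each latent length with each ratio is an illustration, not a configuration reported by the authors.

```python
def text_tokens_at_ratio(latent_tokens: int, ratio: int) -> int:
    """Number of text tokens represented by `latent_tokens` at a given ratio.

    Uses the abstract's definition: ratio = text tokens / latent tokens.
    """
    if latent_tokens <= 0 or ratio <= 0:
        raise ValueError("latent_tokens and ratio must be positive")
    return latent_tokens * ratio


# Latent lengths (32, 64) and ratios (20x, 40x) are the values named in the
# abstract; combining them in a grid is an assumption for illustration.
for latent in (32, 64):
    for ratio in (20, 40):
        print(latent, ratio, text_tokens_at_ratio(latent, ratio))
```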

[18] IndicGEC: Powerful Models, or a Measurement Mirage?

Sowmya Vajjala

Main category: cs.CL

TL;DR: TeamNRC’s zero/few-shot prompting approach using language models (4B to large proprietary) achieved Rank 4 in Telugu and Rank 2 in Hindi in the GEC shared task, with extensions to Tamil, Malayalam, and Bangla highlighting small model potential and dataset quality concerns.

DetailsMotivation: To participate in the BHASHA-Task 1 Grammatical Error Correction shared task for 5 Indian languages and explore the effectiveness of zero/few-shot prompting approaches across different language model sizes.

Method: Used zero/few-shot prompting of language models ranging from 4B parameters to large proprietary models, extending experiments to Tamil, Malayalam, and Bangla languages.

Result: Achieved Rank 4 in Telugu (GLEU: 83.78) and Rank 2 in Hindi (GLEU: 84.31), demonstrating the potential of small language models for Indian language GEC tasks.

Conclusion: Small language models show significant potential for Indian language grammatical error correction, but concerns remain about dataset quality and appropriate evaluation metrics for Indian language scripts.

Abstract: In this paper, we report the results of the TeamNRC’s participation in the BHASHA-Task 1 Grammatical Error Correction shared task https://github.com/BHASHA-Workshop/IndicGEC2025/ for 5 Indian languages. Our approach, focusing on zero/few-shot prompting of language models of varying sizes (4B to large proprietary models) achieved a Rank 4 in Telugu and Rank 2 in Hindi with GLEU scores of 83.78 and 84.31 respectively. In this paper, we extend the experiments to the other three languages of the shared task - Tamil, Malayalam and Bangla, and take a closer look at the data quality and evaluation metric used. Our results primarily highlight the potential of small language models, and summarize the concerns related to creating good quality datasets and appropriate metrics for this task that are suitable for Indian language scripts.

[19] MAPROC at AHaSIS Shared Task: Few-Shot and Sentence Transformer for Sentiment Analysis of Arabic Hotel Reviews

Randa Zarnoufi

Main category: cs.CL

TL;DR: The paper presents a sentiment analysis system for Arabic dialects using SetFit framework, achieving 73% F1 score and ranking 12th in the AHaSIS shared task.

DetailsMotivation: Address challenges in Arabic dialect sentiment analysis due to linguistic diversity and scarcity of annotated data, particularly in specialized domains like hospitality.

Method: Employed SetFit (Sentence Transformer Fine-tuning) framework, a data-efficient few-shot learning technique for sentiment classification on Moroccan and Saudi dialect hotel reviews.

Result: Achieved 73% F1 score on official evaluation set, ranking 12th among 26 participants in the AHaSIS shared task.

Conclusion: Demonstrates the potential of few-shot learning to effectively handle data scarcity in processing nuanced dialectal Arabic text within specialized domains.

Abstract: Sentiment analysis of Arabic dialects presents significant challenges due to linguistic diversity and the scarcity of annotated data. This paper describes our approach to the AHaSIS shared task, which focuses on sentiment analysis of Arabic dialects in the hospitality domain. The dataset comprises hotel reviews written in Moroccan and Saudi dialects, and the objective is to classify the reviewer’s sentiment as positive, negative, or neutral. We employed the SetFit (Sentence Transformer Fine-tuning) framework, a data-efficient few-shot learning technique. On the official evaluation set, our system achieved an F1 of 73%, ranking 12th among 26 participants. This work highlights the potential of few-shot learning to address data scarcity in processing nuanced dialectal Arabic text within specialized domains like hotel reviews.

[20] Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models

Piercosma Bisconti, Matteo Prandi, Federico Pierucci, Francesco Giarrusso, Marcantonio Bracale, Marcello Galisai, Vincenzo Suriani, Olga Sorokoletova, Federico Sartore, Daniele Nardi

Main category: cs.CL

TL;DR: Poetic prompts function as universal jailbreaks for LLMs, achieving high attack success rates across 25 models and outperforming non-poetic baselines by up to 18x.

DetailsMotivation: To investigate whether stylistic variation through poetry can systematically circumvent LLM safety mechanisms, revealing limitations in current alignment methods.

Method: Used curated poetic prompts and converted 1,200 harmful prompts into verse via standardized meta-prompt, evaluated with ensemble judge models and human validation.

Result: Achieved 62% success for hand-crafted poems and 43% for meta-prompt conversions, substantially outperforming non-poetic baselines across multiple risk domains.

Conclusion: Poetic framing reveals systematic vulnerability in LLM safety, suggesting fundamental limitations in current alignment and evaluation approaches.

Abstract: We present evidence that adversarial poetry functions as a universal single-turn jailbreak technique for large language models (LLMs). Across 25 frontier proprietary and open-weight models, curated poetic prompts yielded high attack-success rates (ASR), with some providers exceeding 90%. Mapping prompts to MLCommons and EU CoP risk taxonomies shows that poetic attacks transfer across CBRN, manipulation, cyber-offence, and loss-of-control domains. Converting 1,200 MLCommons harmful prompts into verse via a standardized meta-prompt produced ASRs up to 18 times higher than their prose baselines. Outputs are evaluated using an ensemble of open-weight judge models and a human-validated stratified subset (with double-annotations to measure agreement). Disagreements were manually resolved. Poetic framing achieved an average jailbreak success rate of 62% for hand-crafted poems and approximately 43% for meta-prompt conversions, substantially outperforming non-poetic baselines and revealing a systematic vulnerability across model families and safety training approaches. These findings demonstrate that stylistic variation alone can circumvent contemporary safety mechanisms, suggesting fundamental limitations in current alignment methods and evaluation protocols.

[21] HEAD-QA v2: Expanding a Healthcare Benchmark for Reasoning

Alexis Correa-Guillén, Carlos Gómez-Rodríguez, David Vilares

Main category: cs.CL

TL;DR: HEAD-QA v2 is an expanded Spanish/English healthcare reasoning dataset with 12,000+ questions from professional exams, used to benchmark LLMs and support multilingual research.

DetailsMotivation: Address the need for high-quality datasets capturing healthcare reasoning complexity and support multilingual biomedical reasoning research.

Method: Extended dataset from Spanish professional exams, benchmarked open-source LLMs using prompting, RAG, and probability-based answer selection, created multilingual versions.

Result: Performance mainly driven by model scale and intrinsic reasoning ability, with complex inference strategies showing limited gains.

Conclusion: HEAD-QA v2 serves as a reliable resource for advancing biomedical reasoning research and model improvement.

Abstract: We introduce HEAD-QA v2, an expanded and updated version of a Spanish/English healthcare multiple-choice reasoning dataset originally released by Vilares and Gómez-Rodríguez (2019). The update responds to the growing need for high-quality datasets that capture the linguistic and conceptual complexity of healthcare reasoning. We extend the dataset to over 12,000 questions from ten years of Spanish professional exams, benchmark several open-source LLMs using prompting, RAG, and probability-based answer selection, and provide additional multilingual versions to support future work. Results indicate that performance is mainly driven by model scale and intrinsic reasoning ability, with complex inference strategies obtaining limited gains. Together, these results establish HEAD-QA v2 as a reliable resource for advancing research on biomedical reasoning and model improvement.

[22] The Empowerment of Science of Science by Large Language Models: New Tools and Methods

Guoqiang Liang, Jingqian Gong, Mengxuan Li, Gege Lin, Shuo Zhang

Main category: cs.CL

TL;DR: This paper reviews LLM technologies for users and explores their applications in scientometrics, including AI agents for scientific evaluation, research front detection, and knowledge graph construction.

DetailsMotivation: LLMs show exceptional capabilities in various domains and are advancing towards AGI, making them crucial in global technology. The paper aims to provide a comprehensive review of LLM technologies from a user perspective and explore their potential in the scientometric domain.

Method: Conducts comprehensive review of core LLM technologies (prompt engineering, knowledge-enhanced RAG, fine-tuning, pretraining, tool learning), traces SciSci development history, and presents forward-looking applications in scientometrics.

Result: Presents a systematic overview of LLM technologies for users and proposes specific applications in scientometrics including AI agent-based scientific evaluation models, new research front detection methods, and knowledge graph building approaches using LLMs.

Conclusion: LLMs have significant potential to transform scientometrics through AI agent-based evaluation systems, enhanced research front detection, and automated knowledge graph construction, representing important future directions for the field.

Abstract: Large language models (LLMs) have exhibited exceptional capabilities in natural language understanding and generation, image recognition, and multimodal tasks, charting a course towards AGI and emerging as a central issue in the global technological race. This manuscript conducts a comprehensive review of the core technologies that support LLMs from a user standpoint, including prompt engineering, knowledge-enhanced retrieval-augmented generation, fine-tuning, pretraining, and tool learning. Additionally, it traces the historical development of Science of Science (SciSci) and presents a forward-looking perspective on the potential applications of LLMs within the scientometric domain. Furthermore, it discusses the prospect of an AI-agent-based model for scientific evaluation, and presents LLM-based methods for detecting new research fronts and building knowledge graphs.

[23] A Compliance-Preserving Retrieval System for Aircraft MRO Task Search

Byungho Jo

Main category: cs.CL

TL;DR: A compliance-preserving retrieval system for aircraft maintenance that uses LLM reranking and semantic search to reduce manual lookup time by 95% while maintaining regulatory compliance.

DetailsMotivation: Aircraft Maintenance Technicians spend up to 30% of work time searching manuals, creating efficiency bottlenecks in MRO operations where every procedure must be traceable to certified sources.

Method: The system adapts LLM reranking and semantic search to operate alongside certified legacy viewers, constructs revision-robust embeddings from ATA chapter hierarchies, and uses vision-language parsing to structure certified content.

Result: Evaluation on 49k synthetic queries achieved >90% retrieval accuracy, bilingual studies with 10 licensed AMTs showed 90.9% top-10 success rate and 95% reduction in lookup time (from 6-15 minutes to 18 seconds per task).

Conclusion: Semantic retrieval can operate within strict regulatory constraints and meaningfully reduce operational workload in real-world multilingual MRO workflows.

Abstract: Aircraft Maintenance Technicians (AMTs) spend up to 30% of work time searching manuals, a documented efficiency bottleneck in MRO operations where every procedure must be traceable to certified sources. We present a compliance-preserving retrieval system that adapts LLM reranking and semantic search to aviation MRO environments by operating alongside, rather than replacing, certified legacy viewers. The system constructs revision-robust embeddings from ATA chapter hierarchies and uses vision-language parsing to structure certified content, allowing technicians to preview ranked tasks and access verified procedures in existing viewers. Evaluation on 49k synthetic queries achieves >90% retrieval accuracy, while bilingual controlled studies with 10 licensed AMTs demonstrate 90.9% top-10 success rate and 95% reduction in lookup time, from 6-15 minutes to 18 seconds per task. These gains provide concrete evidence that semantic retrieval can operate within strict regulatory constraints and meaningfully reduce operational workload in real-world multilingual MRO workflows.
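The reported lookup-time figures can be checked with a short calculation. The function name `percent_reduction` is hypothetical, and the code only restates the arithmetic implied by the numbers in the abstract (6-15 minutes before, 18 seconds after).

```python
def percent_reduction(before_s: float, after_s: float) -> float:
    """Percentage reduction when going from `before_s` to `after_s` seconds."""
    if before_s <= 0:
        raise ValueError("before_s must be positive")
    return 100.0 * (before_s - after_s) / before_s


# Bounds of the manual lookup range stated in the abstract, in seconds.
low_before, high_before = 6 * 60, 15 * 60
after = 18

print(percent_reduction(low_before, after))   # against the 6-minute bound
print(percent_reduction(high_before, after))  # against the 15-minute bound
```

Under this reading, the quoted 95% corresponds to the 6-minute lower bound of the manual lookup range; measured against the 15-minute upper bound, the same 18 seconds would be a larger reduction.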

[24] DEPO: Dual-Efficiency Preference Optimization for LLM Agents

Sirui Chen, Mengshi Zhao, Lei Xu, Yuying Zhao, Beier Zhu, Hanwang Zhang, Shengjie Zhao, Chaochao Lu

Main category: cs.CL

TL;DR: DEPO is a dual-efficiency preference optimization method that improves LLM agent efficiency by reducing both token usage per step and number of steps needed to complete tasks, achieving significant efficiency gains while maintaining or improving performance.

DetailsMotivation: Richer reasoning in LLM agents often leads to longer chain of thought, hampering interaction efficiency in real-world scenarios, but there's a lack of systematic definition of LLM agent efficiency for targeted improvements.

Method: Proposed DEPO - a dual-efficiency preference optimization method that jointly rewards succinct responses (step-level efficiency) and fewer action steps (trajectory-level efficiency).

Result: DEPO cuts token usage by up to 60.9% and steps by up to 26.9%, while achieving up to 29.3% improvement in performance on WebShop and BabyAI. It also generalizes to math benchmarks and retains efficiency gains with only 25% of training data.

Conclusion: DEPO effectively addresses LLM agent efficiency through dual-efficiency optimization, significantly reducing computational costs while maintaining or improving task performance across multiple domains.

Abstract: Recent advances in large language models (LLMs) have greatly improved their reasoning and decision-making abilities when deployed as agents. Richer reasoning, however, often comes at the cost of longer chain of thought (CoT), hampering interaction efficiency in real-world scenarios. Nevertheless, a systematic definition of LLM agent efficiency is still lacking, hindering targeted improvements. To this end, we introduce dual-efficiency, comprising (i) step-level efficiency, which minimizes tokens per step, and (ii) trajectory-level efficiency, which minimizes the number of steps to complete a task. Building on this definition, we propose DEPO, a dual-efficiency preference optimization method that jointly rewards succinct responses and fewer action steps. Experiments on WebShop and BabyAI show that DEPO cuts token usage by up to 60.9% and steps by up to 26.9%, while achieving up to a 29.3% improvement in performance. DEPO also generalizes to three out-of-domain math benchmarks and retains its efficiency gains when trained on only 25% of the data. Our project page is at https://opencausalab.github.io/DEPO.

[25] Building Robust and Scalable Multilingual ASR for Indian Languages

Arjun Gangwar, Kaousheik Jayakumar, S. Umesh

Main category: cs.CL

TL;DR: SPRING Lab developed multilingual ASR systems for ASRU MADASR 2.0 challenge using Multi-Decoder architecture with phonemic Common Label Set, achieving top performance in language/dialect ID and beating baseline in 3 languages.

DetailsMotivation: To adapt ASR systems for improved language and dialect identification across 8 languages and 33 dialects, participating in Track 1 and Track 2 with data restrictions.

Method: Novel training approach using Multi-Decoder architecture with phonemic Common Label Set (CLS) as intermediate representation, with techniques to retain gains when converting back to grapheme representations.

Result: Beat baseline in 3 languages (Track 2) in WER/CER, achieved highest language ID and dialect ID accuracy among all participating teams in Track 2.

Conclusion: The Multi-Decoder architecture with phonemic CLS effectively improves multilingual ASR performance and language/dialect identification capabilities.

Abstract: This paper describes the systems developed by SPRING Lab, Indian Institute of Technology Madras, for the ASRU MADASR 2.0 challenge. The systems focus on adapting ASR systems to better predict the language and dialect of an utterance among 8 languages across 33 dialects. We participated in Track 1 and Track 2, which restrict the use of additional data and require developing multilingual systems from scratch. We present a novel training approach using a Multi-Decoder architecture with a phonemic Common Label Set (CLS) as an intermediate representation. It improves performance over the baseline (in the CLS space). We also discuss various methods used to retain the gain obtained in the phonemic space when converting back to the corresponding grapheme representations. Our systems beat the baseline in 3 languages (Track 2) in terms of WER/CER and achieved the highest language ID and dialect ID accuracy among all participating teams (Track 2).

[26] LLM-MemCluster: Empowering Large Language Models with Dynamic Memory for Text Clustering

Yuanjie Zhu, Liangwei Yang, Ke Xu, Weizhi Zhang, Zihe Song, Jindong Wang, Philip S. Yu

Main category: cs.CL

TL;DR: LLM-MemCluster is a novel framework that enables fully LLM-native text clustering using dynamic memory for state awareness and dual-prompt strategy for cluster number determination, achieving superior performance without tuning.

DetailsMotivation: Current LLM-based clustering methods are limited by lack of stateful memory for iterative refinement and difficulty managing cluster granularity, forcing reliance on complex external modules rather than true end-to-end approaches.

Method: Uses Dynamic Memory to provide state awareness and Dual-Prompt Strategy that enables the model to reason about and determine the optimal number of clusters, creating a fully LLM-native clustering framework.

Result: Significantly and consistently outperforms strong baselines on several benchmark datasets while being tuning-free.

Conclusion: LLM-MemCluster presents an effective, interpretable, and truly end-to-end paradigm for LLM-based text clustering that overcomes fundamental limitations of existing approaches.

Abstract: Large Language Models (LLMs) are reshaping unsupervised learning by offering an unprecedented ability to perform text clustering based on their deep semantic understanding. However, their direct application is fundamentally limited by a lack of stateful memory for iterative refinement and the difficulty of managing cluster granularity. As a result, existing methods often rely on complex pipelines with external modules, sacrificing a truly end-to-end approach. We introduce LLM-MemCluster, a novel framework that reconceptualizes clustering as a fully LLM-native task. It leverages a Dynamic Memory to instill state awareness and a Dual-Prompt Strategy to enable the model to reason about and determine the number of clusters. Evaluated on several benchmark datasets, our tuning-free framework significantly and consistently outperforms strong baselines. LLM-MemCluster presents an effective, interpretable, and truly end-to-end paradigm for LLM-based text clustering.

[27] Standardising the NLP Workflow: A Framework for Reproducible Linguistic Analysis

Yves Pauli, Jan-Bernard Marsman, Finn Rabe, Victoria Edkins, Roya Hüppi, Silvia Ciampelli, Akhil Ratan Misra, Nils Lang, Wolfram Hinzen, Iris Sommer, Philipp Homan

Main category: cs.CL

TL;DR: Proposes LPDS data structure and pelican nlp Python package to standardize and streamline language processing workflows, enhancing reproducibility in linguistic research.

DetailsMotivation: Address challenges in AI-based language processing including lack of standardization in data organization/sharing and absence of reproducible processing methodologies.

Method: Introduces LPDS (Language Processing Data Structure) inspired by BIDS for folder/file organization, and pelican nlp - a modular Python package for end-to-end language processing with shareable configuration files.

Result: Provides standardized data structure and processing pipeline enabling reproducible output of preprocessed data, linguistic/acoustic features, and result aggregations.

Conclusion: LPDS and pelican nlp collectively offer an end-to-end solution for transparent and reproducible linguistic data processing.

Abstract: The introduction of large language models and other influential developments in AI-based language processing have led to an evolution in the methods available to quantitatively analyse language data. With the resultant growth of attention on language processing, significant challenges have emerged, including the lack of standardisation in organising and sharing linguistic data and the absence of standardised and reproducible processing methodologies. Striving for future standardisation, we first propose the Language Processing Data Structure (LPDS), a data structure inspired by the Brain Imaging Data Structure (BIDS), a widely adopted standard for handling neuroscience data. It provides a folder structure and file naming conventions for linguistic research. Second, we introduce pelican nlp, a modular and extensible Python package designed to enable streamlined language processing, from initial data cleaning and task-specific preprocessing to the extraction of sophisticated linguistic and acoustic features, such as semantic embeddings and prosodic metrics. The entire processing workflow can be specified within a single, shareable configuration file, which pelican nlp then executes on LPDS-formatted data. Depending on the specifications, the reproducible output can consist of preprocessed language data or standardised extraction of both linguistic and acoustic features and corresponding result aggregations. LPDS and pelican nlp collectively offer an end-to-end processing pipeline for linguistic data, designed to ensure methodological transparency and enhance reproducibility.

[28] Multimodal Evaluation of Russian-language Architectures

Artem Chervyakov, Ulyana Isaeva, Anton Emelyanov, Artem Safin, Maria Tikhonova, Alexander Kharitonov, Yulia Lyakh, Petr Surovtsev, Denis Shevelev, Vildan Saburov, Vasily Konovalov, Elisei Rykov, Ivan Sviridov, Amina Miftakhova, Ilseyar Alimova, Alexander Panchenko, Alexander Kapitanov, Alena Fenogenova

Main category: cs.CL

TL;DR: Mera Multi is a new multimodal evaluation framework for Russian-language MLLMs, featuring 18 tasks across text, image, audio, and video modalities with culturally specific Russian datasets.

Motivation: Address the lack of multimodal benchmarks for the Russian language and understand the intelligence, limitations, and risks of MLLMs in this context.

Method: Created 18 evaluation tasks from scratch with Russian cultural specificity, unified prompts and metrics, plus watermarking and licensing for benchmark leakage prevention.

Result: Established baseline results for both closed-source and open-source models using the new Russian multimodal benchmark.

Conclusion: Provides a replicable methodology for creating multimodal benchmarks in diverse languages, particularly Slavic languages, with current focus on Russian.

Abstract: Multimodal large language models (MLLMs) are currently at the center of research attention, showing rapid progress in scale and capabilities, yet their intelligence, limitations, and risks remain insufficiently understood. To address these issues, particularly in the context of the Russian language, where no multimodal benchmarks currently exist, we introduce Mera Multi, an open multimodal evaluation framework for Russian-spoken architectures. The benchmark is instruction-based and encompasses default text, image, audio, and video modalities, comprising 18 newly constructed evaluation tasks for both general-purpose models and modality-specific architectures (image-to-text, video-to-text, and audio-to-text). Our contributions include: (i) a universal taxonomy of multimodal abilities; (ii) 18 datasets created entirely from scratch with attention to Russian cultural and linguistic specificity, unified prompts, and metrics; (iii) baseline results for both closed-source and open-source models; (iv) a methodology for preventing benchmark leakage, including watermarking and licenses for private sets. While our current focus is on Russian, the proposed benchmark provides a replicable methodology for constructing multimodal benchmarks in typologically diverse languages, particularly within the Slavic language family.

[29] HSKBenchmark: Modeling and Benchmarking Chinese Second Language Acquisition in Large Language Models through Curriculum Tuning

Qihao Yang, Xuelin Wang, Jiale Chen, Xuelian Dong, Yuxin Hao, Tianyong Hao

Main category: cs.CL

TL;DR: HSKBenchmark is the first benchmark for staged modeling and writing assessment of LLMs in Chinese second language acquisition, covering HSK levels 3-6 with authentic textbooks, synthetic instructions, and a linguistically grounded evaluation system.

Motivation: Language acquisition is crucial for understanding human language intelligence and improving LLM interpretability, but ethical and practical constraints limit human experiments. LLMs offer a controllable alternative, but lack systematic benchmarks for phase-wise modeling in Chinese SLA.

Method: Created HSKBenchmark with authentic textbooks (6.76M tokens), 16K synthetic instruction samples, 30 test topics, and a curriculum-tuning framework that trains models from beginner to advanced levels. Built HSKAgent fine-tuned on 10K learner compositions.

Result: Fine-tuned LLMs achieve writing performance comparable to advanced human learners and exhibit human-like acquisition characteristics. The benchmark effectively models Chinese SLA and serves as a reliable tool for dynamic writing assessment.

Conclusion: HSKBenchmark, HSKAgent, and checkpoints provide foundational resources that can advance research on language acquisition modeling and LLM interpretability, with publicly available code and data.

Abstract: Language acquisition is vital to revealing the nature of human language intelligence and has recently emerged as a promising perspective for improving the interpretability of large language models (LLMs). However, it is ethically and practically infeasible to conduct experiments that require controlling human learners’ language inputs. This poses challenges for the verifiability and scalability of language acquisition modeling, particularly in Chinese second language acquisition (SLA). While LLMs provide a controllable and reproducible alternative, a systematic benchmark to support phase-wise modeling and assessment is still lacking. In this paper, we present HSKBenchmark, the first benchmark for staged modeling and writing assessment of LLMs in Chinese SLA. It covers HSK levels 3 to 6 and includes authentic textbooks with 6.76 million tokens, 16K synthetic instruction samples, 30 test topics, and a linguistically grounded evaluation system. To simulate human learning trajectories, we introduce a curriculum-tuning framework that trains models from beginner to advanced levels. An evaluation system is created to examine level-based grammar coverage, writing errors, lexical and syntactic complexity, and holistic scoring. We also build HSKAgent, fine-tuned on 10K learner compositions. Extensive experimental results demonstrate that HSKBenchmark not only models Chinese SLA effectively, but also serves as a reliable benchmark for dynamic writing assessment in LLMs. Our fine-tuned LLMs have writing performance on par with advanced human learners and exhibit human-like acquisition characteristics. The HSKBenchmark, HSKAgent, and checkpoints serve as foundational tools and resources, with the potential to pave the way for future research on language acquisition modeling and LLM interpretability. Code and data are publicly available at: https://github.com/CharlesYang030/HSKB.

[30] Tokenisation over Bounded Alphabets is Hard

Violeta Kastreva, Philip Whittington, Dennis Komm, Tiago Pimentel

Main category: cs.CL

TL;DR: Tokenisation is NP-complete even with bounded alphabets (binary and unary), proving computational intractability is fundamental and not due to large alphabets.

Motivation: Previous works assumed unbounded alphabets, but practical tokenisers use fixed-size alphabets like bytes or Unicode characters. This research closes the gap by analyzing tokenisation over bounded alphabets.

Method: Analyzed two tokenisation variants (bottom-up and direct) over bounded n-ary alphabets, proving hardness results for binary and unary alphabets using computational complexity theory.

Result: Both tokenisation variants are NP-complete even with binary alphabets and admit no polynomial-time approximation scheme. Direct tokenisation remains NP-complete even with unary alphabets.

Conclusion: The computational intractability of tokenisation is fundamental, explaining why practical algorithms are heuristic and pointing toward approximation algorithms as an important direction for future research.

Abstract: Recent works have shown that tokenisation is NP-complete. However, these works assume tokenisation is applied to inputs with unboundedly large alphabets – an unrealistic assumption, given that in practice tokenisers operate over fixed-size alphabets, such as bytes or Unicode characters. We close this gap by analysing tokenisation over bounded $n$-ary alphabets, considering two natural variants: bottom-up tokenisation and direct tokenisation, where we must, respectively, select a sequence of merge operations or a vocabulary whose application optimally compresses a dataset. First, we note that proving hardness results for an $n$-ary alphabet proves the same results for alphabets of any larger size. We then prove that even with binary alphabets, both variants are not only NP-complete, but admit no polynomial-time approximation scheme (unless P=NP). We further show that direct tokenisation remains NP-complete even when applied to unary alphabets. While unary alphabets may not be practically useful, this result establishes that the computational intractability of tokenisation is not an artifact of large alphabets or complex constructions, but a fundamental barrier. Overall, our results explain why practical algorithms such as BPE and UnigramLM are heuristic, and point toward approximation algorithms being an important path going forward for tokenisation research.
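To make the direct-tokenisation objective concrete, the following minimal sketch brute-forces it on a tiny binary dataset. It rests on simplifying assumptions that are not taken from the paper's formal definitions: the base alphabet symbols are always available, the vocabulary adds `k` extra tokens drawn from substrings of the data, and compression is measured as the total token count under an optimal segmentation. All function names are hypothetical.

```python
from itertools import combinations

def min_tokens(text, vocab):
    """Fewest tokens needed to segment `text` with `vocab` (dynamic programming)."""
    inf = float("inf")
    best = [0] + [inf] * len(text)
    for end in range(1, len(text) + 1):
        for tok in vocab:
            start = end - len(tok)
            if start >= 0 and text[start:end] == tok and best[start] + 1 < best[end]:
                best[end] = best[start] + 1
    return best[len(text)]

def candidate_tokens(dataset):
    """All substrings of length >= 2 that occur in the dataset."""
    subs = set()
    for text in dataset:
        for i in range(len(text)):
            for j in range(i + 2, len(text) + 1):
                subs.add(text[i:j])
    return sorted(subs)

def best_direct_vocab(dataset, alphabet, k):
    """Exhaustively choose k extra tokens that minimise the total token count."""
    best_cost, best_extra = None, None
    for extra in combinations(candidate_tokens(dataset), k):
        vocab = set(alphabet) | set(extra)
        cost = sum(min_tokens(text, vocab) for text in dataset)
        if best_cost is None or cost < best_cost:
            best_cost, best_extra = cost, extra
    return best_cost, best_extra

# Toy binary dataset (hypothetical): two copies of the same string.
dataset = ["0101", "0101"]
cost, extra = best_direct_vocab(dataset, alphabet="01", k=1)
```

In this sketch the segmentation step for a fixed vocabulary is a cheap dynamic program, while the outer search enumerates every candidate vocabulary. That enumeration grows combinatorially with the number of candidates, which is consistent with (though in no way a proof of) the hardness results summarised above.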

[31] MessIRve: A Large-Scale Spanish Information Retrieval Dataset

Francisco Valentini, Viviana Cotik, Damián Furman, Ivan Bercovich, Edgar Altszyler, Juan Manuel Pérez

Main category: cs.CL

TL;DR: MessIRve is a large-scale Spanish information retrieval dataset with 700K queries from Google autocomplete and Wikipedia documents, addressing the lack of Spanish IR resources.

Motivation: Spanish is the second most spoken native language but has few IR datasets, limiting development of information access tools for Spanish speakers.

Method: Created dataset using Google’s autocomplete API for queries and Wikipedia for documents, with queries reflecting diverse Spanish-speaking regions and dialectal variations.

Result: Produced the MessIRve dataset with almost 700,000 queries covering a wide variety of topics, along with a comprehensive dataset description and baseline evaluations of IR models.

Conclusion: MessIRve advances Spanish IR research and improves information access for Spanish speakers by providing a large-scale, regionally diverse dataset.

Abstract: Information retrieval (IR) is the task of finding relevant documents in response to a user query. Although Spanish is the second most spoken native language, there are few Spanish IR datasets, which limits the development of information access tools for Spanish speakers. We introduce MessIRve, a large-scale Spanish IR dataset with almost 700,000 queries from Google’s autocomplete API and relevant documents sourced from Wikipedia. MessIRve’s queries reflect diverse Spanish-speaking regions, unlike other datasets that are translated from English or do not consider dialectal variations. The large size of the dataset allows it to cover a wide variety of topics, unlike smaller datasets. We provide a comprehensive description of the dataset, comparisons with existing datasets, and baseline evaluations of prominent IR models. Our contributions aim to advance Spanish IR research and improve information access for Spanish speakers.

[32] Tomato, Tomahto, Tomate: Do Multilingual Language Models Understand Based on Subword-Level Semantic Concepts?

Crystina Zhang, Jing Lu, Vinh Q. Tran, Tal Schuster, Donald Metzler, Jimmy Lin

Main category: cs.CL

TL;DR: The paper investigates whether multilingual language models understand text based on subword-level semantic concepts by merging semantically similar subwords and evaluating performance on downstream tasks.

Motivation: To determine if human-like semantic understanding of words transfers to language models, specifically examining subword-level semantic concepts in multilingual models.

Method: Form ‘semantic tokens’ by merging semantically similar subwords and their embeddings, then evaluate updated multilingual language models on five heterogeneous multilingual downstream tasks.

Result: Models with semantic tokens perform well across different tokenizers and model sizes, showing semantic similarities including synonyms and translations across languages. Zero-shot results are on par or better than original models on certain classification tasks.

Conclusion: Shared subword-level semantics may serve as anchors for cross-lingual transfer in multilingual language models.

Abstract: Human understanding of text depends on general semantic concepts of words rather than their superficial forms. To what extent does our human intuition transfer to language models? In this work, we study the degree to which current multilingual language models (mLMs) understand based on subword-level semantic concepts. To this end, we form “semantic tokens” by merging the semantically similar subwords and their embeddings, and evaluate the updated mLMs on five heterogeneous multilingual downstream tasks. Results show that the general shared semantics could get the models a long way in making the predictions on mLMs with different tokenizers and model sizes. Inspections of the grouped subwords show that they exhibit a wide range of semantic similarities, including synonyms and translations across many languages and scripts. Lastly, we find that the zero-shot results with semantic tokens are on par with or even better than the original models on certain classification tasks, suggesting that the shared subword-level semantics may serve as the anchors for cross-lingual transfer.
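The idea of forming "semantic tokens" by merging similar subwords and their embeddings can be sketched as follows. The abstract does not specify the grouping procedure, so this example assumes a simple greedy rule (group every not-yet-assigned subword whose cosine similarity to a seed subword exceeds a threshold, then average the group's embeddings). The function names, the threshold, and the toy embeddings are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def merge_semantic_tokens(vocab, emb, threshold=0.9):
    """Greedily group similar subwords; each group shares one mean embedding."""
    group_of = {}   # subword -> group id
    groups = []     # group id -> list of member indices
    for i, tok in enumerate(vocab):
        if tok in group_of:
            continue
        gid = len(groups)
        members = [i]
        group_of[tok] = gid
        for j in range(i + 1, len(vocab)):
            if vocab[j] not in group_of and cosine(emb[i], emb[j]) >= threshold:
                group_of[vocab[j]] = gid
                members.append(j)
        groups.append(members)
    dim = len(emb[0])
    merged = [
        [sum(emb[m][d] for m in members) / len(members) for d in range(dim)]
        for members in groups
    ]
    return group_of, merged

# Toy 2-d embeddings (hypothetical values, not from any real model).
vocab = ["tomato", "tomate", "car"]
emb = [[1.0, 0.0], [0.98, 0.05], [0.0, 1.0]]
group_of, merged = merge_semantic_tokens(vocab, emb)
```

Here "tomato" and "tomate" end up in one group with a shared averaged embedding, while "car" keeps its own; a model using the merged table would look up `merged[group_of[token]]` instead of the original per-subword embedding.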

[33] Newswire Extraction: A pipeline for extracting newswires from newspaper images

Michael McRae

Main category: cs.CL

TL;DR: New pipeline for extracting wire services from newspaper images.

Motivation: Need to identify and extract wire service content (Associated Press, UPI, NEA) from digitized newspaper images.

Method: Developed a new pipeline specifically designed for extracting wire services from newspaper image data.

Result: Created a working pipeline capable of identifying and extracting wire service content from newspaper images.

Conclusion: The proposed pipeline successfully enables extraction of wire services from newspaper images, facilitating content analysis.

Abstract: I describe a new pipeline for extracting wire services (e.g., Associated Press, United Press International, Newspaper Enterprise Association) from newspaper images.

[34] Socrates or Smartypants: Testing Logic Reasoning Capabilities of Large Language Models with Logic Programming-based Test Oracles

Zihao Xu, Junchen Ding, Yiling Lou, Kun Zhang, Dong Gong, Yuekang Li

Main category: cs.CL

TL;DR: SmartyPat-Bench is a challenging benchmark for evaluating LLM logical reasoning using real-world Reddit posts with logical fallacies, and SmartyPat is an automated framework that generates high-quality fallacious statements using Prolog rules and LLM refinement.

Motivation: Existing datasets for evaluating LLM logical reasoning are limited to simplistic, unnatural, or contextually constrained examples, creating a need for more realistic and diverse benchmarks.

Method: Created SmartyPat-Bench from real Reddit posts with detailed fallacy annotations, and developed SmartyPat framework using Prolog rules to generate fallacies and LLMs to refine them into natural language.

Result: SmartyPat produces fallacies comparable to human-generated content and outperforms baseline methods. Experiments show excessive reasoning steps hinder fallacy detection but structured reasoning improves categorization.

Conclusion: The approach provides nuanced insights into LLM capabilities and offers a scalable solution for fallacy generation and evaluation, addressing limitations of manual data collection.

Abstract: Large Language Models (LLMs) have achieved significant progress in language understanding and reasoning. Evaluating and analyzing their logical reasoning abilities has therefore become essential. However, existing datasets and benchmarks are often limited to overly simplistic, unnatural, or contextually constrained examples. In response to the growing demand, we introduce SmartyPat-Bench, a challenging, naturally expressed, and systematically labeled benchmark derived from real-world high-quality Reddit posts containing subtle logical fallacies. Unlike existing datasets and benchmarks, it provides more detailed annotations of logical fallacies and features more diverse data. To further scale up the study and address the limitations of manual data collection and labeling - such as fallacy-type imbalance and labor-intensive annotation - we introduce SmartyPat, an automated framework powered by logic programming-based oracles. SmartyPat utilizes Prolog rules to systematically generate logically fallacious statements, which are then refined into fluent natural-language sentences by LLMs, ensuring precise fallacy representation. Extensive evaluation demonstrates that SmartyPat produces fallacies comparable in subtlety and quality to human-generated content and significantly outperforms baseline methods. Finally, experiments reveal nuanced insights into LLM capabilities, highlighting that while excessive reasoning steps hinder fallacy detection accuracy, structured reasoning enhances fallacy categorization performance.

[35] A Typology of Synthetic Datasets for Dialogue Processing in Clinical Contexts

Steven Bedrick, A. Seza Doğruöz, Sergiu Nisioi

Main category: cs.CL

TL;DR: Overview of synthetic clinical dialogue datasets: creation, evaluation, usage, and a new typology for classifying data synthesis types and degrees.

Motivation: Synthetic datasets are widely used in clinical contexts due to privacy and data governance challenges, but there's limited theory on how to best use and generalize them.

Method: Provide an overview of synthetic dataset creation and evaluation methods, and propose a novel typology for classifying types and degrees of data synthesis.

Result: The paper presents a comprehensive analysis of synthetic clinical dialogue datasets and introduces a classification framework to facilitate comparison and evaluation.

Conclusion: A systematic typology is needed to better understand, compare, and evaluate synthetic clinical dialogue datasets for improved generalization and application.

Abstract: Synthetic data sets are used across linguistic domains and NLP tasks, particularly in scenarios where authentic data is limited (or even non-existent). One such domain is that of clinical (healthcare) contexts, where there exist significant and long-standing challenges (e.g., privacy, anonymization, and data governance) which have led to the development of an increasing number of synthetic datasets. One increasingly important category of clinical dataset is that of clinical dialogues, which are especially sensitive and difficult to collect, and as such are commonly synthesized. While such synthetic datasets have been shown to be sufficient in some situations, little theory exists to inform how they may be best used and generalized to new applications. In this paper, we provide an overview of how synthetic datasets are created, evaluated, and used for dialogue-related tasks in the medical domain. Additionally, we propose a novel typology for use in classifying types and degrees of data synthesis, to facilitate comparison and evaluation.

[36] Breaking Language Barriers or Reinforcing Bias? A Study of Gender and Racial Disparities in Multilingual Contrastive Vision Language Models

Zahraa Al Sahili, Ioannis Patras, Matthew Purver

Main category: cs.CL

TL;DR: Multilingual CLIP models exhibit amplified gender and race biases compared to English-only baselines, with bias patterns varying by language resource level and gender marking systems.

Motivation: To systematically audit social biases in multilingual vision-language models, which promise universal image-text retrieval but have underexplored bias implications across different languages.

Method: Zero-shot evaluation of four multilingual CLIP variants (M-CLIP, NLLB-CLIP, CAPIVARA-CLIP, SigLIP-2) using balanced subsets of FairFace and PATA stereotype suite across ten languages with varying resource availability and morphological gender marking.

Result: All models showed stronger gender bias than English baselines; CAPIVARA-CLIP had largest biases in low-resource languages it targets; shared encoders transferred English stereotypes to gender-neutral languages; SigLIP-2 reduced some biases but amplified crime associations in caption-sparse contexts; gendered languages consistently magnified all bias types.

Conclusion: Aggregated metrics mask language-specific bias hotspots, highlighting the need for fine-grained, language-aware bias evaluation in multilingual VLM research, as multilinguality doesn’t inherently mitigate bias and can amplify it in specific linguistic contexts.

Abstract: Multilingual vision-language models (VLMs) promise universal image-text retrieval, yet their social biases remain underexplored. We perform the first systematic audit of four public multilingual CLIP variants: M-CLIP, NLLB-CLIP, CAPIVARA-CLIP, and the debiased SigLIP-2, covering ten languages that differ in resource availability and morphological gender marking. Using balanced subsets of FairFace and the PATA stereotype suite in a zero-shot setting, we quantify race and gender bias and measure stereotype amplification. Contrary to the intuition that multilinguality mitigates bias, every model exhibits stronger gender skew than its English-only baseline. CAPIVARA-CLIP shows its largest biases precisely in the low-resource languages it targets, while the shared encoder of NLLB-CLIP and SigLIP-2 transfers English gender stereotypes into gender-neutral languages; loosely coupled encoders largely avoid this leakage. Although SigLIP-2 reduces agency and communion skews, it inherits – and in caption-sparse contexts (e.g., Xhosa) amplifies – the English anchor’s crime associations. Highly gendered languages consistently magnify all bias types, yet gender-neutral languages remain vulnerable whenever cross-lingual weight sharing imports foreign stereotypes. Aggregated metrics thus mask language-specific hot spots, underscoring the need for fine-grained, language-aware bias evaluation in future multilingual VLM research.

[37] RAT: Bridging RNN Efficiency and Attention Accuracy via Chunk-based Sequence Modeling

Xiuying Wei, Anunay Yadav, Razvan Pascanu, Caglar Gulcehre

Main category: cs.CL

TL;DR: RAT is a hybrid architecture that combines recurrence within chunks and attention across chunks to bridge the efficiency of RNNs with the capacity of attention, achieving significant speed improvements while maintaining performance.

Motivation: Transformers face computational bottlenecks from softmax attention, while recurrent models suffer from memory degradation in long contexts and limited fine-grained retrieval.

Method: Partitions input into chunks, applies recurrence within chunks for local dependencies and softmax-based attention across chunks for long-range interactions. Also proposes a hybrid architecture interleaving RAT with local attention.

Result: 7× training speed improvement for 100K sequence length and 9× generation speed at 4K position while maintaining similar performance. Hybrid design further improves inference speed, reduces cache memory, and enhances performance.

Conclusion: RAT successfully bridges RNN efficiency and attention capacity, offering a practical solution for efficient long-range modeling with strong local interactions.

Abstract: Transformers have become the cornerstone of modern large-scale language models, but their reliance on softmax attention poses a computational bottleneck at both training and inference. Recurrent models offer high efficiency, but compressing the full sequence into a fixed-size and holistic representation can suffer from memory degradation in long contexts and limit fine-grained retrieval. To address this, we propose RAT, an intermediate design that bridges the efficiency of RNNs and capacity of attention. RAT partitions the input into chunks, applies recurrence within each chunk for local dependencies, and softmax-based attention across chunks for long-range interactions. This design mitigates memory degradation and enables direct access to distant tokens, while retaining computational efficiency. Empirically, with a chunk size of 16, the RAT block achieves a 7$\times$ improvement in training speed for 100K sequence length and 9$\times$ in generation at the 4K position, while maintaining similar performance compared to standard attention. We demonstrate this by training 1.3B parameter models from scratch and performing large-scale evaluations, including short- and long-context benchmarks, as well as supervised fine-tuning (SFT). We further propose a hybrid architecture that interleaves RAT with local attention. By combining efficient long-range modeling with strong local interactions, this hybrid design not only improves inference speed and reduces cache memory usage, but also consistently enhances performance and shows the overall best results. Code is available at https://github.com/CLAIRE-Labo/RAT.
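The chunking idea described above (recurrence inside each chunk, softmax attention across chunks) can be illustrated with a toy sketch. This is not the RAT block from the paper: it assumes a fixed scalar gate instead of learned gates, uses the raw token vector as the query with no learned projections, and summarises each finished chunk by its final recurrent state. The function name and parameters are hypothetical.

```python
import math

def rat_sketch(xs, chunk_size, gate=0.5):
    """Toy chunked recurrence plus cross-chunk softmax attention over vectors."""
    dim = len(xs[0])
    outputs, chunk_finals, state = [], [], None
    for t, x in enumerate(xs):
        if t % chunk_size == 0:
            if state is not None:
                chunk_finals.append(state)  # summary of the chunk just finished
            state = list(x)                 # recurrence restarts at a chunk boundary
        else:
            # Simple gated recurrence within the chunk (assumed form).
            state = [gate * s + (1.0 - gate) * xi for s, xi in zip(state, x)]
        # Attend from the current token over past chunk summaries
        # and the running state of the current chunk.
        memory = chunk_finals + [state]
        scores = [sum(q * k for q, k in zip(x, m)) / math.sqrt(dim) for m in memory]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append(
            [sum(w * m[d] for w, m in zip(weights, memory)) / total for d in range(dim)]
        )
    return outputs

# Six identical toy tokens (hypothetical data), processed in chunks of two.
tokens = [[1.0, 0.0] for _ in range(6)]
outputs = rat_sketch(tokens, chunk_size=2)
```

In this sketch a token at position t attends over roughly t / chunk_size + 1 entries rather than t + 1, which is the intuition behind the efficiency gain: the attention span shrinks by about the chunk size while distant chunks remain directly addressable.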

[38] Investigating Hallucination in Conversations for Low Resource Languages

Amit Das, Md. Najib Hasan, Souvika Sarkar, Zheng Zhang, Fatemeh Jamshidi, Tathagata Bhattacharya, Nilanjana Raychawdhury, Dongji Feng, Vinija Jain, Aman Chadha

Main category: cs.CL

TL;DR: Analysis of LLM hallucinations across Hindi, Farsi, and Mandarin shows significant variation, with few hallucinations in Mandarin but many in Hindi and Farsi.

Motivation: Addressing LLM hallucination is crucial for reliability, but most research focuses on English; this study extends investigation to Hindi, Farsi, and Mandarin conversational data.

Method: Comprehensive analysis of a dataset, examining factual and linguistic errors in three languages for GPT-3.5, GPT-4o, Llama-3.1, Gemma-2.0, DeepSeek-R1 and Qwen-3.

Result: LLMs produce very few hallucinated responses in Mandarin but generate a significantly higher number of hallucinations in Hindi and Farsi.

Conclusion: Hallucination patterns vary significantly across languages, with Mandarin showing much better performance than Hindi and Farsi in LLM responses.

Abstract: Large Language Models (LLMs) have demonstrated remarkable proficiency in generating text that closely resembles human writing. However, they often generate factually incorrect statements, a problem typically referred to as ‘hallucination’. Addressing hallucination is crucial for enhancing the reliability and effectiveness of LLMs. While much research has focused on hallucinations in English, our study extends this investigation to conversational data in three languages: Hindi, Farsi, and Mandarin. We offer a comprehensive analysis of a dataset to examine both factual and linguistic errors in these languages for GPT-3.5, GPT-4o, Llama-3.1, Gemma-2.0, DeepSeek-R1 and Qwen-3. We found that LLMs produce very few hallucinated responses in Mandarin but generate a significantly higher number of hallucinations in Hindi and Farsi.

[39] ReFactX: Scalable Reasoning with Reliable Facts via Constrained Generation

Riccardo Pozzi, Matteo Palmonari, Andrea Coletta, Luigi Bellomarini, Jens Lehmann, Sahar Vahdati

Main category: cs.CL

TL;DR: ReFactX enables LLMs to access external knowledge without retrievers or auxiliary models using constrained generation with a prefix-tree index of verbalized KG triples.

Motivation: Address knowledge gaps and hallucinations in LLMs by providing external knowledge access without complex retrieval pipelines, additional models, or high token processing overhead.

Method: Use constrained generation with a pre-built prefix-tree index where KG triples are verbalized as textual facts, tokenized, and indexed for efficient access during inference.

Result: Scales to large knowledge bases (800M facts), adapts to domain-specific data, achieves effective QA results with minimal generation-time overhead.

Conclusion: ReFactX provides a scalable and efficient approach for LLMs to access external knowledge without dependency on retrievers or auxiliary models.

Abstract: Knowledge gaps and hallucinations are persistent challenges for Large Language Models (LLMs), which generate unreliable responses when lacking the necessary information to fulfill user instructions. Existing approaches, such as Retrieval-Augmented Generation (RAG) and tool use, aim to address these issues by incorporating external knowledge. Yet, they rely on additional models or services, resulting in complex pipelines, potential error propagation, and often requiring the model to process a large number of tokens. In this paper, we present a scalable method that enables LLMs to access external knowledge without depending on retrievers or auxiliary models. Our approach uses constrained generation with a pre-built prefix-tree index. Triples from a Knowledge Graph are verbalized as textual facts, tokenized, and indexed in a prefix tree for efficient access. During inference, to acquire external knowledge, the LLM generates facts with constrained generation, which allows only sequences of tokens that form an existing fact. We evaluate our proposal on Question Answering and show that it scales to large knowledge bases (800 million facts), adapts to domain-specific data, and achieves effective results. These gains come with minimal generation-time overhead. ReFactX code is available at https://github.com/rpo19/ReFactX.
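Constrained generation over a prefix tree, as described above, can be sketched in a few lines. This is an illustrative toy, not the ReFactX implementation: whitespace splitting stands in for the LLM tokenizer, a hand-written `toy_score` function stands in for the model's next-token scores, and the nested-dict trie, the `END` sentinel, and all function names are hypothetical.

```python
END = "<end-of-fact>"  # hypothetical sentinel marking a complete fact

def build_prefix_tree(facts):
    """Index token sequences in a nested-dict trie."""
    root = {}
    for tokens in facts:
        node = root
        for tok in tokens:
            node = node.setdefault(tok, {})
        node[END] = {}
    return root

def allowed_next(tree, prefix):
    """Tokens that keep `prefix` on a path toward some indexed fact."""
    node = tree
    for tok in prefix:
        if tok not in node:
            return set()
        node = node[tok]
    return set(node.keys())

def constrained_greedy(tree, score):
    """Greedy decoding restricted to allowed tokens (assumes a non-empty tree)."""
    prefix = []
    while True:
        options = allowed_next(tree, prefix)
        choice = max(sorted(options), key=lambda tok: score(prefix, tok))
        if choice == END:
            return prefix
        prefix.append(choice)

# Verbalized facts (toy examples), tokenized by whitespace as a stand-in.
facts = [
    "Paris is the capital of France".split(),
    "Paris is located in Europe".split(),
    "Berlin is the capital of Germany".split(),
]
tree = build_prefix_tree(facts)

preferred = {"Berlin", "Germany"}

def toy_score(prefix, tok):
    """Stand-in for model scores: favours a fixed set of tokens."""
    return 1.0 if tok in preferred else 0.0

decoded = constrained_greedy(tree, toy_score)
```

Because every step only chooses among children of the current trie node, the decoded sequence is guaranteed to be one of the indexed facts, whatever the scoring function prefers. The lookup cost per step depends on the prefix length, not on the number of facts in the index.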

[40] Towards Alignment-Centric Paradigm: A Survey of Instruction Tuning in Large Language Models

Xudong Han, Junjie Yang, Tianyang Wang, Ziqian Bi, Xinyuan Song, Junfeng Hao, Junhao Song

Main category: cs.CL

TL;DR: This survey provides a comprehensive overview of instruction tuning for large language models, covering data collection, fine-tuning strategies, and evaluation protocols, with focus on aligning LLMs with human intentions and domain-specific requirements.

Motivation: Instruction tuning is crucial for aligning large language models with human intentions, safety constraints, and domain-specific requirements, making models more effective and reliable.

Method: Categorizes data construction into expert annotation, distillation from larger models, and self-improvement; covers fine-tuning techniques from conventional supervised training to lightweight approaches like LoRA and prefix tuning; examines evaluation protocols for faithfulness, utility, and safety.

Result: The survey identifies distinct trade-offs between quality, scalability, and resource cost across different data collection paradigms, and highlights computational efficiency and model reusability benefits of various fine-tuning approaches.

Conclusion: A closer integration of data, algorithms, and human feedback is essential for advancing instruction-tuned LLMs, with promising directions in automated data generation, adaptive optimization, and robust evaluation frameworks.

Abstract: Instruction tuning is a pivotal technique for aligning large language models (LLMs) with human intentions, safety constraints, and domain-specific requirements. This survey provides a comprehensive overview of the full pipeline, encompassing (i) data collection methodologies, (ii) full-parameter and parameter-efficient fine-tuning strategies, and (iii) evaluation protocols. We categorize data construction into three major paradigms: expert annotation, distillation from larger models, and self-improvement mechanisms, each offering distinct trade-offs between quality, scalability, and resource cost. Fine-tuning techniques range from conventional supervised training to lightweight approaches, such as low-rank adaptation (LoRA) and prefix tuning, with a focus on computational efficiency and model reusability. We further examine the challenges of evaluating faithfulness, utility, and safety across multilingual and multimodal scenarios, highlighting the emergence of domain-specific benchmarks in healthcare, legal, and financial applications. Finally, we discuss promising directions for automated data generation, adaptive optimization, and robust evaluation frameworks, arguing that a closer integration of data, algorithms, and human feedback is essential for advancing instruction-tuned LLMs. This survey aims to serve as a practical reference for researchers and practitioners seeking to design LLMs that are both effective and reliably aligned with human intentions.
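As a concrete illustration of the low-rank adaptation (LoRA) approach mentioned above, the sketch below computes the effective weight W + (alpha / r) * B A for a frozen weight matrix W and two small trainable factors A and B. It uses plain Python lists to stay dependency-free; the matrix sizes, values, and function names are hypothetical and chosen only to show the arithmetic.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A.

    w: frozen weight, shape (d_out, d_in)
    a: trainable factor, shape (r, d_in)
    b: trainable factor, shape (d_out, r)
    """
    r = len(a)
    scale = alpha / r
    delta = matmul(b, a)
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]

# Toy example with d_out = 2, d_in = 3, rank r = 1.
w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
a = [[1.0, 2.0, 3.0]]
b = [[0.5], [0.0]]
w_eff = lora_effective_weight(w, a, b, alpha=2.0)

# With B initialised to zero, the adapted layer starts identical to the frozen one.
w_start = lora_effective_weight(w, a, [[0.0], [0.0]], alpha=2.0)

# Trainable parameters: r * (d_in + d_out) for LoRA versus d_out * d_in for full tuning.
lora_params = len(a) * (len(w[0]) + len(w))
full_params = len(w) * len(w[0])
```

The saving is negligible at this toy size (5 versus 6 parameters) but grows quickly with dimension: for a square layer of width d, full tuning updates d² values while rank-r LoRA updates 2dr.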

[41] On the Alignment of Large Language Models with Global Human Opinion

Yang Liu, Masahiro Kaneko, Chenhui Chu

Main category: cs.CL

TL;DR: This paper investigates how LLMs align with human opinions across different countries, languages, and historical periods using World Values Survey data, finding uneven alignment and that prompt language can effectively steer model opinions.

DetailsMotivation: Existing studies on LLM opinion alignment focus mainly on US demographics and lack global coverage, temporal analysis, and investigation of language influence on opinion steering.

Method: Created an evaluation framework using World Values Survey data to systematically assess LLM opinion alignment across countries, languages, and historical periods worldwide.

Result: LLMs over-align with few countries while under-aligning with most; changing prompt language to match questionnaire language effectively steers alignment; LLMs better align with contemporary opinions than historical ones.

Conclusion: First comprehensive study of LLM opinion alignment across global, language, and temporal dimensions, demonstrating language’s effectiveness in steering model opinions and revealing uneven global alignment patterns.

Abstract: Today’s large language models (LLMs) are capable of supporting multilingual scenarios, allowing users to interact with LLMs in their native languages. When LLMs respond to subjective questions posed by users, they are expected to align with the views of specific demographic groups or historical periods, shaped by the language in which the user interacts with the model. Existing studies mainly focus on researching the opinions represented by LLMs among demographic groups in the United States or a few countries, lacking worldwide country samples and studies on human opinions in different historical periods, as well as lacking discussion on using language to steer LLMs. They also overlook the potential influence of prompt language on the alignment of LLMs’ opinions. In this study, our goal is to fill these gaps. To this end, we create an evaluation framework based on the World Values Survey (WVS) to systematically assess the alignment of LLMs with human opinions across different countries, languages, and historical periods around the world. We find that LLMs align appropriately or over-align with the opinions of only a few countries while under-aligning with those of most countries. Furthermore, changing the language of the prompt to match the language used in the questionnaire can steer LLMs to align with the opinions of the corresponding country more effectively than existing steering methods. At the same time, LLMs are more aligned with the opinions of the contemporary population. To our knowledge, our study is the first comprehensive investigation of the topic of opinion alignment in LLMs across global, language, and temporal dimensions. Our code and data are publicly available at https://github.com/ku-nlp/global-opinion-alignment and https://github.com/nlply/global-opinion-alignment.

[42] In-N-Out: A Parameter-Level API Graph Dataset for Tool Agents

Seungkyu Lee, Nalim Kim, Yohan Jo

Main category: cs.CL

TL;DR: The paper introduces In-N-Out, an expert-annotated dataset of API graphs that captures API dependencies from documentation, significantly improving tool agent performance on complex multi-tool tasks.

DetailsMotivation: Tool agents struggle with complex tasks requiring proper API identification and sequencing due to lack of structured understanding of API dependencies.

Method: Convert API documentation into structured API graphs capturing dependencies, and use In-N-Out dataset to train models for better API comprehension and multi-tool query generation.

Result: Using In-N-Out nearly doubles performance on tool retrieval and multi-tool query generation compared to LLMs using documentation alone, with fine-tuned models closing 90% of the performance gap.

Conclusion: Explicit API graphs show promise for tool agents, and In-N-Out serves as a valuable resource for improving API comprehension and multi-tool task execution.

Abstract: Tool agents – LLM-based systems that interact with external APIs – offer a way to execute real-world tasks. However, as tasks become increasingly complex, these agents struggle to identify and call the correct APIs in the proper order. To tackle this problem, we investigate converting API documentation into a structured API graph that captures API dependencies and leveraging it for multi-tool queries that require compositional API calls. To support this, we introduce In-N-Out, the first expert-annotated dataset of API graphs built from two real-world API benchmarks and their documentation. Using In-N-Out significantly improves performance on both tool retrieval and multi-tool query generation, nearly doubling that of LLMs using documentation alone. Moreover, graphs generated by models fine-tuned on In-N-Out close 90% of this gap, showing that our dataset helps models learn to comprehend API documentation and parameter relationships. Our findings highlight the promise of using explicit API graphs for tool agents and the utility of In-N-Out as a valuable resource. We will release the dataset and code publicly.
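
To make the idea of a parameter-level API graph concrete, here is a minimal sketch. The API names, parameter names, and edge format are hypothetical illustrations and are not taken from In-N-Out. Each edge records that an output parameter of one API can feed an input parameter of another, and a topological sort (using Python's standard `graphlib` module, available from Python 3.9) gives one valid call order.

```python
from graphlib import TopologicalSorter

# Hypothetical edges: (source API, output parameter, target API, input parameter).
edges = [
    ("search_flights", "flight_id", "book_flight", "flight_id"),
    ("book_flight", "booking_id", "send_receipt", "booking_id"),
]

# Map each API to the set of APIs it depends on.
deps = {}
for src, _out_param, dst, _in_param in edges:
    deps.setdefault(dst, set()).add(src)
    deps.setdefault(src, set())

# static_order() yields every API after all of its dependencies.
call_order = list(TopologicalSorter(deps).static_order())
```

With these made-up edges the graph is a simple chain, so there is only one valid order. A real API graph could admit several orders, and an agent would still need to pick which APIs are relevant to the query.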

[43] Bias after Prompting: Persistent Discrimination in Large Language Models

Nivedha Sivakumar, Natalie Mackraz, Samira Khorshidi, Krishna Patel, Barry-John Theobald, Luca Zappella, Nicholas Apostoloff

Main category: cs.CL

TL;DR: Biases from pre-trained LLMs transfer to adapted models through prompting, and current debiasing methods fail to consistently prevent this transfer across models, tasks, and demographics.

DetailsMotivation: To challenge the assumption that biases don't transfer from pre-trained LLMs to adapted models, specifically studying bias transfer through prompt adaptations which are widely used in real-world applications.

Method: Studied bias transfer hypothesis in causal models under prompt adaptations, examining correlations between intrinsic biases and biases after prompt adaptation across demographics and tasks. Evaluated few-shot composition parameters and various prompt-based debiasing strategies.

Result: Found strong correlations between intrinsic biases and biases after prompt adaptation (e.g., gender: rho >= 0.94 in co-reference resolution; age: rho >= 0.98, religion: rho >= 0.69 in question answering). Biases remained strongly correlated across different few-shot parameters (rho >= 0.90). No debiasing strategy consistently reduced bias transfer.

Conclusion: Correcting bias in intrinsic models may be necessary to prevent bias propagation to downstream tasks, as biases transfer through prompting and current mitigation methods are insufficient.

Abstract: A dangerous assumption that can be made from prior work on the bias transfer hypothesis (BTH) is that biases do not transfer from pre-trained large language models (LLMs) to adapted models. We invalidate this assumption by studying the BTH in causal models under prompt adaptations, as prompting is an extremely popular and accessible adaptation strategy used in real-world applications. In contrast to prior work, we find that biases can transfer through prompting and that popular prompt-based mitigation methods do not consistently prevent biases from transferring. Specifically, the correlation between intrinsic biases and those after prompt adaptation remains moderate to strong across demographics and tasks – for example, gender (rho >= 0.94) in co-reference resolution, and age (rho >= 0.98) and religion (rho >= 0.69) in question answering. Further, we find that biases remain strongly correlated when varying few-shot composition parameters, such as sample size, stereotypical content, occupational distribution and representational balance (rho >= 0.90). We evaluate several prompt-based debiasing strategies and find that different approaches have distinct strengths, but none consistently reduce bias transfer across models, tasks or demographics. These results demonstrate that correcting bias, and potentially improving reasoning ability, in intrinsic models may prevent propagation of biases to downstream tasks.
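
As a rough illustration of the kind of correlation reported above, the sketch below computes a rank correlation between bias scores before and after prompt adaptation. It assumes that rho denotes Spearman's rank correlation, which the abstract does not state, and the bias scores are made-up numbers rather than results from the paper.

```python
from statistics import mean

def rank(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho, computed as the Pearson correlation of the rank vectors."""
    rx, ry = rank(x), rank(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Made-up bias scores for five hypothetical models (illustration only).
intrinsic_bias = [0.10, 0.25, 0.40, 0.55, 0.70]
prompted_bias = [0.12, 0.22, 0.45, 0.50, 0.80]
rho = spearman_rho(intrinsic_bias, prompted_bias)
```

A value of rho near 1 means that models ranked as more biased before prompting stay more biased afterwards, which is the pattern the bias transfer hypothesis predicts.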

[44] Foundational Automatic Evaluators: Scaling Multi-Task Generative Evaluator Training for Reasoning-Centric Domains

Austin Xu, Xuan-Phi Nguyen, Yilun Zhou, Chien-Sheng Wu, Caiming Xiong, Shafiq Joty

Main category: cs.CL

TL;DR: FARE is a family of foundational automatic reasoning evaluators trained on 2.5M samples across multiple evaluation tasks, achieving state-of-the-art performance and demonstrating strong real-world utility in inference-time reranking and RL training.

DetailsMotivation: To address the need for scalable evaluation during training and test-time by focusing on data scaling rather than just methodology improvements, as recent work has largely focused on applying new techniques like RL while shying away from large-scale data-driven development.

Method: Curated 2.5M samples spanning five evaluation tasks and multiple reasoning domains, then trained FARE evaluators (8B and 20B parameters) using simple iterative rejection-sampling supervised finetuning (SFT).

Result: FARE-8B challenges larger specialized RL-trained evaluators, FARE-20B sets new standard for open-source evaluators surpassing specialized 70B+ evaluators. Achieves near-oracle performance on MATH as rerankers, improves RL-trained model performance by up to 14.1% as verifiers, and FARE-Code outperforms gpt-oss-20B by 65% on test-case quality evaluation.

Conclusion: Data scaling is crucial for developing high-quality automatic evaluators, and FARE demonstrates that simple SFT approaches with large-scale curated data can achieve state-of-the-art performance across diverse evaluation tasks and real-world applications.

Abstract: Finetuning specialized generative evaluators has emerged as a popular paradigm to meet the increasing demand for scalable evaluation during both training and test-time. However, recent work has largely focused on applying new methodology, such as reinforcement learning (RL), to training evaluators, shying away from large-scale, data-driven development. In this work, we focus on data scaling, curating a set of 2.5M samples spanning five unique evaluation tasks (pairwise, step-level, reference-free and reference-based verification, and single rating) and multiple domains focused on reasoning evaluation. With our data, we train Foundational Automatic Reasoning Evaluators (FARE), a family of 8B and 20B (with 3.6B active) parameter evaluators, with a simple iterative rejection-sampling supervised finetuning (SFT) approach. FARE-8B challenges larger specialized RL-trained evaluators and FARE-20B sets the new standard for open-source evaluators, surpassing specialized 70B+ evaluators. Beyond static benchmarks, we evaluate FARE in real-world tasks: As inference-time rerankers, FARE-20B achieves near-oracle performance on MATH. As verifiers in RL training, FARE improves the downstream RL-trained model performance by up to 14.1% vs. string-matching verifiers. When initialized from FARE, a continually-finetuned FARE-Code outperforms gpt-oss-20B by 65% on evaluating test-case quality.
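
Inference-time reranking, as mentioned above, can be sketched as best-of-N selection: sample several candidate answers, score each with an evaluator, and keep the top one. The scoring function below is a hypothetical stand-in and not FARE's interface, which the abstract does not describe.

```python
from typing import Callable, Sequence

def rerank_best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the candidate with the highest evaluator score (first one wins ties)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score)

def toy_score(answer: str) -> float:
    """Hypothetical evaluator stand-in: rewards answers that end in '= 4'."""
    return 1.0 if answer.strip().endswith("= 4") else 0.0

samples = ["2 + 2 = 5", "2 + 2 = 4", "2 + 2 = 22"]
best = rerank_best_of_n(samples, toy_score)
```

In practice `score` would wrap a call to a trained evaluator, and the quality of the final answer is bounded by whether any sampled candidate is correct, which is what "near-oracle" refers to.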

[45] Leveraging the Power of Large Language Models in Entity Linking via Adaptive Routing and Targeted Reasoning

Yajie Li, Albert Galimov, Mitra Datta Ganapaneni, Pujitha Thejaswi, De Meng, Priyanshu Kumar, Saloni Potdar

Main category: cs.CL

TL;DR: ARTER is an efficient Entity Linking method that combines candidate generation, context-based scoring, adaptive routing, and selective LLM reasoning to achieve high performance without deep fine-tuning.

DetailsMotivation: Traditional EL methods require large annotated datasets and extensive fine-tuning, while few-shot LLM-based methods are inefficient due to expensive reasoning for all mentions.

Method: Uses a structured pipeline with candidate generation, context-based scoring, adaptive routing to categorize mentions as easy/hard cases, then applies low-computational linking for easy cases and targeted LLM reasoning for hard cases.

Result: Outperforms ReFinED by up to +4.47% (avg +2.53% on 5/6 datasets), performs comparably to full LLM reasoning pipelines while being twice as efficient in LLM token usage.

Conclusion: ARTER achieves state-of-the-art EL performance with significantly improved efficiency through adaptive routing and targeted reasoning strategies.

Abstract: Entity Linking (EL) has traditionally relied on large annotated datasets and extensive model fine-tuning. While recent few-shot methods leverage large language models (LLMs) through prompting to reduce training requirements, they often suffer from inefficiencies due to expensive LLM-based reasoning. ARTER (Adaptive Routing and Targeted Entity Reasoning) presents a structured pipeline that achieves high performance without deep fine-tuning by strategically combining candidate generation, context-based scoring, adaptive routing, and selective reasoning. ARTER computes a small set of complementary signals (both embedding and LLM-based) over the retrieved candidates to categorize contextual mentions into easy and hard cases. The cases are then handled by a low-computational entity linker (e.g., ReFinED) and more expensive targeted LLM-based reasoning, respectively. On standard benchmarks, ARTER outperforms ReFinED by up to +4.47%, with an average gain of +2.53% on 5 out of 6 datasets, and performs comparably to pipelines using LLM-based reasoning for all mentions, while being twice as efficient in terms of the number of LLM tokens.
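
The routing step can be pictured with a small sketch. The rule used here, a score margin between the top two candidates, is an assumption chosen for illustration; the abstract only says that ARTER combines embedding and LLM-based signals, and the mentions, scores, and threshold below are made up.

```python
from dataclasses import dataclass

@dataclass
class Mention:
    text: str
    candidate_scores: dict  # candidate entity -> context-based score (hypothetical)

def route(mention: Mention, margin_threshold: float = 0.2) -> str:
    """Label a mention 'easy' when the top candidate clearly beats the runner-up."""
    ranked = sorted(mention.candidate_scores.values(), reverse=True)
    if len(ranked) < 2:
        # A single candidate is unambiguous; no candidates needs closer reasoning.
        return "easy" if ranked else "hard"
    return "easy" if ranked[0] - ranked[1] >= margin_threshold else "hard"

mentions = [
    Mention("Paris", {"Paris (France)": 0.92, "Paris (Texas)": 0.31}),
    Mention("Jordan", {"Jordan (country)": 0.48, "Michael Jordan": 0.45}),
]
routes = {m.text: route(m) for m in mentions}
```

Easy mentions would go to the cheap linker and hard ones to LLM reasoning, so the token savings depend on how many mentions fall on the easy side of the threshold.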

[46] GlobalRAG: Enhancing Global Reasoning in Multi-hop Question Answering via Reinforcement Learning

Jinchang Luo, Mingquan Cheng, Fan Wan, Ni Li, Xiaoling Xia, Shuangshuang Tian, Tingcheng Bian, Haiwei Wang, Haohuan Fu, Yan Tao

Main category: cs.CL

TL;DR: GlobalRAG is a reinforcement learning framework that enhances multi-hop QA by decomposing questions into subgoals, coordinating retrieval with reasoning, and refining evidence iteratively, achieving significant improvements with less training data.

DetailsMotivation: Current reinforcement learning approaches for RAG in multi-hop QA suffer from absence of global planning to structure multi-step reasoning and unfaithful execution that hinders effective query formulation and consistent use of retrieved evidence.

Method: Proposes GlobalRAG framework with question decomposition into subgoals, retrieval-reasoning coordination, and iterative evidence refinement. Introduces Planning Quality Reward and SubGoal Completion Reward with progressive weight annealing to balance process-oriented and outcome-based objectives.

Result: Extensive experiments show GlobalRAG significantly outperforms strong baselines using only 8k training data (42% of baseline data), achieving average improvements of 14.2% in both EM and F1 on in-domain and out-of-domain benchmarks.

Conclusion: GlobalRAG effectively addresses global planning and faithful execution limitations in multi-hop QA through structured reasoning and reward mechanisms, demonstrating superior performance with reduced training data requirements.

Abstract: Reinforcement learning has recently shown promise in improving retrieval-augmented generation (RAG). Despite these advances, its effectiveness in multi-hop question answering (QA) remains constrained by two fundamental limitations: (i) the absence of global planning to structure multi-step reasoning, and (ii) unfaithful execution, which hinders effective query formulation and consistent use of retrieved evidence. We propose GlobalRAG, a reinforcement learning framework designed to enhance global reasoning in multi-hop QA. GlobalRAG decomposes questions into subgoals, coordinates retrieval with reasoning, and refines evidence iteratively. To guide this process, we introduce Planning Quality Reward and SubGoal Completion Reward, which encourage coherent planning and reliable subgoal execution. In addition, a progressive weight annealing strategy balances process-oriented and outcome-based objectives. Extensive experiments on both in-domain and out-of-domain benchmarks demonstrate that GlobalRAG significantly outperforms strong baselines while using only 8k training data (42% of the training data used by strong baselines), achieving average improvements of 14.2% in both EM and F1.
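
One way to read "progressive weight annealing" is a schedule that shifts weight from process rewards to the outcome reward as training proceeds. The sketch below assumes a linear schedule, equal weighting of the two process rewards, and the start and end values shown; none of these details come from the abstract.

```python
def process_weight(step: int, total_steps: int, start: float = 0.8, end: float = 0.2) -> float:
    """Linearly anneal the process-reward weight from `start` to `end` (assumed schedule)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac

def combined_reward(planning: float, subgoal: float, outcome: float,
                    step: int, total_steps: int) -> float:
    """Mix the process rewards (planning, subgoal) with the outcome reward."""
    w = process_weight(step, total_steps)
    process = 0.5 * (planning + subgoal)  # equal weighting is an assumption
    return w * process + (1.0 - w) * outcome

# Early in training the process rewards dominate; late in training the outcome does.
early = combined_reward(planning=1.0, subgoal=1.0, outcome=0.0, step=0, total_steps=100)
late = combined_reward(planning=1.0, subgoal=1.0, outcome=0.0, step=100, total_steps=100)
```

The intent of such a schedule is to reward good plans while the policy is still learning to decompose questions, then let answer correctness take over.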

[47] Exploration of Summarization by Generative Language Models for Automated Scoring of Long Essays

Haowei Hua, Hong Jiao, Xinyi Wang

Main category: cs.CL

TL;DR: Using generative language models with summarization and prompting improves automated scoring of long essays, overcoming BERT’s 512-token limit and increasing QWK from 0.822 to 0.8878.

DetailsMotivation: BERT and its variants have a 512-token limit, which is insufficient for automated scoring of long essays, creating a need for alternative approaches.

Method: Employ generative language models for automated scoring through summarization and prompting techniques.

Result: Significant improvement in scoring accuracy with QWK increasing from 0.822 to 0.8878 on the Learning Agency Lab Automated Essay Scoring 2.0 dataset.

Conclusion: Generative language models with summarization and prompting are effective for automated scoring of long essays, outperforming encoder-based models like BERT.

Abstract: BERT and its variants are extensively explored for automated scoring. However, the 512-token limit of these encoder-based models is a deficiency in automated scoring of long essays. Thus, this research explores generative language models for automated scoring of long essays via summarization and prompting. The results show a substantial improvement in scoring accuracy, with QWK increasing from 0.822 to 0.8878 on the Learning Agency Lab Automated Essay Scoring 2.0 dataset.
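
QWK (quadratic weighted kappa) measures agreement between two sets of ordinal scores, penalizing disagreements by the square of their distance. The sketch below is a plain-Python version of the standard formula; the scores and the 1 to 6 scale are made up for illustration and are not data from the paper.

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_score, max_score):
    """Quadratic weighted kappa between two lists of integer scores.

    Assumes at least two score levels and that the raters do not both
    give one identical constant score (otherwise the denominator is zero).
    """
    n = max_score - min_score + 1
    observed = [[0] * n for _ in range(n)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_score][b - min_score] += 1

    total = len(rater_a)
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    num = den = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2
            expected = hist_a[i] * hist_b[j] / total  # agreement expected by chance
            num += weight * observed[i][j]
            den += weight * expected
    return 1.0 - num / den

# Made-up human and model scores on a 1 to 6 scale.
human = [1, 2, 3, 4, 5, 6, 3, 4]
model = [1, 2, 3, 4, 5, 5, 3, 3]
qwk = quadratic_weighted_kappa(human, model, min_score=1, max_score=6)
```

A value of 1 means perfect agreement and 0 means agreement no better than chance, so the move from 0.822 to 0.8878 reported above is a gain on an already high baseline.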

[48] Pragmatic Theories Enhance Understanding of Implied Meanings in LLMs

Takuma Sato, Seiya Kawano, Koichiro Yoshino

Main category: cs.CL

TL;DR: Using pragmatic theories as prompts improves language models’ ability to understand implied meanings, achieving up to 9.6% better performance on pragmatic reasoning tasks compared to standard Chain-of-Thought prompting.

DetailsMotivation: Language models need to accurately interpret implied meanings for effective human communication, but current approaches may not fully leverage established pragmatic theories.

Method: Presenting pragmatic theories (Gricean pragmatics, Relevance Theory) as prompts to guide language models through step-by-step reasoning processes for interpreting implied meanings.

Result: Models achieved up to 9.6% higher scores on pragmatic reasoning tasks compared to 0-shot Chain-of-Thought baseline. Even just mentioning theory names improved performance by 1-3% in larger models.

Conclusion: Providing pragmatic theories as prompts is an effective in-context learning approach that significantly enhances language models’ ability to understand implied meanings.

Abstract: The ability to accurately interpret implied meanings plays a crucial role in human communication and language use, and language models are also expected to possess this capability. This study demonstrates that providing language models with pragmatic theories as prompts is an effective in-context learning approach for tasks to understand implied meanings. Specifically, we propose an approach in which an overview of pragmatic theories, such as Gricean pragmatics and Relevance Theory, is presented as a prompt to the language model, guiding it through a step-by-step reasoning process to derive a final interpretation. Experimental results showed that, compared to the baseline, which prompts intermediate reasoning without presenting pragmatic theories (0-shot Chain-of-Thought), our methods enabled language models to achieve up to 9.6% higher scores on pragmatic reasoning tasks. Furthermore, we show that even without explaining the details of pragmatic theories, merely mentioning their names in the prompt leads to a certain performance improvement (around 1-3%) in larger models compared to the baseline.

[49] HalluClean: A Unified Framework to Combat Hallucinations in LLMs

Yaxin Zhao, Yu Zhang

Main category: cs.CL

TL;DR: HalluClean is a lightweight, task-agnostic framework that detects and corrects hallucinations in LLM-generated text using a reasoning-enhanced paradigm without external knowledge or supervised detectors.

DetailsMotivation: LLMs often produce hallucinated content that undermines factual reliability, creating a need for methods to improve factual consistency in LLM outputs.

Method: Uses a reasoning-enhanced paradigm with planning, execution, and revision stages to identify and refine unsupported claims. Employs minimal task-routing prompts for zero-shot generalization across domains without external knowledge sources.

Result: Significantly improves factual consistency and outperforms competitive baselines across five tasks: question answering, dialogue, summarization, math word problems, and contradiction detection.

Conclusion: HalluClean demonstrates potential to enhance the trustworthiness of LLM outputs in real-world applications through its lightweight, task-agnostic approach to hallucination detection and correction.

Abstract: Large language models (LLMs) have achieved impressive performance across a wide range of natural language processing tasks, yet they often produce hallucinated content that undermines factual reliability. To address this challenge, we introduce HalluClean, a lightweight and task-agnostic framework for detecting and correcting hallucinations in LLM-generated text. HalluClean adopts a reasoning-enhanced paradigm, explicitly decomposing the process into planning, execution, and revision stages to identify and refine unsupported claims. It employs minimal task-routing prompts to enable zero-shot generalization across diverse domains, without relying on external knowledge sources or supervised detectors. We conduct extensive evaluations on five representative tasks: question answering, dialogue, summarization, math word problems, and contradiction detection. Experimental results show that HalluClean significantly improves factual consistency and outperforms competitive baselines, demonstrating its potential to enhance the trustworthiness of LLM outputs in real-world applications.

[50] The Learning Dynamics of Subword Segmentation for Morphologically Diverse Languages

Francois Meyer, Jan Buys

Main category: cs.CL

TL;DR: Study of how subword segmentation evolves during pretraining and finetuning in language models, analyzing three languages with different morphological characteristics.

DetailsMotivation: To understand the learning dynamics of subword segmentation when it can be dynamically optimized during training rather than being fixed in preprocessing.

Method: Extended subword segmental language model (SSLM) framework to support pretraining and finetuning, trained on three typologically diverse languages: Isi-Xhosa (conjunctive), Setswana (disjunctive), and English (middle ground).

Result: Identified four stages of subword learning, with morphologically complex Isi-Xhosa showing greater instability. During finetuning, subword boundaries shift to become finer-grained.

Conclusion: Learnable subwords offer a promising approach to improve text generation and cross-lingual transfer for low-resource, morphologically complex languages.

Abstract: Subword segmentation is typically applied in preprocessing and stays fixed during training. Alternatively, it can be learned during training to optimise the training objective. In this paper we study the learning dynamics of subword segmentation: if a language model can dynamically optimise tokenisation, how do its subwords evolve during pretraining and finetuning? To explore this, we extend the subword segmental language model (SSLM), a framework for learning subwords during training, to support pretraining and finetuning. We train models for three typologically diverse languages to study learning dynamics across the morphological spectrum: Isi-Xhosa is conjunctive (long word forms composed of many morphemes), Setswana is disjunctive (morphemes written as separate words), and English represents a typological middle ground. We analyse subword dynamics from a linguistic perspective, tracking morphology, productivity, and fertility. We identify four stages of subword learning, with the morphologically complex isi-Xhosa exhibiting greater instability. During finetuning, subword boundaries shift to become finer-grained. Lastly, we show that learnable subwords offer a promising approach to improve text generation and cross-lingual transfer for low-resource, morphologically complex languages.

[51] Where does an LLM begin computing an instruction?

Aditya Pola, Vineeth N. Balasubramanian

Main category: cs.CL

TL;DR: The paper identifies where instruction following begins in language models by measuring when activation interventions change predictions, finding an inflection point called ‘onset’ where early interventions become ineffective.

DetailsMotivation: To understand where in the layer stack instruction following transitions from reading to doing, and to develop a replicable method for locating this transition point across different tasks and model sizes.

Method: Used three simple datasets and their compositions, applied activation patching on minimal-contrast prompt pairs, and measured layer-wise flip rates to identify when substituting activations changes predicted answers.

Result: Found an inflection point (onset) across Llama family models where interventions that change predictions before this point become largely ineffective afterward. Multi-hop compositions showed similar onset locations.

Conclusion: Provides a simple, replicable method to locate where instruction following begins and compare this location across tasks and model sizes.

Abstract: Following an instruction involves distinct sub-processes, such as reading content, reading the instruction, executing it, and producing an answer. We ask where, along the layer stack, instruction following begins, the point where reading gives way to doing. We introduce three simple datasets (Key-Value, Quote Attribution, Letter Selection) and two-hop compositions of these tasks. Using activation patching on minimal-contrast prompt pairs, we measure a layer-wise flip rate that indicates when substituting selected residual activations changes the predicted answer. Across models in the Llama family, we observe an inflection point, which we term onset, where interventions that change predictions before this point become largely ineffective afterward. Multi-hop compositions show a similar onset location. These results provide a simple, replicable way to locate where instruction following begins and to compare this location across tasks and model sizes.
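
The flip-rate measurement can be sketched without any model code: for each layer, compare the answers predicted on clean prompts with the answers predicted after patching activations at that layer, and record the fraction that changed. The threshold rule for picking the onset layer is an assumption made for this sketch, since the abstract describes an inflection point but not how it is located, and the flip rates below are made-up numbers.

```python
def flip_rate(clean_preds, patched_preds):
    """Fraction of prompts whose predicted answer changes after patching."""
    flips = sum(c != p for c, p in zip(clean_preds, patched_preds))
    return flips / len(clean_preds)

def find_onset(flip_rates, threshold=0.5):
    """Return the first layer whose flip rate drops below `threshold`, or None.

    The threshold rule is an assumed stand-in for locating the inflection point.
    """
    for layer, rate in enumerate(flip_rates):
        if rate < threshold:
            return layer
    return None

# Flip rate for one layer: two of the four answers changed after patching.
layer_rate = flip_rate(["A", "B", "C", "D"], ["B", "B", "A", "D"])

# Made-up per-layer flip rates: patching early layers flips most answers,
# patching later layers flips few.
rates_by_layer = [0.95, 0.90, 0.85, 0.30, 0.10, 0.05]
onset = find_onset(rates_by_layer)
```

Producing the patched predictions themselves requires hooking into a model's residual stream, which is outside the scope of this sketch.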

[52] Expert-Guided Prompting and Retrieval-Augmented Generation for Emergency Medical Service Question Answering

Xueren Ge, Sahil Murtaza, Anthony Cortez, Homa Alemzadeh

Main category: cs.CL

TL;DR: EMSQA introduces a medical QA dataset with clinical expertise context and proposes Expert-CoT and ExpertRAG methods that improve LLM performance on EMS certification exams by incorporating domain-specific knowledge.

DetailsMotivation: Existing LLMs overlook domain-specific expertise like clinical subject areas and certification levels in medical QA, limiting performance in high-stakes settings.

Method: Created EMSQA dataset (24.3K questions across 10 clinical areas and 4 certification levels) with aligned knowledge bases. Introduced Expert-CoT (subject/level-conditioned chain-of-thought) and ExpertRAG (retrieval from subject-aligned documents and patient data).

Result: Expert-CoT improves up to 2.05% over vanilla CoT. Expert-CoT + ExpertRAG yields up to 4.59% gain over standard RAG. 32B expertise-augmented LLMs pass all EMS certification simulation exams.

Conclusion: Incorporating clinical expertise context through structured prompting and retrieval significantly enhances LLM performance in medical question answering, enabling certification exam success.

Abstract: Large language models (LLMs) have shown promise in medical question answering, yet they often overlook the domain-specific expertise that professionals depend on, such as the clinical subject areas (e.g., trauma, airway) and the certification level (e.g., EMT, Paramedic). Existing approaches typically apply general-purpose prompting or retrieval strategies without leveraging this structured context, limiting performance in high-stakes settings. We address this gap with EMSQA, a 24.3K-question multiple-choice dataset spanning 10 clinical subject areas and 4 certification levels, accompanied by curated, subject area-aligned knowledge bases (40K documents and 2M tokens). Building on EMSQA, we introduce (i) Expert-CoT, a prompting strategy that conditions chain-of-thought (CoT) reasoning on specific clinical subject area and certification level, and (ii) ExpertRAG, a retrieval-augmented generation pipeline that grounds responses in subject area-aligned documents and real-world patient data. Experiments on 4 LLMs show that Expert-CoT improves up to 2.05% over vanilla CoT prompting. Additionally, combining Expert-CoT with ExpertRAG yields up to a 4.59% accuracy gain over standard RAG baselines. Notably, the 32B expertise-augmented LLMs pass all the computer-adaptive EMS certification simulation exams.
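
Conditioning chain-of-thought on subject area and certification level can be pictured as a prompt template. The wording of the template, the function name, and the sample question below are hypothetical and are not taken from EMSQA or the paper's prompts.

```python
def build_expert_cot_prompt(question: str, options: list,
                            subject_area: str, certification_level: str) -> str:
    """Assemble a CoT prompt conditioned on subject area and certification level."""
    lettered = "\n".join(f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(options))
    return (
        f"You are answering an EMS exam question at the {certification_level} level.\n"
        f"Clinical subject area: {subject_area}\n\n"
        f"Question: {question}\n{lettered}\n\n"
        "Think step by step using knowledge from this subject area, "
        "then give the letter of the best answer."
    )

# Made-up question for illustration only.
prompt = build_expert_cot_prompt(
    "A patient has snoring respirations. What is the first action?",
    ["Open the airway", "Apply a tourniquet", "Check blood glucose"],
    subject_area="airway",
    certification_level="EMT",
)
```

A vanilla CoT baseline would use the same question and options without the first two lines, which isolates the effect of the added expertise context.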

[53] Critical or Compliant? The Double-Edged Sword of Reasoning in Chain-of-Thought Explanations

Eunkyu Park, Wesley Hanwen Deng, Vasudha Varadarajan, Mingxi Yan, Gunhee Kim, Maarten Sap, Motahhare Eslami

Main category: cs.CL

TL;DR: Chain-of-Thought explanations in multimodal moral scenarios can both clarify and mislead, as users often equate trust with outcome agreement and confident delivery tones suppress error detection while maintaining reliance.

DetailsMotivation: To study the double-edged role of CoT explanations in fostering transparency versus confirmation bias, particularly how reasoning errors impact user trust and error detection in multimodal moral scenarios.

Method: Systematically perturbing reasoning chains and manipulating delivery tones in vision language models (VLMs), analyzing reasoning errors and their impact on user trust and error detection capabilities.

Result: Two key effects: (1) users equate trust with outcome agreement, sustaining reliance even with flawed reasoning; (2) confident delivery tones suppress error detection while maintaining reliance, showing style can override correctness.

Conclusion: CoT explanations can simultaneously clarify and mislead, highlighting the need for NLP systems to provide explanations that encourage scrutiny and critical thinking rather than blind trust.

Abstract: Explanations are often promoted as tools for transparency, but they can also foster confirmation bias; users may assume reasoning is correct whenever outputs appear acceptable. We study this double-edged role of Chain-of-Thought (CoT) explanations in multimodal moral scenarios by systematically perturbing reasoning chains and manipulating delivery tones. Specifically, we analyze reasoning errors in vision language models (VLMs) and how they impact user trust and the ability to detect errors. Our findings reveal two key effects: (1) users often equate trust with outcome agreement, sustaining reliance even when reasoning is flawed, and (2) the confident tone suppresses error detection while maintaining reliance, showing that delivery styles can override correctness. These results highlight how CoT explanations can simultaneously clarify and mislead, underscoring the need for NLP systems to provide explanations that encourage scrutiny and critical thinking rather than blind trust. All code will be released publicly.

[54] Knowledge-Grounded Agentic Large Language Models for Multi-Hazard Understanding from Reconnaissance Reports

Chenchen Kuai, Zihao Li, Braden Rosen, Stephanie Paal, Navid Jafari, Jean-Louis Briaud, Yunlong Zhang, Youssef M. A. Hashash, Yang Zhou

Main category: cs.CL

TL;DR: MoRA-RAG is a knowledge-grounded LLM framework that transforms unstructured post-disaster reconnaissance reports into structured knowledge for multi-hazard reasoning, achieving 94.5% accuracy and reducing hallucinations.

DetailsMotivation: Post-disaster reconnaissance reports contain critical evidence for multi-hazard interactions but their unstructured narratives make systematic knowledge transfer difficult. LLMs can analyze these reports but often generate unreliable outputs without domain grounding.

Method: Introduces Mixture-of-Retrieval Agentic RAG (MoRA-RAG) with dynamic query routing across hazard-specific databases, agentic chunking for contextual coherence, and verification loops for evidence assessment and query refinement.

Result: Achieves 94.5% accuracy on HazardRecQA dataset (derived from 90 global events across 7 hazard types), outperforming zero-shot LLMs by 30% and state-of-the-art RAG systems by 10%, while reducing hallucinations across LLM architectures.

Conclusion: MoRA-RAG establishes a new paradigm for transforming post-disaster documentation into actionable, trustworthy intelligence for hazard resilience, enabling open-weight LLMs to match proprietary model performance.

Abstract: Post-disaster reconnaissance reports contain critical evidence for understanding multi-hazard interactions, yet their unstructured narratives make systematic knowledge transfer difficult. Large language models (LLMs) offer new potential for analyzing these reports, but often generate unreliable or hallucinated outputs when domain grounding is absent. This study introduces the Mixture-of-Retrieval Agentic RAG (MoRA-RAG), a knowledge-grounded LLM framework that transforms reconnaissance reports into a structured foundation for multi-hazard reasoning. The framework integrates a Mixture-of-Retrieval mechanism that dynamically routes queries across hazard-specific databases while using agentic chunking to preserve contextual coherence during retrieval. It also includes a verification loop that assesses evidence sufficiency, refines queries, and initiates targeted searches when information remains incomplete. We construct HazardRecQA by deriving question-answer pairs from GEER reconnaissance reports, which document 90 global events across seven major hazard types. MoRA-RAG achieves up to 94.5 percent accuracy, outperforming zero-shot LLMs by 30 percent and state-of-the-art RAG systems by 10 percent, while reducing hallucinations across diverse LLM architectures. MoRA-RAG also enables open-weight LLMs to achieve performance comparable to proprietary models. It establishes a new paradigm for transforming post-disaster documentation into actionable, trustworthy intelligence for hazard resilience.

[55] Based on Data Balancing and Model Improvement for Multi-Label Sentiment Classification Performance Enhancement

Zijin Su, Huanzhu Lyu, Yuren Niu, Yiming Liu

Main category: cs.CL

TL;DR: Created a balanced multi-label sentiment dataset and developed an enhanced classification model that significantly outperforms models trained on imbalanced data.

DetailsMotivation: Existing multi-label sentiment datasets like GoEmotions suffer from severe class imbalance, which hampers model performance for underrepresented emotions.

Method: Constructed balanced dataset by integrating GoEmotions, Sentiment140 samples labeled with RoBERTa-base-GoEmotions, and GPT-4 mini generated texts. Built model with FastText embeddings, CNN for local features, BiLSTM for context, attention mechanism, and sigmoid output layer with mixed precision training.

Result: Experimental results show significant improvements in accuracy, precision, recall, F1-score, and AUC compared to models trained on imbalanced data.

Conclusion: The approach effectively addresses class imbalance in multi-label sentiment classification and demonstrates substantial performance improvements.

Abstract: Multi-label sentiment classification plays a vital role in natural language processing by detecting multiple emotions within a single text. However, existing datasets like GoEmotions often suffer from severe class imbalance, which hampers model performance, especially for underrepresented emotions. To address this, we constructed a balanced multi-label sentiment dataset by integrating the original GoEmotions data, emotion-labeled samples from Sentiment140 using a RoBERTa-base-GoEmotions model, and manually annotated texts generated by GPT-4 mini. Our data balancing strategy ensured an even distribution across 28 emotion categories. Based on this dataset, we developed an enhanced multi-label classification model that combines pre-trained FastText embeddings, convolutional layers for local feature extraction, bidirectional LSTM for contextual learning, and an attention mechanism to highlight sentiment-relevant words. A sigmoid-activated output layer enables multi-label prediction, and mixed precision training improves computational efficiency. Experimental results demonstrate significant improvements in accuracy, precision, recall, F1-score, and AUC compared to models trained on imbalanced data, highlighting the effectiveness of our approach.
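The sigmoid output layer mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' model: the logits, the three label names, and the 0.5 threshold are assumptions chosen for illustration. It only shows why a sigmoid (rather than a softmax) allows several emotions to be predicted for one text: each label gets an independent probability.

```python
import math

def sigmoid(x: float) -> float:
    """Map a raw logit to an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def predict_labels(logits, label_names, threshold=0.5):
    """Return every label whose sigmoid probability meets the threshold.

    The threshold of 0.5 is an illustrative assumption, not a value from the paper.
    """
    probs = [sigmoid(z) for z in logits]
    return [name for name, p in zip(label_names, probs) if p >= threshold]

# Hypothetical logits for three of the emotion categories.
labels = ["joy", "anger", "surprise"]
predicted = predict_labels([2.0, -1.5, 0.3], labels)
print(predicted)
```

Because the probabilities are not forced to sum to one, zero, one, or several labels can pass the threshold for the same input.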

[56] ConInstruct: Evaluating Large Language Models on Conflict Detection and Resolution in Instructions

Xingwei He, Qianru Zhang, Pengfei Chen, Guanhua Chen, Linlin Yu, Yuan Yuan, Siu-Ming Yiu

Main category: cs.CL

TL;DR: ConInstruct benchmark reveals LLMs can detect instruction conflicts but rarely notify users or request clarification.

DetailsMotivation: Existing LLM evaluation focuses on instruction adherence but overlooks scenarios with conflicting constraints in complex prompts.

Method: Introduce ConInstruct benchmark to assess LLMs’ conflict detection and resolution abilities through systematic evaluation.

Result: Proprietary LLMs show strong conflict detection (DeepSeek-R1: 91.5%, Claude-4.5-Sonnet: 87.3% F1), but rarely notify users about conflicts or request clarification.

Conclusion: Current LLMs have critical shortcomings in handling conflicting instructions, highlighting an important area for future improvement in instruction-following models.

Abstract: Instruction-following is a critical capability of Large Language Models (LLMs). While existing works primarily focus on assessing how well LLMs adhere to user instructions, they often overlook scenarios where instructions contain conflicting constraints, a common occurrence in complex prompts. The behavior of LLMs under such conditions remains under-explored. To bridge this gap, we introduce ConInstruct, a benchmark specifically designed to assess LLMs’ ability to detect and resolve conflicts within user instructions. Using this dataset, we evaluate LLMs’ conflict detection performance and analyze their conflict resolution behavior. Our experiments reveal two key findings: (1) Most proprietary LLMs exhibit strong conflict detection capabilities, whereas among open-source models, only DeepSeek-R1 demonstrates similarly strong performance. DeepSeek-R1 and Claude-4.5-Sonnet achieve the highest average F1-scores at 91.5% and 87.3%, respectively, ranking first and second overall. (2) Despite their strong conflict detection abilities, LLMs rarely explicitly notify users about the conflicts or request clarification when faced with conflicting constraints. These results underscore a critical shortcoming in current LLMs and highlight an important area for future improvement when designing instruction-following LLMs.

[57] MedBench v4: A Robust and Scalable Benchmark for Evaluating Chinese Medical Language Models, Multimodal Models, and Intelligent Agents

Jinru Ding, Lu Lu, Chao Ding, Mouxiao Bian, Jiayuan Chen, Wenrao Pang, Ruiyao Chen, Xinwei Peng, Renjie Lu, Sijie Ren, Guanxu Zhu, Xiaoqin Wu, Zhiqiang Liu, Rongzhao Zhang, Luyi Jiang, Bing Han, Yunqiu Wang, Jie Xu

Main category: cs.CL

TL;DR: MedBench v4 is a comprehensive medical AI benchmarking platform with 700,000+ expert-curated tasks across 24 specialties, evaluating LLMs, multimodal models, and agents. Results show base LLMs score 54.1/100 (best: Claude Sonnet 4.5, 62.5/100) with poor safety (18.4/100), while agents significantly improve performance to 79.8/100.

DetailsMotivation: To create evaluation frameworks that reflect real clinical workflows and safety constraints for advancing medical LLMs, multimodal models, and agents, addressing the need for practical benchmarking aligned with clinical guidelines.

Method: Developed a nationwide, cloud-based benchmarking infrastructure with 700,000+ expert-curated tasks across 24 specialties, using multi-stage refinement and multi-round clinician review from 500+ institutions, with LLM-as-a-judge scoring calibrated to human ratings.

Result: Base LLMs scored mean 54.1/100 (Claude Sonnet 4.5 best at 62.5/100) with poor safety (18.4/100). Multimodal models performed worse (mean 47.5/100, GPT-5 best at 54.9/100). Agents significantly improved performance to mean 79.8/100, with Claude Sonnet 4.5-based agents achieving 85.3/100 overall and 88.9/100 on safety.

Conclusion: MedBench v4 reveals persistent gaps in multimodal reasoning and safety for base models, while showing that governance-aware agentic orchestration can markedly enhance clinical readiness without sacrificing capability, providing practical reference for medical AI auditing.

Abstract: Recent advances in medical large language models (LLMs), multimodal models, and agents demand evaluation frameworks that reflect real clinical workflows and safety constraints. We present MedBench v4, a nationwide, cloud-based benchmarking infrastructure comprising over 700,000 expert-curated tasks spanning 24 primary and 91 secondary specialties, with dedicated tracks for LLMs, multimodal models, and agents. Items undergo multi-stage refinement and multi-round review by clinicians from more than 500 institutions, and open-ended responses are scored by an LLM-as-a-judge calibrated to human ratings. We evaluate 15 frontier models. Base LLMs reach a mean overall score of 54.1/100 (best: Claude Sonnet 4.5, 62.5/100), but safety and ethics remain low (18.4/100). Multimodal models perform worse overall (mean 47.5/100; best: GPT-5, 54.9/100), with solid perception yet weaker cross-modal reasoning. Agents built on the same backbones substantially improve end-to-end performance (mean 79.8/100), with Claude Sonnet 4.5-based agents achieving up to 85.3/100 overall and 88.9/100 on safety tasks. MedBench v4 thus reveals persisting gaps in multimodal reasoning and safety for base models, while showing that governance-aware agentic orchestration can markedly enhance benchmarked clinical readiness without sacrificing capability. By aligning tasks with Chinese clinical guidelines and regulatory priorities, the platform offers a practical reference for hospitals, developers, and policymakers auditing medical AI.

cs.CV

[58] Gaussian See, Gaussian Do: Semantic 3D Motion Transfer from Multiview Video

Yarin Bekor, Gal Michael Harari, Or Perel, Or Litany

Main category: cs.CV

TL;DR: Gaussian See, Gaussian Do enables semantic 3D motion transfer from multiview video using dynamic 3D Gaussian Splatting with anchor-based view-aware motion embeddings for cross-view consistency.

DetailsMotivation: To achieve rig-free, cross-category motion transfer between objects with semantically meaningful correspondence without requiring complex rigging setups.

Method: Extract motion embeddings from source videos via condition inversion, apply to static target shapes, use resulting videos to supervise dynamic 3D Gaussian Splatting reconstruction with anchor-based view-aware motion embedding mechanism.

Result: Established first benchmark for semantic 3D motion transfer and demonstrated superior motion fidelity and structural consistency compared to adapted baselines.

Conclusion: The approach enables effective semantic 3D motion transfer with improved cross-view consistency and accelerated convergence through the proposed motion embedding and reconstruction pipeline.

Abstract: We present Gaussian See, Gaussian Do, a novel approach for semantic 3D motion transfer from multiview video. Our method enables rig-free, cross-category motion transfer between objects with semantically meaningful correspondence. Building on implicit motion transfer techniques, we extract motion embeddings from source videos via condition inversion, apply them to rendered frames of static target shapes, and use the resulting videos to supervise dynamic 3D Gaussian Splatting reconstruction. Our approach introduces an anchor-based view-aware motion embedding mechanism, ensuring cross-view consistency and accelerating convergence, along with a robust 4D reconstruction pipeline that consolidates noisy supervision videos. We establish the first benchmark for semantic 3D motion transfer and demonstrate superior motion fidelity and structural consistency compared to adapted baselines. Code and data for this paper are available at https://gsgd-motiontransfer.github.io/

[59] When CNNs Outperform Transformers and Mambas: Revisiting Deep Architectures for Dental Caries Segmentation

Aashish Ghimire, Jun Zeng, Roshan Paudel, Nikhil Kumar Tomar, Deepak Ranjan Nayak, Harshith Reddy Nalla, Vivek Jha, Glenda Reynolds, Debesh Jha

Main category: cs.CV

TL;DR: CNN-based DoubleU-Net outperformed transformer and Mamba architectures for dental caries segmentation on panoramic radiographs, achieving the highest dice coefficient of 0.7345 despite the trend toward complex attention-based models.

DetailsMotivation: Automated dental caries segmentation is challenging due to low lesion contrast, morphological variability, and limited annotated data. There's a need to benchmark different neural network architectures for this medical imaging task.

Method: Comprehensive benchmarking of 12 state-of-the-art architectures (CNN, vision transformers, state-space mamba) trained under identical configurations on DC1000 dataset for dental caries segmentation.

Result: CNN-based DoubleU-Net achieved the highest performance (dice: 0.7345, mIoU: 0.5978, precision: 0.8145), outperforming all transformer and Mamba variants. Top 3 results across all metrics were CNN-based architectures.

Conclusion: Architecture-task alignment is more important than model complexity in domain-specific medical image segmentation. Transformer and Mamba methods underperformed due to limited data and weaker spatial priors despite theoretical advantages in global context modeling.

Abstract: Accurate identification and segmentation of dental caries in panoramic radiographs are critical for early diagnosis and effective treatment planning. Automated segmentation remains challenging due to low lesion contrast, morphological variability, and limited annotated data. In this study, we present the first comprehensive benchmarking of convolutional neural networks, vision transformers, and state-space Mamba architectures for automated dental caries segmentation on panoramic radiographs using the DC1000 dataset. Twelve state-of-the-art architectures, including VMUnet, MambaUNet, VMUNetv2, RMAMamba-S, TransNetR, PVTFormer, DoubleU-Net, and ResUNet++, were trained under identical configurations. Results reveal that, contrary to the growing trend toward complex attention-based architectures, the CNN-based DoubleU-Net achieved the highest dice coefficient of 0.7345, mIoU of 0.5978, and precision of 0.8145, outperforming all transformer and Mamba variants. The top 3 results across all performance metrics were achieved by CNN-based architectures. Mamba and transformer-based methods, despite their theoretical advantage in global context modeling, underperformed due to limited data and weaker spatial priors. These findings underscore that architecture-task alignment matters more than model complexity in domain-specific medical image segmentation. Our code is available at: https://github.com/JunZengz/dental-caries-segmentation.
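For readers unfamiliar with the reported metrics, the following is a minimal sketch of how a Dice coefficient and an IoU score can be computed for one pair of binary masks. It is not taken from the authors' repository: the function name, the flat 0/1 mask format, and the convention of returning 1.0 when both masks are empty are assumptions made for illustration.

```python
def dice_and_iou(pred, target):
    """Compute Dice and IoU for two flat binary masks (sequences of 0/1).

    Returns 1.0 for both metrics when both masks are empty; this
    convention is an assumption, and implementations differ on it.
    """
    intersection = sum(p & t for p, t in zip(pred, target))
    pred_sum = sum(pred)
    target_sum = sum(target)
    union = pred_sum + target_sum - intersection

    dice = 2 * intersection / (pred_sum + target_sum) if (pred_sum + target_sum) else 1.0
    iou = intersection / union if union else 1.0
    return dice, iou

# Toy masks: 6 pixels, 2 of which are marked positive in both masks.
pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 0, 1, 1]
dice, iou = dice_and_iou(pred, target)
print(f"dice={dice:.4f} iou={iou:.4f}")
```

For a single pair of masks the two scores are tied by IoU = Dice / (2 - Dice). Scores averaged over a dataset, such as the mIoU quoted above, need not satisfy that identity.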

[60] B-Rep Distance Functions (BR-DF): How to Represent a B-Rep Model by Volumetric Distance Functions?

Fuyang Zhang, Pradeep Kumar Jayaraman, Xiang Xu, Yasutaka Furukawa

Main category: cs.CV

TL;DR: BR-DF is a novel geometric representation for CAD B-Rep models using volumetric distance functions, enabling 100% success rate in generating watertight B-Rep models through a modified Marching Cubes algorithm.

DetailsMotivation: To create a robust CAD B-Rep representation that guarantees successful model generation without failures, addressing limitations in existing methods.

Method: Encodes CAD geometry as signed distance functions (SDF) and topology as per-face unsigned distance functions (UDF), with a multi-branch latent diffusion model using 3D U-Net backbone for joint generation.

Result: Achieves comparable performance to state-of-the-art methods while reaching unprecedented 100% success rate in producing faceted B-Rep models.

Conclusion: BR-DF provides a reliable geometric representation for CAD B-Rep that ensures guaranteed model generation success through volumetric distance function encoding.

Abstract: This paper presents a novel geometric representation for CAD Boundary Representation (B-Rep) based on volumetric distance functions, dubbed B-Rep Distance Functions (BR-DF). BR-DF encodes the surface mesh geometry of a CAD model as a signed distance function (SDF). B-Rep vertices, edges, faces and their topology information are encoded as per-face unsigned distance functions (UDFs). An extension of the Marching Cubes algorithm converts BR-DF directly into a watertight CAD B-Rep model (strictly speaking, a faceted B-Rep model). A surprising characteristic of BR-DF is that this conversion process never fails. Leveraging the volumetric nature of BR-DF, we propose a multi-branch latent diffusion model with a 3D U-Net backbone for jointly generating the SDF and per-face UDFs of a BR-DF model. Our approach achieves CAD generation performance comparable to SOTA methods while reaching an unprecedented 100% success rate in producing (faceted) B-Rep models.

[61] CPSL: Representing Volumetric Video via Content-Promoted Scene Layers

Kaiyuan Hu, Yili Jin, Junhua Liu, Xize Duan, Hong Kang, Xue Liu

Main category: cs.CV

TL;DR: CPSL is a compact 2.5D video representation that enables volumetric-like experiences from 2D content using geometry-consistent layers with soft alpha bands and edge-depth cache for efficient novel-view synthesis.

DetailsMotivation: Existing volumetric video representations are costly in capture, computation, and rendering, limiting scalability for on-demand video and real-time communication.

Method: Decomposes frames into geometry-consistent layers using depth and content saliency guidance, with soft alpha bands and edge-depth cache for occlusion and boundary preservation. Uses depth-weighted warping and alpha compositing for novel-view synthesis.

Result: Achieves superior perceptual quality and boundary fidelity compared to baselines while reducing storage and rendering cost by several folds.

Conclusion: CPSL offers a practical path from 2D video to scalable 2.5D immersive media with real-time playback capabilities.

Abstract: Volumetric video enables immersive and interactive visual experiences by supporting free viewpoint exploration and realistic motion parallax. However, existing volumetric representations, from explicit point clouds to implicit neural fields, remain costly in capture, computation, and rendering, which limits their scalability for on-demand video and reduces their feasibility for real-time communication. To bridge this gap, we propose Content-Promoted Scene Layers (CPSL), a compact 2.5D video representation that brings the perceptual benefits of volumetric video to conventional 2D content. Guided by per-frame depth and content saliency, CPSL decomposes each frame into a small set of geometry-consistent layers equipped with soft alpha bands and an edge-depth cache that jointly preserve occlusion ordering and boundary continuity. These lightweight, 2D-encodable assets enable parallax-corrected novel-view synthesis via depth-weighted warping and front-to-back alpha compositing, bypassing expensive 3D reconstruction. Temporally, CPSL maintains inter-frame coherence using motion-guided propagation and per-layer encoding, supporting real-time playback with standard video codecs. Across multiple benchmarks, CPSL achieves superior perceptual quality and boundary fidelity compared with layer-based and neural-field baselines while reducing storage and rendering cost severalfold. Our approach offers a practical path from 2D video to scalable 2.5D immersive media.
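The front-to-back alpha compositing step named in the abstract can be sketched generically for a single pixel. This is the standard "over" operator accumulated from the nearest layer outward, not CPSL's implementation: the layer format, the function name, and the early-exit threshold are assumptions for illustration, and the depth-weighted warping that precedes compositing is omitted.

```python
def composite_front_to_back(layers, min_transmittance=1e-4):
    """Composite one pixel from layers ordered nearest first.

    layers: list of (color, alpha), where color is an (r, g, b) tuple of
    floats in [0, 1] and alpha is a float in [0, 1].
    Returns the accumulated color and the accumulated opacity.
    """
    out = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light still passing through nearer layers
    for color, alpha in layers:
        weight = transmittance * alpha
        out = [o + weight * c for o, c in zip(out, color)]
        transmittance *= 1.0 - alpha
        if transmittance <= min_transmittance:
            break  # farther layers are fully occluded
    return tuple(out), 1.0 - transmittance

# A half-transparent red layer in front of an opaque blue layer.
pixel, opacity = composite_front_to_back([((1.0, 0.0, 0.0), 0.5), ((0.0, 0.0, 1.0), 1.0)])
print(pixel, opacity)
```

Processing layers nearest first makes the early exit possible: once the accumulated opacity saturates, the remaining layers cannot change the pixel.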

[62] GeoSceneGraph: Geometric Scene Graph Diffusion Model for Text-guided 3D Indoor Scene Synthesis

Antonio Ruiz, Tao Wu, Andrew Melnik, Qing Cheng, Xuqin Wang, Lu Liu, Yongliang Wang, Yanfeng Zhang, Helge Ritter

Main category: cs.CV

TL;DR: GeoSceneGraph is a method for synthesizing 3D indoor scenes from text prompts that leverages scene graph structure and geometric symmetries without requiring predefined relationship classes or ground-truth annotations.

DetailsMotivation: Existing 3D scene synthesis methods either overlook scene graph structure (limiting coherence) or require inconvenient user-provided graphs/ground-truth relationships (limiting flexibility). There's also a need for efficient models for resource-constrained devices.

Method: Uses equivariant graph neural networks (EGNNs) with a novel text conditioning strategy to handle complex text features, leveraging scene graph structure and geometric symmetries without predefined relationship classes.

Result: Achieves performance comparable to methods that use ground-truth relationships, despite not using them. The text conditioning strategy is validated through ablation studies.

Conclusion: GeoSceneGraph provides an effective approach for text-to-3D scene synthesis that balances scene coherence with flexibility, working without predefined relationship constraints while maintaining competitive performance.

Abstract: Methods that synthesize indoor 3D scenes from text prompts have wide-ranging applications in film production, interior design, video games, virtual reality, and synthetic data generation for training embodied agents. Existing approaches typically either train generative models from scratch or leverage vision-language models (VLMs). While VLMs achieve strong performance, particularly for complex or open-ended prompts, smaller task-specific models remain necessary for deployment on resource-constrained devices such as extended reality (XR) glasses or mobile phones. However, many generative approaches that train from scratch overlook the inherent graph structure of indoor scenes, which can limit scene coherence and realism. Conversely, methods that incorporate scene graphs either demand a user-provided semantic graph, which is generally inconvenient and restrictive, or rely on ground-truth relationship annotations, limiting their capacity to capture more varied object interactions. To address these challenges, we introduce GeoSceneGraph, a method that synthesizes 3D scenes from text prompts by leveraging the graph structure and geometric symmetries of 3D scenes, without relying on predefined relationship classes. Despite not using ground-truth relationships, GeoSceneGraph achieves performance comparable to methods that do. Our model is built on equivariant graph neural networks (EGNNs), but existing EGNN approaches are typically limited to low-dimensional conditioning and are not designed to handle complex modalities such as text. We propose a simple and effective strategy for conditioning EGNNs on text features, and we validate our design through ablation studies.

[63] An Event-triggered System for Social Persuasion and Danger Alert in Elder Home Monitoring

Jun-Yi Liu, Chung-Hao Chen, Ya-Chi Tsao, Ssu-Yao Wu, Yu-Ting Tsao, Lyn Chao-ling Chen

Main category: cs.CV

TL;DR: An event-triggered system using GMM background modeling and SVM machine learning to monitor elders’ physical/mental states through three event types (watch dog, danger notice, photo link) with intuitive social media communication.

DetailsMotivation: To monitor both physical and mental states of elders while addressing their lack of technical experience through intuitive, life-like interactions.

Method: Event-triggered system with GMM background modeling for motion detection, SVM for image analysis, and intuitive social media communication designed for elders.

Result: System successfully detected three event types in home scenarios across 5 families, enabling communication via captured life activities.

Conclusion: The developed system effectively monitors elder wellbeing through automated event detection and facilitates family communication using intuitive, non-technical interfaces.

Abstract: In this study, both the physical and mental states of elders are considered, and an event-triggered system has been developed to detect three events: watch dog, danger notice, and photo link. By adopting GMM background modeling, the motion behavior of visitors and elders can be detected in the watch dog event and the danger notice event, respectively. Experiments were set in home scenarios, and 5 families participated in detecting and recording the three types of events from their life activities. In addition, the captured images were analyzed using SVM machine learning. Because elders often lack technical experience, an intuitive operation resembling a normal life activity was designed to create communication between elders and relatives via social media.

[64] HULFSynth : An INR based Super-Resolution and Ultra Low-Field MRI Synthesis via Contrast factor estimation

Pranav Indrakanti, Ivor Simpson

Main category: cs.CV

TL;DR: Unsupervised bidirectional MRI synthesizer that converts between High-Field and Ultra-Low Field images using physics-inspired forward modeling and Implicit Neural Representation for super-resolution.

DetailsMotivation: To address the contrast differences between HF and ULF MRIs by leveraging physical principles rather than relying on paired training data, enabling bidirectional synthesis without observed HF data.

Method: Forward model simulates HF to ULF transformation using tissue-type SNR estimation based on target contrast values. For super-resolution, uses INR network to synthesize HF images by predicting tissue-type segmentations and image intensity simultaneously.

Result: WM-GM contrast improved by 52% in synthetic ULF-like images and 37% in 64mT images. Model demonstrated robustness to variations in target contrast, noise, and initial seeding.

Conclusion: The physics-inspired approach successfully enables unsupervised bidirectional MRI synthesis between HF and ULF domains with significant contrast improvements and robustness to parameter variations.

Abstract: We present an unsupervised single-image bidirectional Magnetic Resonance Image (MRI) synthesizer that synthesizes an Ultra-Low Field (ULF)-like image from a High-Field (HF) magnitude image and vice versa. Unlike existing MRI synthesis models, our approach is inspired by the physics that drives contrast changes between HF and ULF MRIs. Our forward model simulates an HF-to-ULF transformation by estimating the tissue-type Signal-to-Noise ratio (SNR) values based on target contrast values. For the Super-Resolution task, we used an Implicit Neural Representation (INR) network to synthesize an HF image by simultaneously predicting tissue-type segmentations and image intensity without observed HF data. The proposed method is evaluated using synthetic ULF-like data generated from standard 3T T$_1$-weighted images for qualitative assessments and paired 3T-64mT T$_1$-weighted images for validation experiments. WM-GM contrast improved by 52% in synthetic ULF-like images and 37% in 64mT images. Sensitivity experiments demonstrated the robustness of our forward model to variations in target contrast, noise, and initial seeding.

[65] Towards Unbiased Cross-Modal Representation Learning for Food Image-to-Recipe Retrieval

Qing Wang, Chong-Wah Ngo, Ee-Peng Lim

Main category: cs.CV

TL;DR: This paper proposes a causal intervention approach to address bias in cross-modal recipe-food image retrieval by treating ingredients as confounders and using backdoor adjustment to improve similarity judgment.

DetailsMotivation: Existing approaches treat recipes as text describing dish appearance, creating bias since food images may not capture all recipe details due to cooking process, presentation, and imaging conditions. Current methods overlook subtle variations that determine retrieval relevance.

Method: Model bias using causal theory, identify ingredients as confounders, apply backdoor adjustment through causal intervention. Propose a plug-and-play neural module (multi-label ingredient classifier) for debiasing in food-to-recipe retrieval.

Result: Achieves oracle performance with MedR=1 across testing data sizes (1K, 10K, 50K) on Recipe1M dataset. Reports new state-of-the-art search performances.

Conclusion: Causal intervention effectively addresses bias in cross-modal recipe-food image retrieval by removing confounding effects of ingredients, leading to significantly improved retrieval performance.

Abstract: This paper addresses the challenges of learning representations for recipes and food images in the cross-modal retrieval problem. As the relationship between a recipe and its cooked dish is cause-and-effect, treating a recipe as a text source describing the visual appearance of a dish for representation learning, as existing approaches do, creates a bias that misleads image-and-recipe similarity judgment. Specifically, a food image may not equally capture every detail in a recipe, due to factors such as the cooking process, dish presentation, and image-capturing conditions. Current representation learning tends to capture dominant visual-text alignment while overlooking subtle variations that determine retrieval relevance. In this paper, we model such bias in cross-modal representation learning using causal theory. The causal view of this problem suggests ingredients as one of the confounder sources, and a simple backdoor adjustment can alleviate the bias. By causal intervention, we reformulate the conventional model for food-to-recipe retrieval with an additional term to remove the potential bias in similarity judgment. Based on this theory-informed formulation, we empirically show the oracle performance of retrieval on the Recipe1M dataset to be MedR=1 across the testing data sizes of 1K, 10K, and even 50K. We also propose a plug-and-play neural module, which is essentially a multi-label ingredient classifier for debiasing. New state-of-the-art search performances are reported on the Recipe1M dataset.
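The MedR figure quoted above is the median rank of the ground-truth recipe across queries, where 1 is the best possible value. The sketch below shows one common way to compute it from a similarity matrix. It is not the authors' evaluation code: the function name, the toy scores, and the assumption that recipe i is the correct match for image i are illustrative.

```python
from statistics import median

def median_rank(similarity):
    """Median rank (MedR) for image-to-recipe retrieval.

    similarity[i][j] is the score between image i and recipe j.
    Assumes recipe i is the ground-truth match for image i.
    Rank 1 means the true recipe scored highest for that image.
    """
    ranks = []
    for i, row in enumerate(similarity):
        true_score = row[i]
        # Rank = 1 + number of recipes scored strictly higher than the true one.
        ranks.append(1 + sum(1 for s in row if s > true_score))
    return median(ranks)

# Toy scores for 3 images and 3 recipes; image 1 ranks its true recipe second.
sim = [
    [0.9, 0.1, 0.2],
    [0.8, 0.7, 0.1],
    [0.3, 0.2, 0.6],
]
print(median_rank(sim))
```

Because the median ignores the tail of the rank distribution, MedR=1 means that at least half of the queries retrieve their true recipe first, not that every query does.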

[66] InstructMix2Mix: Consistent Sparse-View Editing Through Multi-View Model Personalization

Daniel Gilo, Or Litany

Main category: cs.CV

TL;DR: I-Mix2Mix is a framework that uses a multi-view diffusion model to enable consistent 3D scene editing from sparse input views by distilling 2D diffusion model capabilities while maintaining cross-view consistency.

DetailsMotivation: Existing methods for multi-view image editing from sparse views struggle with artifacts and incoherent edits, failing to maintain consistency across different viewpoints.

Method: Proposes InstructMix2Mix which replaces neural field consolidator in SDS with multi-view diffusion student, featuring incremental student updates, specialized teacher noise scheduler, and attention modification for cross-view coherence.

Result: I-Mix2Mix significantly improves multi-view consistency while maintaining high per-frame edit quality compared to existing methods.

Conclusion: The framework successfully enables consistent 3D scene editing from sparse views by leveraging multi-view diffusion models and novel adaptations to prevent degeneration and enhance coherence.

Abstract: We address the task of multi-view image editing from sparse input views, where the inputs can be seen as a mix of images capturing the scene from different viewpoints. The goal is to modify the scene according to a textual instruction while preserving consistency across all views. Existing methods, based on per-scene neural fields or temporal attention mechanisms, struggle in this setting, often producing artifacts and incoherent edits. We propose InstructMix2Mix (I-Mix2Mix), a framework that distills the editing capabilities of a 2D diffusion model into a pretrained multi-view diffusion model, leveraging its data-driven 3D prior for cross-view consistency. A key contribution is replacing the conventional neural field consolidator in Score Distillation Sampling (SDS) with a multi-view diffusion student, which requires novel adaptations: incremental student updates across timesteps, a specialized teacher noise scheduler to prevent degeneration, and an attention modification that enhances cross-view coherence without additional cost. Experiments demonstrate that I-Mix2Mix significantly improves multi-view consistency while maintaining high per-frame edit quality.

[67] UniAV: Unified Audio-Visual Perception for Multi-Task Video Event Localization

Tiantian Geng, Teng Wang, Jinming Duan, Yanfu Zhang, Weili Guan, Feng Zheng, Ling Shao

Main category: cs.CV

TL;DR: UniAV is a unified framework that simultaneously solves temporal action localization, sound event detection, and audio-visual event localization tasks using a shared audio-visual encoder and task-specific experts with a unified language-aware classifier.

DetailsMotivation: Existing methods over-specialize on individual video event localization tasks, neglecting the equal importance of different events for holistic video understanding. There's a need for a unified approach that can handle multiple event localization tasks together.

Method: Proposes UniAV with: 1) Unified audio-visual encoder for generic multi-scale representations, 2) Task-specific experts to capture unique knowledge per task, 3) Unified language-aware classifier using semantic-aligned task prompts instead of separate prediction heads.

Result: UniAV significantly outperforms both single-task models and naive multi-task baselines across all three tasks. Achieves superior or on-par performance compared to state-of-the-art task-specific methods on ActivityNet 1.3, DESED and UnAV-100 benchmarks.

Conclusion: The unified framework effectively learns and shares mutually beneficial knowledge across tasks and modalities, demonstrating impressive open-set ability to localize novel categories while maintaining strong performance across diverse video event localization tasks.

Abstract: Video event localization tasks include temporal action localization (TAL), sound event detection (SED) and audio-visual event localization (AVEL). Existing methods tend to over-specialize on individual tasks, neglecting the equal importance of these different events for a complete understanding of video content. In this work, we aim to develop a unified framework to solve TAL, SED and AVEL tasks together to facilitate holistic video understanding. However, it is challenging since different tasks emphasize distinct event characteristics and there are substantial disparities in existing task-specific datasets (size/domain/duration). It leads to unsatisfactory results when applying a naive multi-task strategy. To tackle the problem, we introduce UniAV, a Unified Audio-Visual perception network to effectively learn and share mutually beneficial knowledge across tasks and modalities. Concretely, we propose a unified audio-visual encoder to derive generic representations from multiple temporal scales for videos from all tasks. Meanwhile, task-specific experts are designed to capture the unique knowledge specific to each task. Besides, instead of using separate prediction heads, we develop a novel unified language-aware classifier by utilizing semantic-aligned task prompts, enabling our model to flexibly localize various instances across tasks with an impressive open-set ability to localize novel categories. Extensive experiments demonstrate that UniAV, with its unified architecture, significantly outperforms both single-task models and the naive multi-task baseline across all three tasks. It achieves superior or on-par performances compared to the state-of-the-art task-specific methods on ActivityNet 1.3, DESED and UnAV-100 benchmarks.

[68] Skin-R1: Toward Trustworthy Clinical Reasoning for Dermatological Diagnosis

Zehao Liu, Wejieying Ren, Jipeng Zhang, Tianxiang Zhao, Jingxi Zhu, Xiaoting Li, Vasant G. Honavar

Main category: cs.CV

TL;DR: SkinR1 is a dermatological vision-language model that combines textbook-based reasoning with reinforcement learning to improve diagnostic accuracy and generalization in dermatology.

DetailsMotivation: Current vision-language models for dermatology face limitations due to data heterogeneity, lack of grounded diagnostic rationales, and poor scalability/generalization from small annotated datasets to large sparse ones.

Method: Three-stage approach: 1) Textbook-based reasoning generator creates hierarchy-aware differential diagnosis trajectories, 2) Supervised fine-tuning with these trajectories, 3) Novel reinforcement learning paradigm incorporating disease hierarchy to transfer reasoning patterns to sparse data.

Result: Extensive experiments show SkinR1 achieves superior diagnostic accuracy on multiple dermatology datasets, with ablation studies confirming the importance of the reasoning foundation from supervised fine-tuning.

Conclusion: SkinR1 successfully addresses key limitations in dermatological VLMs by combining expert-level reasoning supervision with scalable reinforcement learning, demonstrating improved trustworthiness and clinical utility.

Abstract: The emergence of vision-language models (VLMs) has opened new possibilities for clinical reasoning and has shown promising performance in dermatological diagnosis. However, their trustworthiness and clinical utility are often limited by three major factors: (1) Data heterogeneity, where diverse datasets lack consistent diagnostic labels and clinical concept annotations; (2) Absence of grounded diagnostic rationales, leading to a scarcity of reliable reasoning supervision; and (3) Limited scalability and generalization, as models trained on small, densely annotated datasets struggle to transfer nuanced reasoning to large, sparsely-annotated ones. To address these limitations, we propose SkinR1, a novel dermatological VLM that combines deep, textbook-based reasoning with the broad generalization capabilities of reinforcement learning (RL). SkinR1 systematically resolves the key challenges through a unified, end-to-end framework. First, we design a textbook-based reasoning generator that synthesizes high-fidelity, hierarchy-aware, and differential-diagnosis (DDx)-informed trajectories, providing reliable expert-level supervision. Second, we leverage the constructed trajectories for supervised fine-tuning (SFT) empowering the model with grounded reasoning ability. Third, we develop a novel RL paradigm that, by incorporating the hierarchical structure of diseases, effectively transfers these grounded reasoning patterns to large-scale, sparse data. Extensive experiments on multiple dermatology datasets demonstrate that SkinR1 achieves superior diagnostic accuracy. The ablation study demonstrates the importance of the reasoning foundation instilled by SFT.

[69] FarSLIP: Discovering Effective CLIP Adaptation for Fine-Grained Remote Sensing Understanding

Zhenshi Li, Weikang Yu, Dilxat Muhtar, Xueliang Zhang, Pengfeng Xiao, Pedram Ghamisi, Xiao Xiang Zhu

Main category: cs.CV

TL;DR: FarSLIP is a fine-grained aligned remote sensing language-image pretraining framework that addresses CLIP’s limited spatial awareness in RS domain through patch-to-patch distillation and CLS token-based region-category alignment, achieving SOTA results on multiple RS tasks.

DetailsMotivation: Current RS-specific CLIP variants inherit CLIP's limited spatial awareness due to two key limitations: underutilized object-level supervision in RS datasets and performance degradation when applying general domain region-text alignment methods directly to RS data.

Method: Constructed MGRS-200k dataset with multi-granularity RS image-text pairs. Proposed FarSLIP framework using patch-to-patch distillation (instead of patch-to-CLS) to align local and global visual cues, and CLS token-based region-category alignment (instead of explicit patch-level alignment) to preserve semantic coherence.

Result: FarSLIP achieves state-of-the-art performance on RS open-vocabulary semantic segmentation, zero-shot classification, and image-text retrieval tasks, demonstrating improved fine-grained vision-language alignment in RS domain.

Conclusion: FarSLIP effectively addresses CLIP’s spatial awareness limitations in RS domain through careful design choices that preserve semantic coherence while enhancing spatial awareness, setting new benchmarks for RS vision-language tasks.

Abstract: As CLIP’s global alignment limits its ability to capture fine-grained details, recent efforts have focused on enhancing its region-text alignment. However, current remote sensing (RS)-specific CLIP variants still inherit this limited spatial awareness. We identify two key limitations behind this: (1) current RS image-text datasets generate global captions from object-level labels, leaving the original object-level supervision underutilized; (2) despite the success of region-text alignment methods in general domain, their direct application to RS data often leads to performance degradation. To address these, we construct the first multi-granularity RS image-text dataset, MGRS-200k, featuring rich object-level textual supervision for RS region-category alignment. We further investigate existing fine-grained CLIP tuning strategies and find that current explicit region-text alignment methods, whether in a direct or indirect way, underperform due to severe degradation of CLIP’s semantic coherence. Building on these, we propose FarSLIP, a Fine-grained Aligned RS Language-Image Pretraining framework. Rather than the commonly used patch-to-CLS self-distillation, FarSLIP employs patch-to-patch distillation to align local and global visual cues, which improves feature discriminability while preserving semantic coherence. Additionally, to effectively utilize region-text supervision, it employs simple CLS token-based region-category alignment rather than explicit patch-level alignment, further enhancing spatial awareness. FarSLIP features improved fine-grained vision-language alignment in RS domain and sets a new state of the art not only on RS open-vocabulary semantic segmentation, but also on image-level tasks such as zero-shot classification and image-text retrieval. Our dataset, code, and models are available at https://github.com/NJU-LHRS/FarSLIP.

[70] Cross Modal Fine-Grained Alignment via Granularity-Aware and Region-Uncertain Modeling

Jiale Liu, Haoming Zhou, Yishu Zhu, Bingzhi Chen, Yuncheng Jiang

Main category: cs.CV

TL;DR: Proposes a unified approach for fine-grained image-text alignment using significance-aware modeling and region-level uncertainty modeling to address limitations in existing methods.

DetailsMotivation: Existing approaches lack robust intra-modal mechanisms for assessing token significance and fail to model fine-grained uncertainty in region-word correspondences, leading to poor generalization in complex scenes.

Method: Incorporates significance-aware and granularity-aware modeling with modality-specific biases, and represents region features as a mixture of Gaussian distributions to capture fine-grained uncertainty.

Result: Achieves state-of-the-art performance on Flickr30K and MS-COCO datasets across various backbone architectures, with enhanced robustness and interpretability.

Conclusion: The proposed unified approach effectively addresses fundamental limitations in fine-grained image-text alignment through significance-aware modeling and uncertainty representation.

Abstract: Fine-grained image-text alignment is a pivotal challenge in multimodal learning, underpinning key applications such as visual question answering, image captioning, and vision-language navigation. Unlike global alignment, fine-grained alignment requires precise correspondence between localized visual regions and textual tokens, often hindered by noisy attention mechanisms and oversimplified modeling of cross-modal relationships. In this work, we identify two fundamental limitations of existing approaches: the lack of robust intra-modal mechanisms to assess the significance of visual and textual tokens, leading to poor generalization in complex scenes; and the absence of fine-grained uncertainty modeling, which fails to capture the one-to-many and many-to-one nature of region-word correspondences. To address these issues, we propose a unified approach that incorporates significance-aware and granularity-aware modeling and region-level uncertainty modeling. Our method leverages modality-specific biases to identify salient features without relying on brittle cross-modal attention, and represents region features as a mixture of Gaussian distributions to capture fine-grained uncertainty. Extensive experiments on Flickr30K and MS-COCO demonstrate that our approach achieves state-of-the-art performance across various backbone architectures, significantly enhancing the robustness and interpretability of fine-grained image-text alignment.

[71] nnMIL: A generalizable multiple instance learning framework for computational pathology

Xiangde Luo, Jinxi Xiang, Yuanfeng Ji, Ruijiang Li

Main category: cs.CV

TL;DR: nnMIL is a multiple-instance learning framework that connects patch-level pathology foundation models to robust slide-level clinical predictions through random sampling and ensemble methods.

DetailsMotivation: Current approaches for aggregating patch-level features from pathology foundation models into slide-level predictions have design limitations that hinder generalizability and reliability in computational pathology.

Method: nnMIL introduces random sampling at patch and feature levels, enabling large-batch optimization and task-aware sampling. It uses a lightweight aggregator with sliding-window inference for ensemble predictions and uncertainty estimation.

Result: Across 40,000 WSIs, 35 clinical tasks, and 4 pathology foundation models, nnMIL consistently outperformed existing MIL methods for disease diagnosis, subtyping, biomarker detection, and prognosis prediction, with strong cross-model generalization and reliable uncertainty quantification.

Conclusion: nnMIL provides a practical and generalizable solution for translating pathology foundation models into clinically meaningful predictions, advancing reliable AI deployment in real-world computational pathology settings.

Abstract: Computational pathology holds substantial promise for improving diagnosis and guiding treatment decisions. Recent pathology foundation models enable the extraction of rich patch-level representations from large-scale whole-slide images (WSIs), but current approaches for aggregating these features into slide-level predictions remain constrained by design limitations that hinder generalizability and reliability. Here, we developed nnMIL, a simple yet broadly applicable multiple-instance learning framework that connects patch-level foundation models to robust slide-level clinical inference. nnMIL introduces random sampling at both the patch and feature levels, enabling large-batch optimization, task-aware sampling strategies, and efficient and scalable training across datasets and model architectures. A lightweight aggregator performs sliding-window inference to generate ensemble slide-level predictions and supports principled uncertainty estimation. Across 40,000 WSIs encompassing 35 clinical tasks and four pathology foundation models, nnMIL consistently outperformed existing MIL methods for disease diagnosis, histologic subtyping, molecular biomarker detection, and pan-cancer prognosis prediction. It further demonstrated strong cross-model generalization, reliable uncertainty quantification, and robust survival stratification in multiple external cohorts. In conclusion, nnMIL offers a practical and generalizable solution for translating pathology foundation models into clinically meaningful predictions, advancing the development and deployment of reliable AI systems in real-world settings.
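
The two ideas named in the abstract, random patch sampling for training and sliding-window inference with ensembling, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the nnMIL implementation: `sample_bag`, `toy_aggregator`, and `sliding_window_predict` are hypothetical names, the aggregator is a fixed mean-plus-sigmoid stand-in for a learned model, and the spread across windows is only a crude proxy for the uncertainty estimate the paper describes.

```python
import math
import random
import statistics

def sample_bag(patch_features, k, rng):
    """Randomly sample up to k patch feature vectors from one slide (training-time sketch)."""
    k = min(k, len(patch_features))
    return rng.sample(patch_features, k)

def toy_aggregator(bag):
    """Stand-in for a learned aggregator: sigmoid of the mean first feature."""
    mean_val = sum(f[0] for f in bag) / len(bag)
    return 1.0 / (1.0 + math.exp(-mean_val))

def sliding_window_predict(patch_features, window, stride):
    """Score each window of patches, then ensemble the window scores.

    Returns (mean score, population standard deviation across windows).
    """
    last_start = max(1, len(patch_features) - window + 1)
    scores = [
        toy_aggregator(patch_features[start:start + window])
        for start in range(0, last_start, stride)
    ]
    spread = statistics.pstdev(scores) if len(scores) > 1 else 0.0
    return statistics.fmean(scores), spread

rng = random.Random(0)
slide = [[0.0] for _ in range(10)]  # ten hypothetical 1-D patch features
bag = sample_bag(slide, 4, rng)
mean_pred, spread = sliding_window_predict(slide, window=4, stride=2)
print(len(bag), mean_pred, spread)
```

Sampling fixed-size bags is what makes large-batch training possible, since every slide then contributes a tensor of the same shape regardless of how many patches it contains.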

[72] X-WIN: Building Chest Radiograph World Model via Predictive Sensing

Zefan Yang, Ge Wang, James Hendler, Mannudeep K. Kalra, Pingkun Yan

Main category: cs.CV

TL;DR: X-WIN is a CXR world model that learns 3D anatomical knowledge from CT scans by predicting 2D projections in latent space, improving CXR representation learning and disease diagnosis.

DetailsMotivation: Chest X-rays (CXR) are limited by 2D structural superposition, making it challenging to capture 3D anatomies and perform accurate disease diagnosis.

Method: Proposes X-WIN model that distills volumetric knowledge from CT by predicting 2D projections in latent space, using affinity-guided contrastive alignment and incorporating real CXRs via masked image modeling and domain classification.

Result: X-WIN outperforms existing foundation models on diverse downstream tasks using linear probing and few-shot fine-tuning, and can render 2D projections for 3D CT volume reconstruction.

Conclusion: X-WIN successfully bridges the gap between 2D CXR and 3D CT knowledge, enabling improved representation learning and demonstrating the ability to reconstruct 3D volumes from 2D projections.

Abstract: Chest X-ray radiography (CXR) is an essential medical imaging technique for disease diagnosis. However, as 2D projectional images, CXRs are limited by structural superposition and hence fail to capture 3D anatomies. This limitation makes representation learning and disease diagnosis challenging. To address this challenge, we propose a novel CXR world model named X-WIN, which distills volumetric knowledge from chest computed tomography (CT) by learning to predict its 2D projections in latent space. The core idea is that a world model with internalized knowledge of 3D anatomical structure can predict CXRs under various transformations in 3D space. During projection prediction, we introduce an affinity-guided contrastive alignment loss that leverages mutual similarities to capture rich, correlated information across projections from the same volume. To improve model adaptability, we incorporate real CXRs into training through masked image modeling and employ a domain classifier to encourage statistically similar representations for real and simulated CXRs. Comprehensive experiments show that X-WIN outperforms existing foundation models on diverse downstream tasks using linear probing and few-shot fine-tuning. X-WIN also demonstrates the ability to render 2D projections for reconstructing a 3D CT volume.

[73] Unsupervised Discovery of Long-Term Spatiotemporal Periodic Workflows in Human Activities

Fan Yang, Quanting Xie, Atsunori Moteki, Shoichi Masui, Shan Jiang, Yonatan Bisk, Graham Neubig

Main category: cs.CV

TL;DR: A new benchmark for long-term periodic human workflows with 580 multimodal sequences, featuring three evaluation tasks and a lightweight training-free baseline that outperforms existing methods.

DetailsMotivation: Long-term periodic workflows with low-contrast patterns are underexplored compared to short-term periodic activities, creating a gap in understanding complex human activities in manufacturing, sports, and daily life.

Method: Created a benchmark with 580 multimodal human activity sequences and proposed a lightweight, training-free baseline for modeling diverse periodic workflow patterns without requiring annotation or retraining.

Result: The benchmark presents significant challenges to existing methods, while the proposed baseline substantially outperforms competing methods in all evaluation tasks and shows deployment advantages comparable to supervised approaches.

Conclusion: The work successfully bridges the gap in long-term periodic workflow analysis, providing a challenging benchmark and effective baseline that eliminates the need for annotation and retraining in real-world applications.

Abstract: Periodic human activities with implicit workflows are common in manufacturing, sports, and daily life. While short-term periodic activities – characterized by simple structures and high-contrast patterns – have been widely studied, long-term periodic workflows with low-contrast patterns remain largely underexplored. To bridge this gap, we introduce the first benchmark comprising 580 multimodal human activity sequences featuring long-term periodic workflows. The benchmark supports three evaluation tasks aligned with real-world applications: unsupervised periodic workflow detection, task completion tracking, and procedural anomaly detection. We also propose a lightweight, training-free baseline for modeling diverse periodic workflow patterns. Experiments show that: (i) our benchmark presents significant challenges to both unsupervised periodic detection methods and zero-shot approaches based on powerful large language models (LLMs); (ii) our baseline outperforms competing methods by a substantial margin in all evaluation tasks; and (iii) in real-world applications, our baseline demonstrates deployment advantages on par with traditional supervised workflow detection approaches, eliminating the need for annotation and retraining. Our project page is https://sites.google.com/view/periodicworkflow.

[74] RocSync: Millisecond-Accurate Temporal Synchronization for Heterogeneous Camera Systems

Jaro Meyer, Frédéric Giraud, Joschua Wüthrich, Marc Pollefeys, Philipp Fürnstahl, Lilian Calvet

Main category: cs.CV

TL;DR: A low-cost LED Clock method for millisecond-level temporal synchronization of heterogeneous multi-view camera systems (RGB and IR) using visual time encoding.

DetailsMotivation: Synchronizing multiple cameras is challenging in heterogeneous setups without hardware sync capabilities, especially in real-world environments where controlled conditions are infeasible.

Method: Uses a custom LED Clock with red and infrared LEDs to encode time, allowing visual decoding of exposure windows from recorded frames for synchronization.

Result: Achieves 1.34 ms RMSE residual error, outperforms light-, audio-, and timecode-based approaches, and improves downstream tasks like multi-view pose estimation and 3D reconstruction.

Conclusion: The solution simplifies synchronization pipelines and enables advanced vision-based sensing in unconstrained environments including industrial and clinical applications.

Abstract: Accurate spatiotemporal alignment of multi-view video streams is essential for a wide range of dynamic-scene applications such as multi-view 3D reconstruction, pose estimation, and scene understanding. However, synchronizing multiple cameras remains a significant challenge, especially in heterogeneous setups combining professional and consumer-grade devices, visible and infrared sensors, or systems with and without audio, where common hardware synchronization capabilities are often unavailable. This limitation is particularly evident in real-world environments, where controlled capture conditions are not feasible. In this work, we present a low-cost, general-purpose synchronization method that achieves millisecond-level temporal alignment across diverse camera systems while supporting both visible (RGB) and infrared (IR) modalities. The proposed solution employs a custom-built LED Clock that encodes time through red and infrared LEDs, allowing visual decoding of the exposure window (start and end times) from recorded frames for millisecond-level synchronization. We benchmark our method against hardware synchronization and achieve a residual error of 1.34 ms RMSE across multiple recordings. In further experiments, our method outperforms light-, audio-, and timecode-based synchronization approaches and directly improves downstream computer vision tasks, including multi-view pose estimation and 3D reconstruction. Finally, we validate the system in large-scale surgical recordings involving over 25 heterogeneous cameras spanning both IR and RGB modalities. This solution simplifies and streamlines the synchronization pipeline and expands access to advanced vision-based sensing in unconstrained environments, including industrial and clinical applications.
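
To make the synchronization arithmetic concrete, the sketch below assumes the exposure window of each frame has already been decoded from the LED Clock and shows how a per-camera time offset and an RMSE residual could then be computed. The decoding step itself is not shown, and the function names, the use of the mid-exposure time, and the sample numbers are assumptions for illustration rather than the paper's procedure.

```python
import math

def mid_exposure(start_ms, end_ms):
    """Midpoint of a decoded exposure window, in LED-clock milliseconds."""
    return 0.5 * (start_ms + end_ms)

def estimate_offset(camera_ts_ms, exposure_windows_ms):
    """Mean offset that maps a camera's own timestamps onto LED-clock time."""
    diffs = [
        mid_exposure(start, end) - ts
        for ts, (start, end) in zip(camera_ts_ms, exposure_windows_ms)
    ]
    return sum(diffs) / len(diffs)

def rmse(estimates_ms, references_ms):
    """Root-mean-square error between estimated and reference times."""
    squared = [(a - b) ** 2 for a, b in zip(estimates_ms, references_ms)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical camera running at 25 fps with a 10 ms exposure.
camera_ts = [0.0, 40.0, 80.0]
windows = [(100.0, 110.0), (140.0, 150.0), (180.0, 190.0)]

offset = estimate_offset(camera_ts, windows)
residual = rmse([105.0, 145.0], [104.0, 146.0])
print(offset, residual)
```

Adding `offset` to each camera timestamp places that camera's frames on the shared clock, after which frames from different devices can be matched by nearest time.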

[75] Artificial intelligence approaches for energy-efficient laser cutting machines

Mohamed Abdallah Salem, Hamdy Ahmed Ashour, Ahmed Elshenawy

Main category: cs.CV

TL;DR: Novel deep learning methods reduce laser cutting energy consumption by 20-50% through adaptive closed-loop control of suction pumps based on material classification and smoke detection.

DetailsMotivation: Address high energy consumption and environmental impact in laser cutting, particularly the lack of adaptive control and open-loop nature of CO2 laser suction pumps.

Method: Closed-loop system with deep learning: material classification using lens-less speckle sensing with custom CNN and USB camera with VGG16 transfer learning, plus separate DL model for smoke level detection to dynamically adjust pump power.

Result: Experimentally proven 20% to 50% reduction in smoke suction pump energy consumption through automatic pump halting during inactive times and dynamic power adjustment during operation.

Conclusion: Significant contribution to sustainable development in manufacturing through substantial energy savings in laser cutting processes using adaptive deep learning control systems.

Abstract: This research addresses the significant challenges of energy consumption and environmental impact in laser cutting by proposing novel deep learning (DL) methodologies to achieve energy reduction. Recognizing the current lack of adaptive control and the open-loop nature of CO2 laser suction pumps, this study utilizes closed-loop configurations that dynamically adjust pump power based on both the material being cut and the smoke level generated. To implement this adaptive system, diverse material classification methods are introduced, including techniques leveraging lens-less speckle sensing with a customized Convolutional Neural Network (CNN) and an approach using a USB camera with transfer learning via the pre-trained VGG16 CNN model. Furthermore, a separate DL model for smoke level detection is employed to simultaneously refine the pump’s power output. This integration prompts the exhaust suction pump to automatically halt during inactive times and dynamically adjust power during operation, leading to experimentally proven and remarkable energy savings, with results showing a 20% to 50% reduction in the smoke suction pump’s energy consumption, thereby contributing substantially to sustainable development in the manufacturing sector.
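
The closed-loop behaviour described above, halting the pump when the cutter is idle and scaling its power with material and smoke level, might look roughly like the following. This is a sketch under stated assumptions: the material classes, the duty values in `BASE_DUTY`, the linear scaling rule, and the function name `pump_duty` are hypothetical, and the classifier and smoke-detection models are represented only by their outputs.

```python
# Hypothetical base pump duty per material class, as a fraction of full power.
BASE_DUTY = {"wood": 0.8, "acrylic": 0.6, "cardboard": 0.5}

def pump_duty(cutting_active, material, smoke_level):
    """Return the suction pump duty cycle in [0, 1].

    cutting_active: whether the laser is currently cutting.
    material: label from a (hypothetical) material classifier.
    smoke_level: value in [0, 1] from a (hypothetical) smoke-detection model.
    """
    if not cutting_active:
        return 0.0  # halt the pump during inactive periods
    base = BASE_DUTY.get(material, 1.0)  # unknown material: fall back to full power
    duty = base * (0.5 + 0.5 * smoke_level)  # scale between 50% and 100% of base
    return max(0.0, min(1.0, duty))

idle = pump_duty(False, "wood", 0.9)
heavy_smoke = pump_duty(True, "acrylic", 1.0)
unknown_clear = pump_duty(True, "unknown", 0.0)
print(idle, heavy_smoke, unknown_clear)
```

Falling back to full power for an unrecognized material is a conservative choice made here for safety; the source does not say how the authors handle that case.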

[76] EGSA-PT: Edge-Guided Spatial Attention with Progressive Training for Monocular Depth Estimation and Segmentation of Transparent Objects

Gbenga Omotara, Ramy Farag, Seyed Mohamad Ali Tousi, G. N. DeSouza

Main category: cs.CV

TL;DR: EGSA improves transparent object perception by using edge-guided fusion to reduce negative interactions between semantic and geometric tasks, with multi-modal progressive training that transitions from RGB to depth-based edges.

DetailsMotivation: Transparent object perception is challenging due to transparency confounding depth estimation and segmentation. Multi-task learning often suffers from negative cross-task interactions that hinder performance.

Method: Edge-Guided Spatial Attention (EGSA) fusion mechanism incorporates boundary information between semantic and geometric features. Uses multi-modal progressive training transitioning from RGB-derived edges to depth-derived edges without requiring ground-truth depth.

Result: EGSA consistently improved depth accuracy over state-of-the-art MODEST on Syn-TODD and ClearPose benchmarks, with largest improvements in transparent regions, while maintaining competitive segmentation performance.

Conclusion: Edge-guided fusion is a robust approach for improving transparent object perception, effectively mitigating destructive interactions between semantic and geometric tasks.

Abstract: Transparent object perception remains a major challenge in computer vision research, as transparency confounds both depth estimation and semantic segmentation. Recent work has explored multi-task learning frameworks to improve robustness, yet negative cross-task interactions often hinder performance. In this work, we introduce Edge-Guided Spatial Attention (EGSA), a fusion mechanism designed to mitigate destructive interactions by incorporating boundary information into the fusion between semantic and geometric features. On both Syn-TODD and ClearPose benchmarks, EGSA consistently improved depth accuracy over the current state-of-the-art method (MODEST), while preserving competitive segmentation performance, with the largest improvements appearing in transparent regions. Besides our fusion design, our second contribution is a multi-modal progressive training strategy, where learning transitions from edges derived from RGB images to edges derived from predicted depth images. This approach allows the system to bootstrap learning from the rich textures contained in RGB images and then switch to more relevant geometric content in depth maps, while eliminating the need for ground-truth depth at training time. Together, these contributions highlight edge-guided fusion as a robust approach capable of improving transparent object perception.

[77] Logit-Based Losses Limit the Effectiveness of Feature Knowledge Distillation

Nicholas Cooper, Lijun Chen, Sailesh Dwivedy, Danna Gurari

Main category: cs.CV

TL;DR: Proposes a feature-only knowledge distillation framework without logit-based losses, using geometric analysis to identify optimal teacher layers for distillation, achieving state-of-the-art performance with up to 15% accuracy improvement.

DetailsMotivation: Current KD methods rely on both logit-based and feature-based losses, but this work explores whether feature-based losses alone can effectively transfer knowledge from teacher to student models.

Method: Uses feature-based losses exclusively (no logit losses), introduces a knowledge quality metric based on latent representation geometry to identify the most effective teacher layers for distillation.

Result: Achieves state-of-the-art performance on three image classification datasets with four student-teacher pairs (CNNs and ViTs), delivering top-1 accuracy boosts of up to 15% over standard approaches.

Conclusion: Feature-only knowledge distillation with geometric layer selection is highly effective, outperforming traditional methods that combine logit and feature losses.

Abstract: Knowledge distillation (KD) methods can transfer knowledge of a parameter-heavy teacher model to a light-weight student model. The status quo for feature KD methods is to utilize loss functions based on logits (i.e., pre-softmax class scores) and intermediate layer features (i.e., latent representations). Unlike previous approaches, we propose a feature KD framework for training the student’s backbone using feature-based losses exclusively (i.e., without logit-based losses such as cross entropy). Leveraging recent discoveries about the geometry of latent representations, we introduce a knowledge quality metric for identifying which teacher layers provide the most effective knowledge for distillation. Experiments on three image classification datasets with four diverse student-teacher pairs, spanning convolutional neural networks and vision transformers, demonstrate that our KD method achieves state-of-the-art performance, delivering top-1 accuracy boosts of up to 15% over standard approaches. We publicly share our code to facilitate future work at https://github.com/Thegolfingocto/KD_wo_CE.
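
A feature-only distillation objective of the kind described, with no logit or cross-entropy term, can be sketched as below. This is an illustrative assumption rather than the authors' loss: mean squared error is used as a generic feature-matching loss, student and teacher features are assumed to already share the same dimensionality, and `layer_ids` stands in for whichever teacher layers the paper's knowledge quality metric would select (that metric is not reproduced here).

```python
def feature_mse(student_feats, teacher_feats):
    """Mean squared error between two equally shaped batches of feature vectors."""
    total, count = 0.0, 0
    for s_vec, t_vec in zip(student_feats, teacher_feats):
        if len(s_vec) != len(t_vec):
            raise ValueError("student and teacher feature sizes must match")
        for s, t in zip(s_vec, t_vec):
            total += (s - t) ** 2
            count += 1
    return total / count

def feature_only_loss(student_layers, teacher_layers, layer_ids):
    """Sum feature-matching losses over the chosen layers; no logit-based term."""
    return sum(
        feature_mse(student_layers[i], teacher_layers[i]) for i in layer_ids
    )

# Hypothetical features keyed by layer index: one sample, two dimensions.
student = {2: [[1.0, 2.0]], 3: [[0.0, 0.0]]}
teacher = {2: [[1.0, 4.0]], 3: [[0.0, 0.0]]}

loss = feature_only_loss(student, teacher, layer_ids=[2, 3])
print(loss)
```

In practice a learned projection is usually needed when student and teacher widths differ; it is omitted here to keep the example focused on the absence of a logit term.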

[78] Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation

Vladimir Arkhipkin, Vladimir Korviakov, Nikolai Gerasimenko, Denis Parkhomenko, Viacheslav Vasilev, Alexey Letunovskiy, Maria Kovaleva, Nikolai Vaulin, Ivan Kirillov, Lev Novitskiy, Denis Koposov, Nikita Kiselev, Alexander Varlamov, Dmitrii Mikhailov, Vladimir Polovnikov, Andrey Shutkin, Ilya Vasiliev, Julia Agafonova, Anastasiia Kargapoltseva, Anna Dmitrienko, Anastasia Maltseva, Anna Averchenkova, Olga Kim, Tatiana Nikulina, Denis Dimitrov

Main category: cs.CV

TL;DR: Kandinsky 5.0 is a family of foundation models for high-resolution image and 10-second video synthesis, featuring three model line-ups with varying parameter sizes and capabilities, supported by comprehensive data curation and training techniques.

DetailsMotivation: To advance the development and accessibility of high-quality generative models by creating a large-scale, publicly available framework that leverages extensive pre-training and quality-enhancement techniques for diverse generative applications.

Method: Multi-stage training pipeline with comprehensive data curation (collection, processing, filtering, clustering), extensive pre-training, and quality-enhancement techniques including self-supervised fine-tuning (SFT) and reinforcement learning (RL)-based post-training, along with novel architectural, training, and inference optimizations.

Result: Achieves high generation speeds and state-of-the-art performance across various tasks, as demonstrated by human evaluation. The family comprises three line-ups: 6B-parameter image generation models (Image Lite), 2B-parameter text-to-video and image-to-video models (Video Lite), and 19B-parameter models with the highest video generation quality (Video Pro).

Conclusion: Kandinsky 5.0 represents a significant advancement in generative modeling that can be adapted for a wide range of applications, with open-source code and training checkpoints released to advance research community development and accessibility.

Abstract: This report introduces Kandinsky 5.0, a family of state-of-the-art foundation models for high-resolution image and 10-second video synthesis. The framework comprises three core line-ups of models: Kandinsky 5.0 Image Lite - a line-up of 6B parameter image generation models, Kandinsky 5.0 Video Lite - fast and lightweight 2B parameter text-to-video and image-to-video models, and Kandinsky 5.0 Video Pro - 19B parameter models that achieve superior video generation quality. We provide a comprehensive review of the data curation lifecycle - including collection, processing, filtering and clustering - for the multi-stage training pipeline that involves extensive pre-training and incorporates quality-enhancement techniques such as self-supervised fine-tuning (SFT) and reinforcement learning (RL)-based post-training. We also present novel architectural, training, and inference optimizations that enable Kandinsky 5.0 to achieve high generation speeds and state-of-the-art performance across various tasks, as demonstrated by human evaluation. As a large-scale, publicly available generative framework, Kandinsky 5.0 leverages the full potential of its pre-training and subsequent stages to be adapted for a wide range of generative applications. We hope that this report, together with the release of our open-source code and training checkpoints, will substantially advance the development and accessibility of high-quality generative models for the research community.

[79] FinCriticalED: A Visual Benchmark for Financial Fact-Level OCR Evaluation

Yueru He, Xueqing Peng, Yupeng Cao, Yan Wang, Lingfei Qian, Haohang Li, Yi Han, Ruoyu Xiang, Mingquan Lin, Prayag Tiwari, Jimin Huang, Guojun Xiong, Sophia Ananiadou

Main category: cs.CV

TL;DR: FinCriticalED is a visual benchmark for evaluating OCR and vision language models on financial documents at the fact level, focusing on detecting critical errors in numerical and temporal information.

DetailsMotivation: Financial documents contain visually dense layouts where small OCR errors (like sign inversion or shifted dates) can lead to materially different interpretations, while traditional metrics only capture surface-level text similarity.

Method: Created 500 image-HTML pairs with expert-annotated financial facts covering 700+ numerical/temporal facts, developed LLM-as-Judge evaluation pipeline for structured fact extraction and contextual verification.

Result: Strongest proprietary models achieve highest factual accuracy but substantial errors remain in visually intricate numerical and temporal contexts.

Conclusion: FinCriticalED provides rigorous foundation for advancing visual factual precision in financial and other precision-critical domains.

Abstract: We introduce FinCriticalED (Financial Critical Error Detection), a visual benchmark for evaluating OCR and vision language models on financial documents at the fact level. Financial documents contain visually dense and table-heavy layouts where numerical and temporal information is tightly coupled with structure. In high-stakes settings, small OCR mistakes such as sign inversion or shifted dates can lead to materially different interpretations, while traditional OCR metrics like ROUGE and edit distance capture only surface-level text similarity. FinCriticalED provides 500 image-HTML pairs with expert-annotated financial facts covering over seven hundred numerical and temporal facts. It introduces three key contributions. First, it establishes the first fact-level evaluation benchmark for financial document understanding, shifting evaluation from lexical overlap to domain-critical factual correctness. Second, all annotations are created and verified by financial experts with strict quality control over signs, magnitudes, and temporal expressions. Third, we develop an LLM-as-Judge evaluation pipeline that performs structured fact extraction and contextual verification for visually complex financial documents. We benchmark OCR systems, open-source vision language models, and proprietary models on FinCriticalED. Results show that although the strongest proprietary models achieve the highest factual accuracy, substantial errors remain in visually intricate numerical and temporal contexts. Through quantitative evaluation and expert case studies, FinCriticalED provides a rigorous foundation for advancing visual factual precision in financial and other precision-critical domains.
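
The gap between surface similarity and factual correctness can be shown with a small sketch. It computes a standard Levenshtein edit distance and a naive fact-level score on an invented OCR example with a dropped minus sign. The example strings and the substring-based `fact_accuracy` helper are hypothetical; the benchmark itself uses an LLM-as-Judge pipeline, which this does not reproduce.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def fact_accuracy(expected_facts, ocr_text):
    """Fraction of expected fact strings found verbatim in the OCR output.

    Naive substring matching, used here only as a stand-in for real
    fact extraction and verification.
    """
    hits = sum(1 for fact in expected_facts if fact in ocr_text)
    return hits / len(expected_facts)


# Invented example: the OCR output drops the minus sign (sign inversion).
reference = "Net income: -1,250 as of 2024-03-31"
ocr_output = "Net income: 1,250 as of 2024-03-31"
facts = ["-1,250", "2024-03-31"]

distance = edit_distance(reference, ocr_output)  # one deleted character
accuracy = fact_accuracy(facts, ocr_output)      # one of two facts survives
```

A single-character edit looks negligible under edit distance, yet half of the annotated facts in this toy example are now wrong.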

[80] CKDA: Cross-modality Knowledge Disentanglement and Alignment for Visible-Infrared Lifelong Person Re-identification

Zhenyu Cui, Jiahuan Zhou, Yuxin Peng

Main category: cs.CV

TL;DR: Proposes CKDA method for Visible-Infrared Lifelong person Re-ID to address collaborative forgetting by disentangling modality-specific and modality-common knowledge using prompting modules and cross-modal alignment.

DetailsMotivation: Existing methods for VI-LReID suffer from mutual interference between modality-specific knowledge acquisition and modality-common knowledge anti-forgetting, leading to collaborative forgetting.

Method: Uses Modality-Common Prompting (MCP) and Modality-Specific Prompting (MSP) modules to disentangle knowledge, plus Cross-modal Knowledge Alignment (CKA) module to align new and old knowledge in inter- and intra-modality spaces.

Result: Extensive experiments on four benchmark datasets show effectiveness and superiority over state-of-the-art methods.

Conclusion: CKDA successfully addresses collaborative forgetting in VI-LReID by balanced knowledge disentanglement and alignment.

Abstract: Lifelong person Re-IDentification (LReID) aims to match the same person employing continuously collected individual data from different scenarios. To achieve continuous all-day person matching across day and night, Visible-Infrared Lifelong person Re-IDentification (VI-LReID) focuses on sequential training on data from visible and infrared modalities and pursues average performance over all data. To this end, existing methods typically exploit cross-modal knowledge distillation to alleviate the catastrophic forgetting of old knowledge. However, these methods ignore the mutual interference of modality-specific knowledge acquisition and modality-common knowledge anti-forgetting, where conflicting knowledge leads to collaborative forgetting. To address the above problems, this paper proposes a Cross-modality Knowledge Disentanglement and Alignment method, called CKDA, which explicitly separates and preserves modality-specific knowledge and modality-common knowledge in a balanced way. Specifically, a Modality-Common Prompting (MCP) module and a Modality-Specific Prompting (MSP) module are proposed to explicitly disentangle and purify discriminative information that coexists and is specific to different modalities, avoiding mutual interference between the two kinds of knowledge. In addition, a Cross-modal Knowledge Alignment (CKA) module is designed to further align the disentangled new knowledge with the old one in two mutually independent inter- and intra-modality feature spaces based on dual-modality prototypes in a balanced manner. Extensive experiments on four benchmark datasets verify the effectiveness and superiority of our CKDA against state-of-the-art methods. The source code of this paper is available at https://github.com/PKU-ICST-MIPL/CKDA-AAAI2026.

[81] Complex-Valued 2D Gaussian Representation for Computer-Generated Holography

Yicheng Zhan, Xiangjun Gao, Long Quan, Kaan Akşit

Main category: cs.CV

TL;DR: A new hologram representation using structured complex-valued 2D Gaussian primitives reduces the parameter search space by up to 10:1, enabling faster optimization with lower VRAM usage while achieving higher-fidelity reconstructions.

DetailsMotivation: To overcome limitations of per-pixel hologram storage and enable more scalable hologram estimation in next-generation computer-generated holography systems by reducing the parameter search space.

Method: Proposes structured complex-valued 2D Gaussian primitives representation with a differentiable rasterizer and GPU-optimized light propagation kernel for end-to-end training, plus conversion to practical hologram formats.

Result: Achieves up to 2.5x lower VRAM usage, 50% faster optimization, higher-fidelity reconstructions, and effective noise artifact suppression in phase-only holograms compared to existing methods.

Conclusion: The representation enables more scalable hologram estimation by significantly reducing parameter search space while maintaining high quality, making it suitable for next-generation holography systems.

Abstract: We propose a new hologram representation based on structured complex-valued 2D Gaussian primitives, which replaces per-pixel information storage and reduces the parameter search space by up to 10:1. To enable end-to-end training, we develop a differentiable rasterizer for our representation, integrated with a GPU-optimized light propagation kernel in free space. Our extensive experiments show that our method achieves up to 2.5x lower VRAM usage and 50% faster optimization while producing higher-fidelity reconstructions than existing methods. We further introduce a conversion procedure that adapts our representation to practical hologram formats, including smooth and random phase-only holograms. Our experiments show that this procedure can effectively suppress noise artifacts observed in previous methods. By reducing the hologram parameter search space, our representation enables a more scalable hologram estimation in the next-generation computer-generated holography systems.
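
To make the idea of replacing per-pixel storage with Gaussian primitives concrete, the sketch below sums a few complex-valued 2D Gaussians onto a pixel grid. The axis-aligned parameterization (centre, two widths, amplitude, phase) and all names are assumptions for illustration; the paper's structured primitives, differentiable rasterizer, and light propagation kernel are not reproduced here.

```python
import cmath
import math


def gaussian_field(primitives, width, height):
    """Rasterize axis-aligned complex-valued Gaussians onto a pixel grid.

    Each primitive is (mean_x, mean_y, sigma_x, sigma_y, amplitude, phase),
    a hypothetical six-parameter layout chosen for this sketch.
    """
    field = [[0j for _ in range(width)] for _ in range(height)]
    for mx, my, sx, sy, amplitude, phase in primitives:
        weight = amplitude * cmath.exp(1j * phase)  # complex coefficient
        for y in range(height):
            for x in range(width):
                envelope = math.exp(-((x - mx) ** 2 / (2 * sx ** 2)
                                      + (y - my) ** 2 / (2 * sy ** 2)))
                field[y][x] += weight * envelope
    return field


primitives = [
    (2.0, 2.0, 1.0, 1.0, 1.0, 0.0),
    (5.0, 4.0, 1.5, 0.8, 0.5, math.pi / 2),
]
field = gaussian_field(primitives, width=8, height=8)

per_pixel_params = 2 * 8 * 8             # real and imaginary part per pixel
gaussian_params = 6 * len(primitives)    # six numbers per primitive
```

The optimizer then searches over the primitive parameters rather than over every pixel, which is where a smaller parameter search space comes from. The ratio in this toy grid is arbitrary and is not the paper's reported figure.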

[82] Computer Vision Modeling of the Development of Geometric and Numerical Concepts in Humans

Zekun Wang, Sashank Varma

Main category: cs.CV

TL;DR: Computer vision models show partial developmental alignment with human mathematical cognition: during training, ResNet-50 mirrors children’s learning trajectories for some classes of geometric concepts and for the emergence of a mental number line.

DetailsMotivation: To investigate whether computer vision models exhibit developmental alignment - whether their performance improvements during training mirror the developmental progressions observed in children's mathematical understanding.

Method: Case study of ResNet-50 model analyzing its learning trajectories for geometric and numerical concepts, comparing model performance across training epochs with human developmental data.

Result: Found developmental alignment for some geometric concepts (Euclidean Geometry, Geometrical Figures, Metric Properties, Topology) but not others (Chiral Figures, Geometric Transformations, Symmetrical Figures). Also found alignment in emergence of mental number line representation.

Conclusion: Computer vision models show promise for understanding human mathematical development and point to future research exploring different architectures and larger benchmarks.

Abstract: Mathematical thinking is a fundamental aspect of human cognition. Cognitive scientists have investigated the mechanisms that underlie our ability to think geometrically and numerically, to take two prominent examples, and developmental scientists have documented the trajectories of these abilities over the lifespan. Prior research has shown that computer vision (CV) models trained on the unrelated task of image classification nevertheless learn latent representations of geometric and numerical concepts similar to those of adults. Building on this demonstrated cognitive alignment, the current study investigates whether CV models also show developmental alignment: whether their performance improvements across training match the developmental progressions observed in children. In a detailed case study of the ResNet-50 model, we show that this is the case. For the case of geometry and topology, we find developmental alignment for some classes of concepts (Euclidean Geometry, Geometrical Figures, Metric Properties, Topology) but not others (Chiral Figures, Geometric Transformations, Symmetrical Figures). For the case of number, we find developmental alignment in the emergence of a human-like “mental number line” representation with experience. These findings show the promise of computer vision models for understanding the development of mathematical understanding in humans. They point the way to future research exploring additional model architectures and building larger benchmarks.

[83] UniHOI: Unified Human-Object Interaction Understanding via Unified Token Space

Panqi Yang, Haodong Jing, Nanning Zheng, Yongqiang Ma

Main category: cs.CV

TL;DR: UniHOI unifies HOI detection and generation through a shared token space, achieving state-of-the-art results with 4.9% accuracy improvement in detection and 42.0% metric boost in generation.

DetailsMotivation: Traditional separation of HOI detection and generation tasks hinders comprehensive interaction understanding, so a unified approach is needed to enable knowledge sharing and improve generalization.

Method: Proposes UniHOI with symmetric interaction-aware attention module and unified semi-supervised learning paradigm, creating bidirectional mapping between images and interaction semantics in a unified token space.

Result: Achieves state-of-the-art performance: 4.9% accuracy improvement on long-tailed HOI detection and 42.0% boost in interaction metrics on open-vocabulary generation tasks.

Conclusion: Joint modeling of HOI detection and generation via unified token space effectively promotes knowledge sharing and enhances generalization capabilities.

Abstract: In the field of human-object interaction (HOI), detection and generation are two dual tasks that have traditionally been addressed separately, hindering the development of comprehensive interaction understanding. To address this, we propose UniHOI, which jointly models HOI detection and generation via a unified token space, thereby effectively promoting knowledge sharing and enhancing generalization. Specifically, we introduce a symmetric interaction-aware attention module and a unified semi-supervised learning paradigm, enabling effective bidirectional mapping between images and interaction semantics even under limited annotations. Extensive experiments demonstrate that UniHOI achieves state-of-the-art performance in both HOI detection and generation. Specifically, UniHOI improves accuracy by 4.9% on long-tailed HOI detection and boosts interaction metrics by 42.0% on open-vocabulary generation tasks.

[84] Hyperspectral Super-Resolution with Inter-Image Variability via Degradation-based Low-Rank and Residual Fusion Method

Yue Wen, Kunjing Yang, Minru Bai

Main category: cs.CV

TL;DR: Proposes DLRRF model to handle spectral variability and spatial changes in HSI-MSI fusion by modeling spectral degradation changes and decomposing target HSI into low-rank and residual components.

DetailsMotivation: Inter-image variability (spectral variability and spatially localized changes) between HSI and MSI significantly affects fusion performance, and existing methods that apply direct transformations exacerbate model ill-posedness.

Method: Models spectral variability as changes in spectral degradation operator, decomposes target HSI into low-rank and residual components for detail recovery, uses dimensionality reduction via spectral correlation, and employs implicit regularization with PAO algorithm in PnP framework.

Result: Extensive numerical experiments demonstrate superior performance in fusing HSI and MSI with inter-image variability compared to existing methods.

Conclusion: DLRRF effectively addresses inter-image variability challenges in HSI-MSI fusion through degradation-based modeling and component decomposition, achieving enhanced fusion performance.

Abstract: The fusion of hyperspectral image (HSI) with multispectral image (MSI) provides an effective way to enhance the spatial resolution of HSI. However, due to different acquisition conditions, there may exist spectral variability and spatially localized changes between HSI and MSI, referred to as inter-image variability, which can significantly affect the fusion performance. Existing methods typically handle inter-image variability by applying direct transformations to the images themselves, which can exacerbate the ill-posedness of the fusion model. To address this challenge, we propose a Degradation-based Low-Rank and Residual Fusion (DLRRF) model. First, we model the spectral variability as change in the spectral degradation operator. Second, to recover the lost spatial details caused by spatially localized changes, we decompose the target HSI into low rank and residual components, where the latter is used to capture the lost details. By exploiting the spectral correlation within the images, we perform dimensionality reduction on both components. Additionally, we introduce an implicit regularizer to utilize the spatial prior information from the images. The proposed DLRRF model is solved using the Proximal Alternating Optimization (PAO) algorithm within a Plug-and-Play (PnP) framework, where the subproblem regarding implicit regularizer is addressed by an external denoiser. We further provide a comprehensive convergence analysis of the algorithm. Finally, extensive numerical experiments demonstrate that DLRRF achieves superior performance in fusing HSI and MSI with inter-image variability.

[85] CellGenNet: A Knowledge-Distilled Framework for Robust Cell Segmentation in Cancer Tissues

Srijan Ray, Bikesh K. Nirala, Jason T. Yustein, Sundaresh Ram

Main category: cs.CV

TL;DR: CellGenNet is a knowledge distillation framework for robust cross-tissue cell segmentation in microscopy images using limited supervision, achieving improved accuracy over supervised and semi-supervised methods.

DetailsMotivation: Nuclei segmentation in whole slide images is challenging due to variability in staining, imaging conditions, and tissue morphology, requiring robust methods that work with limited annotations.

Method: Uses student-teacher architecture with capacity teacher trained on sparse annotations to generate soft pseudo-labels. Student optimized with joint objective combining ground-truth labels, teacher-derived probabilistic targets, and hybrid loss (binary cross-entropy + Tversky loss) with asymmetric penalties. Includes consistency regularization and layerwise dropout for stable feature representations.

Result: Experiments across diverse cancer tissue WSIs show improved segmentation accuracy and generalization over supervised and semi-supervised baselines.

Conclusion: CellGenNet supports scalable and reproducible histopathology analysis through robust cross-tissue cell segmentation under limited supervision.

Abstract: Accurate nuclei segmentation in microscopy whole slide images (WSIs) remains challenging due to variability in staining, imaging conditions, and tissue morphology. We propose CellGenNet, a knowledge distillation framework for robust cross-tissue cell segmentation under limited supervision. CellGenNet adopts a student-teacher architecture, where a capacity teacher is trained on sparse annotations and generates soft pseudo-labels for unlabeled regions. The student is optimized using a joint objective that integrates ground-truth labels, teacher-derived probabilistic targets, and a hybrid loss function combining binary cross-entropy and Tversky loss, enabling asymmetric penalties to mitigate class imbalance and better preserve minority nuclear structures. Consistency regularization and layerwise dropout further stabilize feature representations and promote reliable feature transfer. Experiments across diverse cancer tissue WSIs show that CellGenNet improves segmentation accuracy and generalization over supervised and semi-supervised baselines, supporting scalable and reproducible histopathology analysis.
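
The hybrid loss described above can be sketched with the standard definitions of binary cross-entropy and the Tversky index. The weights `fp_weight`, `fn_weight`, and `mix` are placeholder values chosen for illustration, not values from the paper, and the function names are hypothetical.

```python
import math


def bce_loss(probs, targets, eps=1e-7):
    """Mean binary cross-entropy; targets may be hard labels or soft targets."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(probs)


def tversky_loss(probs, targets, fp_weight=0.3, fn_weight=0.7, eps=1e-7):
    """One minus the soft Tversky index TP / (TP + a*FP + b*FN)."""
    tp = sum(p * t for p, t in zip(probs, targets))
    fp = sum(p * (1.0 - t) for p, t in zip(probs, targets))
    fn = sum((1.0 - p) * t for p, t in zip(probs, targets))
    return 1.0 - (tp + eps) / (tp + fp_weight * fp + fn_weight * fn + eps)


def hybrid_loss(probs, targets, mix=0.5):
    """Weighted sum of the two terms; `mix` is an assumed balancing factor."""
    return mix * bce_loss(probs, targets) + (1.0 - mix) * tversky_loss(probs, targets)


# Two pixels: one nucleus (target 1) and one background pixel (target 0).
missed = tversky_loss([0.0, 0.0], [1.0, 0.0])       # nucleus missed entirely
false_alarm = tversky_loss([1.0, 1.0], [1.0, 0.0])  # background marked as nucleus
perfect = hybrid_loss([1.0, 0.0], [1.0, 0.0])
```

With the false-negative weight larger than the false-positive weight, missing a nucleus costs more than a false alarm, which is one way an asymmetric penalty can protect a minority class.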

[86] ProPL: Universal Semi-Supervised Ultrasound Image Segmentation via Prompt-Guided Pseudo-Labeling

Yaxiong Chen, Qicong Wang, Chunlei Li, Jingliang Hu, Yilei Shi, Shengwu Xiong, Xiao Xiang Zhu, Lichao Mou

Main category: cs.CV

TL;DR: ProPL is a universal semi-supervised framework for ultrasound image segmentation that handles multiple organs and tasks using prompt-guided dual decoders and uncertainty-driven pseudo-label calibration.

DetailsMotivation: Existing ultrasound segmentation methods are specialized for specific anatomical structures or tasks, limiting their clinical utility. There's a need for a universal approach that can handle multiple organs and leverage both labeled and unlabeled data.

Method: ProPL uses a shared vision encoder with prompt-guided dual decoders, featuring a prompting-upon-decoding mechanism for task adaptation and an uncertainty-driven pseudo-label calibration (UPLC) module for reliable self-training.

Result: ProPL outperforms state-of-the-art methods across various metrics and establishes a new benchmark for universal ultrasound image segmentation, validated on a comprehensive dataset spanning 5 organs and 8 segmentation tasks.

Conclusion: ProPL successfully pioneers universal semi-supervised ultrasound image segmentation, demonstrating superior performance and practical utility for clinical applications through its flexible task adaptation and reliable self-training capabilities.

Abstract: Existing approaches for the problem of ultrasound image segmentation, whether supervised or semi-supervised, are typically specialized for specific anatomical structures or tasks, limiting their practical utility in clinical settings. In this paper, we pioneer the task of universal semi-supervised ultrasound image segmentation and propose ProPL, a framework that can handle multiple organs and segmentation tasks while leveraging both labeled and unlabeled data. At its core, ProPL employs a shared vision encoder coupled with prompt-guided dual decoders, enabling flexible task adaptation through a prompting-upon-decoding mechanism and reliable self-training via an uncertainty-driven pseudo-label calibration (UPLC) module. To facilitate research in this direction, we introduce a comprehensive ultrasound dataset spanning 5 organs and 8 segmentation tasks. Extensive experiments demonstrate that ProPL outperforms state-of-the-art methods across various metrics, establishing a new benchmark for universal ultrasound image segmentation.
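
One common way to make self-training more reliable is to discard pseudo-labels whose predictive distribution is too uncertain. The sketch below uses normalized Shannon entropy with a fixed threshold; this is an assumed stand-in for illustration and may differ from how the paper's UPLC module calibrates pseudo-labels. All names and the threshold value are hypothetical.

```python
import math


def normalized_entropy(probs):
    """Shannon entropy of a class distribution, scaled to the range [0, 1]."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(len(probs))


def filter_pseudo_labels(pixel_probs, max_uncertainty=0.5):
    """Return (label, keep) per pixel; uncertain pixels are flagged for exclusion."""
    results = []
    for probs in pixel_probs:
        label = max(range(len(probs)), key=probs.__getitem__)  # argmax class
        keep = normalized_entropy(probs) <= max_uncertainty
        results.append((label, keep))
    return results


# Hypothetical per-pixel class probabilities from a model on unlabeled data.
predictions = [[0.95, 0.05], [0.55, 0.45], [0.10, 0.90]]
pseudo = filter_pseudo_labels(predictions)
```

The second pixel is nearly a coin flip between the two classes, so its pseudo-label is flagged and would be left out of the self-training loss.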

[87] ESA: Energy-Based Shot Assembly Optimization for Automatic Video Editing

Yaosen Chen, Wei Wang, Tianheng Zheng, Xuming Wen, Han Yang, Yanru Zhang

Main category: cs.CV

TL;DR: An energy-based optimization method for video shot assembly that learns from reference videos to automatically arrange shots according to narrative logic and artistic styles.

DetailsMotivation: Current intelligent video editing technologies fail to capture creators' unique artistic expression in shot assembly, which has traditionally been manually executed by experienced editors.

Method: Visual-semantic matching between LLM-generated scripts and video library, shot segmentation and attribute extraction (size, motion, semantics), energy-based models to score sequences based on reference styles, and multi-syntax rule optimization.

Result: The system produces videos that align with reference assembly styles, enabling automated shot arrangement that maintains artistic expression and creates coherent visual sequences.

Conclusion: The proposed method successfully automates shot assembly while learning reference video styles, allowing even inexperienced users to create visually compelling videos with specific artistic expression.

Abstract: Shot assembly is a crucial step in film production and video editing, involving the sequencing and arrangement of shots to construct a narrative, convey information, or evoke emotions. Traditionally, this process has been manually executed by experienced editors. While current intelligent video editing technologies can handle some automated video editing tasks, they often fail to capture the creator’s unique artistic expression in shot assembly. To address this challenge, we propose an energy-based optimization method for video shot assembly. Specifically, we first perform visual-semantic matching between the script generated by a large language model and a video library to obtain subsets of candidate shots aligned with the script semantics. Next, we segment and label the shots from reference videos, extracting attributes such as shot size, camera motion, and semantics. We then employ energy-based models to learn from these attributes, scoring candidate shot sequences based on their alignment with reference styles. Finally, we achieve shot assembly optimization by combining multiple syntax rules, producing videos that align with the assembly style of the reference videos. Our method not only automates the arrangement and combination of independent shots according to specific logic, narrative requirements, or artistic styles but also learns the assembly style of reference videos, creating a coherent visual sequence or holistic visual expression. With our system, even users with no prior video editing experience can create visually compelling videos. Project page: https://sobeymil.github.io/esa.com
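
The scoring step can be illustrated with a deliberately simple energy function. The sketch estimates shot-size transition probabilities from reference sequences and assigns lower energy to candidate sequences whose transitions are common in the references. This count-based model, the add-one smoothing, and the shot labels are assumptions for illustration; the paper learns its energy-based models from richer attributes such as camera motion and semantics.

```python
import math
from collections import Counter
from itertools import permutations


def fit_transitions(reference_sequences, labels):
    """Estimate P(next | previous) with add-one smoothing."""
    counts = Counter()
    for seq in reference_sequences:
        counts.update(zip(seq, seq[1:]))
    table = {}
    for a in labels:
        total = sum(counts[(a, b)] for b in labels) + len(labels)
        for b in labels:
            table[(a, b)] = (counts[(a, b)] + 1) / total
    return table


def energy(sequence, table):
    """Negative log-likelihood of the transitions; lower means closer to the style."""
    return -sum(math.log(table[pair]) for pair in zip(sequence, sequence[1:]))


labels = ["wide", "medium", "close"]
references = [
    ["wide", "medium", "close"],
    ["wide", "medium", "close", "medium"],
]
table = fit_transitions(references, labels)

# Exhaustive search is only feasible for a handful of candidate shots.
best = min(permutations(labels), key=lambda seq: energy(seq, table))
```

For these toy references the lowest-energy ordering is wide, medium, close, matching the progression that dominates the reference videos.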

[88] Evaluating Multimodal Large Language Models on Vertically Written Japanese Text

Keito Sasagawa, Shuhei Kurita, Daisuke Kawahara

Main category: cs.CV

TL;DR: Existing MLLMs perform poorly on vertically written Japanese text compared to horizontal text, but training on synthesized Japanese OCR datasets improves their vertical text reading capabilities.

DetailsMotivation: Multimodal LLMs need to process Japanese documents, which often contain vertical writing, but current research lacks focus on vertically written Japanese text.

Method: Created synthetic Japanese OCR dataset with horizontal and vertical text for fine-tuning and evaluation, plus real-world vertical document images for testing.

Result: MLLMs perform worse on vertical Japanese text than horizontal text, but training on the synthesized dataset significantly improves vertical text handling.

Conclusion: Specialized training datasets are needed for MLLMs to effectively process vertically written Japanese documents, and the proposed approach successfully addresses this gap.

Abstract: Multimodal Large Language Models (MLLMs) have seen rapid advances in recent years and are now being applied to visual document understanding tasks. They are expected to process a wide range of document images across languages, including Japanese. Understanding documents from images requires models to read what is written in them. Since some Japanese documents are written vertically, support for vertical writing is essential. However, research specifically focused on vertically written Japanese text remains limited. In this study, we evaluate the reading capability of existing MLLMs on vertically written Japanese text. First, we generate a synthetic Japanese OCR dataset by rendering Japanese texts into images, and use it for both model fine-tuning and evaluation. This dataset includes Japanese text in both horizontal and vertical writing. We also create an evaluation dataset sourced from real-world document images containing vertically written Japanese text. Using these datasets, we demonstrate that the existing MLLMs perform worse on vertically written Japanese text than on horizontally written Japanese text. Furthermore, we show that training MLLMs on our synthesized Japanese OCR dataset improves the performance of models that previously could not handle vertical writing. The datasets and code are publicly available at https://github.com/llm-jp/eval_vertical_ja.
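
For readers unfamiliar with vertical writing, the sketch below lays out a string the way vertical Japanese text is read: characters run top to bottom within a column, and columns are ordered right to left. It produces a plain character grid rather than an image, and the function name and column height are hypothetical; it is not the rendering pipeline used to build the dataset.

```python
def layout_vertical(text, column_height):
    """Arrange characters into top-to-bottom columns ordered right to left."""
    columns = [text[i:i + column_height]
               for i in range(0, len(text), column_height)]
    rows = []
    for r in range(column_height):
        # The first column of text sits rightmost, so columns are reversed.
        # Short columns are padded with an ideographic space (U+3000).
        rows.append("".join(col[r] if r < len(col) else "\u3000"
                            for col in reversed(columns)))
    return rows


rows = layout_vertical("日本語の縦書き", 4)
for row in rows:
    print(row)
```

Reading the printed grid row by row, as a horizontally oriented reader would, scrambles the sentence, which gives some intuition for why models trained mostly on horizontal text may struggle.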

[89] Reasoning via Video: The First Evaluation of Video Models’ Reasoning Abilities through Maze-Solving Tasks

Cheng Yang, Haiyuan Wan, Yiran Peng, Xin Cheng, Zhaoyang Yu, Jiayi Zhang, Junchi Yu, Xinlei Yu, Xiawu Zheng, Dongzhan Zhou, Chenglin Wu

Main category: cs.CV

TL;DR: VR-Bench is a comprehensive benchmark for evaluating video models’ reasoning capabilities through maze-solving tasks, showing that SFT can efficiently elicit reasoning abilities and video models outperform VLMs in spatial perception.

DetailsMotivation: To explore whether video models can reason via video generation, leveraging video's explicit spatial layouts and temporal continuity as an ideal substrate for spatial reasoning.

Method: Created VR-Bench with 7,920 procedurally generated videos across five maze types and diverse visual styles, using supervised fine-tuning (SFT) to elicit reasoning capabilities in video models.

Result: Video models exhibited stronger spatial perception during reasoning, outperforming leading VLMs and generalizing well across diverse scenarios. Test-time scaling with diverse sampling improved reasoning reliability by 10-20%.

Conclusion: Video models demonstrate unique potential and scalability for spatial reasoning tasks through the reasoning via video paradigm.

Abstract: Video models have achieved remarkable success in high-fidelity video generation with coherent motion dynamics. Analogous to the development from text generation to text-based reasoning in language modeling, the development of video models motivates us to ask: Can video models reason via video generation? Compared with the discrete text corpus, video grounds reasoning in explicit spatial layouts and temporal continuity, which serves as an ideal substrate for spatial reasoning. In this work, we explore the reasoning via video paradigm and introduce VR-Bench – a comprehensive benchmark designed to systematically evaluate video models’ reasoning capabilities. Grounded in maze-solving tasks that inherently require spatial planning and multi-step reasoning, VR-Bench contains 7,920 procedurally generated videos across five maze types and diverse visual styles. Our empirical analysis demonstrates that SFT can efficiently elicit the reasoning ability of video models. Video models exhibit stronger spatial perception during reasoning, outperforming leading VLMs and generalizing well across diverse scenarios, tasks, and levels of complexity. We further discover a test-time scaling effect, where diverse sampling during inference improves reasoning reliability by 10–20%. These findings highlight the unique potential and scalability of reasoning via video for spatial reasoning tasks.
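
The test-time scaling effect can be pictured as drawing several candidate solutions and counting a maze as solved if any of them is valid. The sketch below checks candidate paths on a small grid maze. The grid encoding, the hand-written candidate paths, and the "any of the first k samples" criterion are assumptions for illustration; the benchmark evaluates generated videos, and its exact aggregation rule is not stated in the abstract.

```python
def is_valid_path(grid, path, start, goal):
    """Check that a path steps through open cells, one move at a time, start to goal."""
    if not path or path[0] != start or path[-1] != goal:
        return False
    for r, c in path:
        inside = 0 <= r < len(grid) and 0 <= c < len(grid[0])
        if not inside or grid[r][c] == "#":
            return False
    return all(abs(r1 - r2) + abs(c1 - c2) == 1
               for (r1, c1), (r2, c2) in zip(path, path[1:]))


def solved_within(grid, samples, start, goal, k):
    """True if any of the first k sampled paths is a valid solution."""
    return any(is_valid_path(grid, p, start, goal) for p in samples[:k])


# "#" marks a wall and "." an open cell in this hypothetical encoding.
maze = ["..#",
        ".##",
        "..."]
start, goal = (0, 0), (2, 2)
samples = [
    [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2)],  # passes through walls
    [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)],  # valid route
]

one_sample = solved_within(maze, samples, start, goal, k=1)
two_samples = solved_within(maze, samples, start, goal, k=2)
```

With one sample the maze counts as unsolved; with two it counts as solved, which is the basic mechanism by which more diverse samples can raise reliability.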

[90] BokehFlow: Depth-Free Controllable Bokeh Rendering via Flow Matching

Yachuan Huang, Xianrui Luo, Qiwen Wang, Liao Shen, Jiaqi Li, Huiqiang Sun, Zihao Huang, Wei Jiang, Zhiguo Cao

Main category: cs.CV

TL;DR: BokehFlow is a depth-free framework for controllable bokeh rendering using flow matching that synthesizes photorealistic bokeh effects from all-in-focus images without needing depth inputs, with semantic control via text prompts.

DetailsMotivation: Existing bokeh rendering methods require accurate depth maps or struggle with limited controllability and efficiency. There's a need for depth-free approaches that offer precise control over bokeh effects.

Method: Uses flow matching framework with cross-attention mechanism to enable semantic control over focus regions and blur intensity via text prompts. Trained on four collected and synthesized datasets.

Result: Achieves visually compelling bokeh effects with precise control, outperforming existing depth-dependent and generative methods in both rendering quality and efficiency.

Conclusion: BokehFlow provides an effective depth-free solution for controllable bokeh rendering that eliminates the need for depth inputs while maintaining high quality and efficiency.

Abstract: Bokeh rendering simulates the shallow depth-of-field effect in photography, enhancing visual aesthetics and guiding viewer attention to regions of interest. Although recent approaches perform well, rendering controllable bokeh without additional depth inputs remains a significant challenge. Existing classical and neural controllable methods rely on accurate depth maps, while generative approaches often struggle with limited controllability and efficiency. In this paper, we propose BokehFlow, a depth-free framework for controllable bokeh rendering based on flow matching. BokehFlow directly synthesizes photorealistic bokeh effects from all-in-focus images, eliminating the need for depth inputs. It employs a cross-attention mechanism to enable semantic control over both focus regions and blur intensity via text prompts. To support training and evaluation, we collect and synthesize four datasets. Extensive experiments demonstrate that BokehFlow achieves visually compelling bokeh effects and offers precise control, outperforming existing depth-dependent and generative methods in both rendering quality and efficiency.
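
To make the flow matching idea concrete, here is a minimal sketch of how such a model is typically trained and sampled. It is not BokehFlow's implementation: the abstract does not state which probability path is used, so the straight-line path, the constant target velocity, and all function names below are illustrative assumptions, and the learned network is replaced by the exact target velocity.

```python
def interpolate(x0, x1, t):
    """Point on the assumed straight-line path between x0 and x1 at time t."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Regression target for the velocity network under the straight-line path."""
    return [b - a for a, b in zip(x0, x1)]


def euler_sample(x0, velocity_fn, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy usage: x0 stands in for the source sample, x1 for the target sample.
x0, x1 = [0.0, 0.0], [1.0, 2.0]
# Hypothetical stand-in for a trained network: returns the exact target velocity.
oracle = lambda x, t: target_velocity(x0, x1)
print(euler_sample(x0, oracle))
```

With an exact velocity field the integration lands on `x1` up to floating-point error; a trained network would only approximate it, and text conditioning via cross-attention would enter as an extra argument to the velocity function.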

[91] MambaTrack3D: A State Space Model Framework for LiDAR-Based Object Tracking under High Temporal Variation

Shengjing Tian, Yinan Han, Xiantong Zhao, Xuehu Liu, Qi Lang

Main category: cs.CV

TL;DR: MambaTrack3D is a novel 3D single object tracking framework for high temporal variation environments that uses Mamba-based inter-frame propagation and grouped feature enhancement to achieve near-linear complexity and superior accuracy-efficiency trade-off.

DetailsMotivation: Dynamic outdoor environments with high temporal variation pose challenges for existing memory-based trackers due to quadratic computational complexity, temporal redundancy, and insufficient exploitation of geometric priors.

Method: Uses Mamba-based Inter-frame Propagation module for efficient inter-frame modeling with near-linear complexity, and Grouped Feature Enhancement Module to separate foreground/background semantics and reduce temporal redundancy.

Result: Outperforms both HTV-oriented and normal-scenario trackers on KITTI-HTV and nuScenes-HTV benchmarks, achieving improvements of up to 6.5 in success and 9.5 in precision over HVTrack under moderate temporal gaps. Remains competitive on the standard KITTI dataset.

Conclusion: MambaTrack3D achieves superior accuracy-efficiency trade-off with strong generalization ability across both specialized HTV and conventional tracking scenarios.

Abstract: Dynamic outdoor environments with high temporal variation (HTV) pose significant challenges for 3D single object tracking in LiDAR point clouds. Existing memory-based trackers often suffer from quadratic computational complexity, temporal redundancy, and insufficient exploitation of geometric priors. To address these issues, we propose MambaTrack3D, a novel HTV-oriented tracking framework built upon the state space model Mamba. Specifically, we design a Mamba-based Inter-frame Propagation (MIP) module that replaces conventional single-frame feature extraction with efficient inter-frame propagation, achieving near-linear complexity while explicitly modeling spatial relations across historical frames. Furthermore, a Grouped Feature Enhancement Module (GFEM) is introduced to separate foreground and background semantics at the channel level, thereby mitigating temporal redundancy in the memory bank. Extensive experiments on KITTI-HTV and nuScenes-HTV benchmarks demonstrate that MambaTrack3D consistently outperforms both HTV-oriented and normal-scenario trackers, achieving improvements of up to 6.5 success and 9.5 precision over HVTrack under moderate temporal gaps. On the standard KITTI dataset, MambaTrack3D remains highly competitive with state-of-the-art normal-scenario trackers, confirming its strong generalization ability. Overall, MambaTrack3D achieves a superior accuracy-efficiency trade-off, delivering robust performance across both specialized HTV and conventional tracking scenarios.
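
The near-linear complexity claim follows from the recurrent form of state space models: each frame updates a fixed-size state once, rather than attending to every earlier frame. The sketch below shows that recurrence in its simplest scalar form. It is only an illustration: Mamba uses input-dependent (selective) parameters and vector states, and the parameters `a`, `b`, `c` and the function name are assumptions, not part of MambaTrack3D.

```python
def ssm_scan(inputs, a=0.5, b=1.0, c=1.0):
    """Run h_t = a * h_{t-1} + b * x_t, y_t = c * h_t over a sequence.

    One pass over the sequence, so the cost grows linearly with its length.
    """
    h = 0.0
    outputs = []
    for x in inputs:
        h = a * h + b * x      # state update carries history forward
        outputs.append(c * h)  # readout from the current state
    return outputs


# An impulse at the first step decays geometrically through the state.
print(ssm_scan([1.0, 0.0, 0.0]))  # [1.0, 0.5, 0.25]
```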

[92] Generating Natural-Language Surgical Feedback: From Structured Representation to Domain-Grounded Evaluation

Firdavs Nasriddinov, Rafal Kocielnik, Anima Anandkumar, Andrew J. Hung

Main category: cs.CV

TL;DR: A structure-aware pipeline that learns surgical action ontology from trainer-trainee transcripts and uses Instrument-Action-Target triplets to condition GPT-4o for generating clinically grounded, trainer-style feedback.

DetailsMotivation: Automating natural, trainer-style feedback for surgical training to provide timely, accessible, and consistent guidance at scale, requiring models that understand clinically relevant representations.

Method: Mining IAT triplets from real-world feedback text, fine-tuning video-to-IAT model with surgical procedure context and temporal instrument motion, and using IAT representations to guide GPT-4o feedback generation.

Result: Video-to-IAT recognition improved AUC (Instrument: 0.67→0.74; Action: 0.60→0.63; Tissue: 0.74→0.79). IAT conditioning improved feedback fidelity from 2.17 to 2.44 (+12.4%), doubling admissible generations from 21% to 42%, with word error rate decreasing 15-31% and ROUGE increasing 9-64%.

Conclusion: Grounding feedback generation in explicit IAT structure improves fidelity, yields clinician-verifiable rationales, and supports auditable use in surgical training.

Abstract: High-quality intraoperative feedback from a surgical trainer is pivotal for improving trainee performance and long-term skill acquisition. Automating natural, trainer-style feedback promises timely, accessible, and consistent guidance at scale but requires models that understand clinically relevant representations. We present a structure-aware pipeline that learns a surgical action ontology from real trainer-to-trainee transcripts (33 surgeries) and uses it to condition feedback generation. We contribute by (1) mining Instrument-Action-Target (IAT) triplets from real-world feedback text and clustering surface forms into normalized categories, (2) fine-tuning a video-to-IAT model that leverages the surgical procedure and task contexts as well as fine-grained temporal instrument motion, and (3) demonstrating how to effectively use IAT triplet representations to guide GPT-4o in generating clinically grounded, trainer-style feedback. We show that, on Task 1: Video-to-IAT recognition, our context injection and temporal tracking deliver consistent AUC gains (Instrument: 0.67 to 0.74; Action: 0.60 to 0.63; Tissue: 0.74 to 0.79). For Task 2: feedback text generation (rated on a 1-5 fidelity rubric where 1 = opposite/unsafe, 3 = admissible, and 5 = perfect match to a human trainer), GPT-4o from video alone scores 2.17, while IAT conditioning reaches 2.44 (+12.4%), doubling the share of admissible generations with score >= 3 from 21% to 42%. Traditional text-similarity metrics also improve: word error rate decreases by 15-31% and ROUGE (phrase/substring overlap) increases by 9-64%. Grounding generation in explicit IAT structure improves fidelity and yields clinician-verifiable rationales, supporting auditable use in surgical training.
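
The reported gains mix relative and absolute changes, so a quick check helps when reading them. The snippet below recomputes them from the figures in the abstract; the helper name `relative_change` is introduced here for illustration only.

```python
def relative_change(before, after):
    """Relative change from `before` to `after`, as a fraction of `before`."""
    return (after - before) / before


# Fidelity score on the 1-5 rubric: video-only GPT-4o vs. IAT-conditioned.
fidelity_gain = relative_change(2.17, 2.44)
# Share of generations scoring >= 3 (admissible).
admissible_gain = relative_change(0.21, 0.42)
# AUC gains are absolute differences, not percentages.
auc_gains = {
    "Instrument": 0.74 - 0.67,
    "Action": 0.63 - 0.60,
    "Tissue": 0.79 - 0.74,
}

print(f"fidelity: {fidelity_gain:+.1%}")            # fidelity: +12.4%
print(f"admissible share: {admissible_gain:+.0%}")  # admissible share: +100%
```

The +12.4% is therefore a relative gain on the mean rubric score, while the doubling refers to the share of generations that clear the admissibility threshold.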

[93] TiCAL: Typicality-Based Consistency-Aware Learning for Multimodal Emotion Recognition

Wen Yin, Siyu Zhan, Cencen Liu, Xin Hu, Guiduo Duan, Xiurui Xie, Yuan-Fang Li, Tao He

Main category: cs.CV

TL;DR: TiCAL addresses inter-modal emotion conflicts in multimodal emotion recognition by using typicality-based consistency assessment and hyperbolic feature embedding to improve performance on inconsistent samples.

DetailsMotivation: Existing MER approaches overlook inter-modal emotion conflicts where different modalities in the same sample express divergent emotional tendencies, limiting recognition accuracy.

Method: Proposes TiCAL framework with stage-wise emotion perception, dynamic consistency assessment using pseudo unimodal labels and typicality estimation, and hyperbolic space feature embedding for fine-grained emotion distinctions.

Result: Achieves about 2.6% improvement over state-of-the-art DMD on benchmark datasets CMU-MOSEI and MER2023, particularly enhancing performance on samples with high modality inconsistency.

Conclusion: TiCAL effectively mitigates inter-modal emotional conflicts and enhances overall recognition accuracy by incorporating consistency estimates and hyperbolic feature representation.

Abstract: Multimodal Emotion Recognition (MER) aims to accurately identify human emotional states by integrating heterogeneous modalities such as visual, auditory, and textual data. Existing approaches predominantly rely on unified emotion labels to supervise model training, often overlooking a critical challenge: inter-modal emotion conflicts, wherein different modalities within the same sample may express divergent emotional tendencies. In this work, we address this overlooked issue by proposing a novel framework, Typicality-based Consistent-aware Multimodal Emotion Recognition (TiCAL), inspired by the stage-wise nature of human emotion perception. TiCAL dynamically assesses the consistency of each training sample by leveraging pseudo unimodal emotion labels alongside a typicality estimation. To further enhance emotion representation, we embed features in a hyperbolic space, enabling the capture of fine-grained distinctions among emotional categories. By incorporating consistency estimates into the learning process, our method improves model performance, particularly on samples exhibiting high modality inconsistency. Extensive experiments on benchmark datasets, e.g., CMU-MOSEI and MER2023, validate the effectiveness of TiCAL in mitigating inter-modal emotional conflicts and enhancing overall recognition accuracy, e.g., an improvement of about 2.6% over the state-of-the-art DMD.
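
As a rough illustration of why hyperbolic space can separate fine-grained categories, the sketch below computes distances in the Poincaré ball, one common model of hyperbolic geometry. The abstract only says "hyperbolic space", so the choice of the Poincaré ball and the function name are assumptions rather than TiCAL's stated design.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincare ball."""
    def sq_norm(x):
        return sum(xi * xi for xi in x)

    diff = sq_norm([a - b for a, b in zip(u, v)])
    denom = (1.0 - sq_norm(u)) * (1.0 - sq_norm(v))
    return math.acosh(1.0 + 2.0 * diff / denom)


origin = [0.0, 0.0]
# Equal Euclidean steps cost more and more distance near the boundary.
print(poincare_distance(origin, [0.5, 0.0]))
print(poincare_distance(origin, [0.9, 0.0]))
```

Distances grow rapidly toward the boundary of the ball, which gives embeddings more room to spread out closely related classes than a Euclidean space of the same dimension.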

[94] Jointly Conditioned Diffusion Model for Multi-View Pose-Guided Person Image Synthesis

Chengyu Xie, Zhi Gong, Junchi Ren, Linkun Yu, Si Shen, Fei Shen, Xiaoyu Du

Main category: cs.CV

TL;DR: JCDM is a jointly conditioned diffusion framework that uses multi-view priors to improve pose-guided human image generation by addressing incomplete textures from single views and enabling cross-view interaction.

DetailsMotivation: Current pose-guided human image generation suffers from incomplete textures when using only single reference views and lacks explicit cross-view interaction mechanisms.

Method: Uses an appearance prior module (APM) to infer a holistic identity-preserving prior from incomplete references, and a joint conditional injection (JCI) mechanism to fuse multi-view cues and inject shared conditioning into the denoising backbone.

Result: Achieves state-of-the-art fidelity and cross-view consistency in human image generation.

Conclusion: JCDM effectively addresses limitations of single-view references by leveraging multi-view priors and supports variable reference views with minimal architectural modifications to standard diffusion backbones.

Abstract: Pose-guided human image generation is limited by incomplete textures from single reference views and the absence of explicit cross-view interaction. We present the jointly conditioned diffusion model (JCDM), a jointly conditioned diffusion framework that exploits multi-view priors. The appearance prior module (APM) infers a holistic identity-preserving prior from incomplete references, and the joint conditional injection (JCI) mechanism fuses multi-view cues and injects shared conditioning into the denoising backbone to align identity, color, and texture across poses. JCDM supports a variable number of reference views and integrates with standard diffusion backbones with minimal and targeted architectural modifications. Experiments demonstrate state-of-the-art fidelity and cross-view consistency.

[95] A Comprehensive Study on Visual Token Redundancy for Discrete Diffusion-based Multimodal Large Language Models

Duo Li, Zuhao Yang, Xiaoqin Zhang, Ling Shao, Shijian Lu

Main category: cs.CV

TL;DR: This paper analyzes visual token redundancy in discrete diffusion-based multimodal large language models (dMLLMs) and proposes modality-specific efficiency optimizations, showing that different pruning strategies work best for different dMLLM architectures.

DetailsMotivation: Existing dMLLMs suffer from computational overhead during inference due to full-sequence attention computation, and current efficiency optimizations overlook modality-specific visual token redundancy.

Method: Conducted comprehensive study on visual token redundancy evolution across different dMLLM architectures and tasks, analyzing how visual token pruning affects responses and efficiency.

Result: Found that visual redundancy emerges only in from-scratch dMLLMs during long-answer tasks, and that from-scratch dMLLMs can recover pruned information progressively during late denoising steps.

Conclusion: Layer-skipping works best for AR-to-diffusion dMLLMs, while progressive/late-step pruning is more effective for from-scratch dMLLMs, providing new efficiency optimization perspectives for multimodal understanding tasks.

Abstract: Discrete diffusion-based multimodal large language models (dMLLMs) have emerged as a promising alternative to autoregressive MLLMs thanks to their advantages in parallel decoding and bidirectional context modeling, but most existing dMLLMs incur significant computational overhead during inference due to the full-sequence attention computation in each denoising step. Pioneering studies attempt to resolve this issue from a modality-agnostic perspective via key-value cache optimization or efficient sampling, but most of them overlook modality-specific visual token redundancy. In this work, we conduct a comprehensive study on how visual token redundancy evolves with different dMLLM architectures and tasks and how visual token pruning affects dMLLM responses and efficiency. Specifically, our study reveals that visual redundancy emerges only in from-scratch dMLLMs while handling long-answer tasks. In addition, we validate that visual token pruning introduces non-negligible information loss in dMLLMs and only from-scratch dMLLMs can recover the lost information progressively during late denoising steps. Furthermore, our study shows that layer-skipping is promising for accelerating AR-to-diffusion dMLLMs, whereas progressive or late-step pruning is more effective for from-scratch dMLLMs. Overall, this work offers a new perspective on efficiency optimization for dMLLMs, greatly advancing their applicability across various multimodal understanding tasks.

[96] Gaussian Blending: Rethinking Alpha Blending in 3D Gaussian Splatting

Junseo Koo, Jinseo Jeong, Gunhee Kim

Main category: cs.CV

TL;DR: Gaussian Blending replaces alpha blending in 3D Gaussian Splatting to eliminate zooming artifacts by treating alpha and transmittance as spatial distributions rather than scalars.

DetailsMotivation: Current 3DGS methods suffer from erosion-induced blurring when zooming in and dilation-induced staircase artifacts when zooming out at unseen sampling rates, due to limitations of conventional alpha blending.

Method: Proposed Gaussian Blending that treats alpha and transmittance as spatially varying distributions, allowing nearby background splats to contribute to rendering while maintaining real-time speed and no extra memory cost.

Result: Effectively captures fine details at various unseen sampling rates, consistently outperforming existing novel view synthesis models across both seen and unseen sampling rates.

Conclusion: Gaussian Blending successfully addresses fundamental limitations of alpha blending in 3DGS, providing artifact-free rendering at different zoom levels without sacrificing performance or memory efficiency.

Abstract: The recent introduction of 3D Gaussian Splatting (3DGS) has significantly advanced novel view synthesis. Several studies have further improved the rendering quality of 3DGS, yet they still exhibit noticeable visual discrepancies when synthesizing views at sampling rates unseen during training. Specifically, they suffer from (i) erosion-induced blurring artifacts when zooming in and (ii) dilation-induced staircase artifacts when zooming out. We speculate that these artifacts arise from the fundamental limitation of the alpha blending adopted in 3DGS methods. Instead of the conventional alpha blending that computes alpha and transmittance as scalar quantities over a pixel, we propose to replace it with our novel Gaussian Blending that treats alpha and transmittance as spatially varying distributions. Thus, transmittances can be updated considering the spatial distribution of alpha values across the pixel area, allowing nearby background splats to contribute to the final rendering. Our Gaussian Blending maintains real-time rendering speed and requires no additional memory cost, while being easily integrated as a drop-in replacement into existing 3DGS-based or other NVS frameworks. Extensive experiments demonstrate that Gaussian Blending effectively captures fine details at various sampling rates unseen during training, consistently outperforming existing novel view synthesis models across both unseen and seen sampling rates.
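
For reference, the conventional alpha blending that the paper sets out to replace can be sketched as front-to-back compositing with one scalar alpha and one scalar transmittance per pixel. This is the baseline only, not Gaussian Blending itself; the single-channel colors and the function name are simplifying assumptions.

```python
def alpha_blend(splats):
    """Composite splats front to back for one pixel.

    `splats` is a list of (color, alpha) pairs sorted from near to far.
    Returns the blended color and the remaining transmittance.
    """
    color = 0.0
    transmittance = 1.0
    for c, a in splats:
        color += transmittance * a * c  # contribution seen through earlier splats
        transmittance *= (1.0 - a)      # light left for splats further back
    return color, transmittance


print(alpha_blend([(1.0, 0.5), (0.5, 0.5)]))  # (0.625, 0.25)
```

Because alpha and transmittance are single numbers per pixel here, a foreground splat attenuates everything behind it uniformly across the pixel area. The abstract's proposal is to treat both as spatially varying distributions instead.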

[97] Computer-Use Agents as Judges for Generative User Interface

Kevin Qinghong Lin, Siyuan Hu, Linjie Li, Zhengyuan Yang, Lijuan Wang, Philip Torr, Mike Zheng Shou

Main category: cs.CV

TL;DR: This paper introduces AUI-Gym, a benchmark for automatic GUI development using Computer-Use Agents (CUA) as judges to assist coding-oriented language models (Coder) in designing agent-native interfaces that prioritize task efficiency over human aesthetics.

DetailsMotivation: Current GUIs are designed for humans, forcing agents to adopt inefficient human-oriented behaviors. The paper explores whether CUAs can serve as judges to help Coders create more efficient agent-native interfaces.

Method: Proposed a Coder-CUA Collaboration framework: Coder generates and revises websites, CUA evaluates functionality and refines designs. Created AUI-Gym benchmark with 52 applications and 1560 synthesized tasks, plus a verifier to ensure task executability. Designed CUA Dashboard to compress navigation histories into visual summaries.

Result: Success is measured by task solvability and CUA navigation success rate rather than visual appearance. The framework enables iterative redesign based on agent feedback.

Conclusion: The work shifts interface design toward agent-native efficiency and reliability, moving agents from passive use to active participation in digital environments.

Abstract: Computer-Use Agents (CUA) are becoming increasingly capable of autonomously operating digital environments through Graphical User Interfaces (GUI). Yet, most GUI remain designed primarily for humans, prioritizing aesthetics and usability, which forces agents to adopt human-oriented behaviors that are unnecessary for efficient task execution. At the same time, rapid advances in coding-oriented language models (Coder) have transformed automatic GUI design. This raises a fundamental question: Can CUA act as judges to assist Coder in automatic GUI design? To investigate, we introduce AUI-Gym, a benchmark for Automatic GUI development spanning 52 applications across diverse domains. Using language models, we synthesize 1560 tasks that simulate real-world scenarios. To ensure task reliability, we further develop a verifier that programmatically checks whether each task is executable within its environment. Building on this, we propose a Coder-CUA in Collaboration framework: the Coder acts as Designer, generating and revising websites, while the CUA serves as Judge, evaluating functionality and refining designs. Success is measured not by visual appearance, but by task solvability and CUA navigation success rate. To turn CUA feedback into usable guidance, we design a CUA Dashboard that compresses multi-step navigation histories into concise visual summaries, offering interpretable guidance for iterative redesign. By positioning agents as both designers and judges, our framework shifts interface design toward agent-native efficiency and reliability. Our work takes a step toward shifting agents from passive use toward active participation in digital environments. Our code and dataset are available at https://github.com/showlab/AUI.

[98] Unbiased Semantic Decoding with Vision Foundation Models for Few-shot Segmentation

Jin Wang, Bingfeng Zhang, Jian Pang, Weifeng Liu, Baodi Liu, Honglong Chen

Main category: cs.CV

TL;DR: Proposes Unbiased Semantic Decoding (USD) strategy with SAM for few-shot segmentation, using CLIP’s semantic alignment to enhance SAM features and generate target-focused prompts from both support and query sets.

DetailsMotivation: Previous SAM-based few-shot segmentation approaches rely heavily on support set prompts, which insufficiently activate SAM's generalization ability and cause biased decoding for unknown classes.

Method: Uses CLIP semantic alignment to enhance SAM features via global image-level category indication and local pixel-level target location guidance. Introduces learnable visual-text target prompt generator that interacts text embeddings with CLIP visual features.

Result: The approach achieves unbiased semantic discrimination without retraining foundation models, enabling target-focused attention through rich target information in prompts.

Conclusion: USD strategy successfully addresses biased decoding in SAM-based few-shot segmentation by leveraging CLIP semantics for enhanced feature discrimination and target-focused prompt generation.

Abstract: Few-shot segmentation has garnered significant attention. Many recent approaches attempt to introduce the Segment Anything Model (SAM) to handle this task. With the strong generalization ability and rich object-specific extraction ability of the SAM model, such a solution shows great potential in few-shot segmentation. However, the decoding process of SAM relies heavily on accurate and explicit prompts, so previous approaches mainly focus on extracting prompts from the support set, which is insufficient to activate the generalization ability of SAM, and this design easily results in a biased decoding process when adapting to unknown classes. In this work, we propose an Unbiased Semantic Decoding (USD) strategy integrated with SAM, which extracts target information from both the support and query set simultaneously to perform consistent predictions guided by the semantics of the Contrastive Language-Image Pre-training (CLIP) model. Specifically, to enhance the unbiased semantic discrimination of SAM, we design two feature enhancement strategies that leverage the semantic alignment capability of CLIP to enrich the original SAM features, mainly including a global supplement at the image level that provides a generalized category indication from the support image and local guidance at the pixel level that provides a useful target location from the query image. Besides, to generate target-focused prompt embeddings, a learnable visual-text target prompt generator is proposed that lets target text embeddings interact with CLIP visual features. Without requiring re-training of the vision foundation models, the semantically discriminative features draw attention to the target region under the guidance of prompts rich in target information.

[99] WaveFuse-AL: Cyclical and Performance-Adaptive Multi-Strategy Active Learning for Medical Images

Nishchala Thakur, Swati Kochhar, Deepti R. Bathula, Sukrit Gupta

Main category: cs.CV

TL;DR: WaveFuse-AL is a multi-strategy active learning framework that adaptively fuses BALD, BADGE, Entropy, and CoreSet strategies using cyclical temporal priors and performance-driven adaptation for medical imaging tasks.

DetailsMotivation: Individual active learning strategies often show inconsistent behavior across different stages of the learning cycle, leading to suboptimal performance in medical imaging where annotation costs are high.

Method: Proposes WaveFuse-AL framework that integrates cyclical (sinusoidal) temporal priors with performance-driven adaptation to dynamically adjust the importance of multiple acquisition strategies (BALD, BADGE, Entropy, CoreSet) over time.

Result: WaveFuse-AL consistently outperforms single-strategy and alternating-strategy baselines on three medical imaging benchmarks (APTOS-2019, RSNA Pneumonia Detection, ISIC-2018), achieving statistically significant improvements on 10 out of 12 metric measurements.

Conclusion: The proposed adaptive multi-strategy fusion approach maximizes utility of limited annotation budgets in medical imaging by effectively combining complementary acquisition strategies throughout the active learning cycle.

Abstract: Active learning reduces annotation costs in medical imaging by strategically selecting the most informative samples for labeling. However, individual acquisition strategies often exhibit inconsistent behavior across different stages of the active learning cycle. We propose Cyclical and Performance-Adaptive Multi-Strategy Active Learning (WaveFuse-AL), a novel framework that adaptively fuses multiple established acquisition strategies (BALD, BADGE, Entropy, and CoreSet) throughout the learning process. WaveFuse-AL integrates cyclical (sinusoidal) temporal priors with performance-driven adaptation to dynamically adjust strategy importance over time. We evaluate WaveFuse-AL on three medical imaging benchmarks: APTOS-2019 (multi-class classification), RSNA Pneumonia Detection (binary classification), and ISIC-2018 (skin lesion segmentation). Experimental results demonstrate that WaveFuse-AL consistently outperforms both single-strategy and alternating-strategy baselines, achieving statistically significant performance improvements (on ten out of twelve metric measurements) while maximizing the utility of limited annotation budgets.
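
One way to picture the fusion of a cyclical prior with performance-driven adaptation is the toy weighting scheme below. The abstract does not give the actual formula, so everything here is a hypothetical illustration: the per-strategy phase offsets, the `period` and `temperature` parameters, the additive combination, and the softmax normalization are all assumptions.

```python
import math

STRATEGIES = ["BALD", "BADGE", "Entropy", "CoreSet"]


def strategy_weights(round_idx, recent_gain, period=8, temperature=1.0):
    """Combine a sinusoidal prior with recent per-strategy gains (assumed form)."""
    logits = []
    for k, name in enumerate(STRATEGIES):
        phase = 2.0 * math.pi * k / len(STRATEGIES)  # stagger the strategies
        prior = math.sin(2.0 * math.pi * round_idx / period + phase)
        logits.append((prior + recent_gain.get(name, 0.0)) / temperature)
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]      # numerically stable softmax
    total = sum(exps)
    return {name: e / total for name, e in zip(STRATEGIES, exps)}


def fused_score(sample_scores, weights):
    """Weighted sum of one sample's per-strategy acquisition scores."""
    return sum(weights[name] * sample_scores[name] for name in weights)


weights = strategy_weights(round_idx=0, recent_gain={"BALD": 0.3})
print(weights)
```

In this sketch the prior rotates emphasis across strategies over the rounds, while a strategy whose recent selections improved validation performance gets a larger share of the weight.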

[100] When to Think and When to Look: Uncertainty-Guided Lookback

Jing Bi, Filippos Bellos, Junjia Guo, Yayuan Li, Chao Huang, Yunlong Tang, Luchuan Song, Susan Liang, Zhongfei Zhang, Jason J. Corso, Chenliang Xu

Main category: cs.CV

TL;DR: Test-time thinking in vision-language models doesn’t always improve performance: long reasoning chains can ignore the image and underperform standard instruct mode. Short lookback phrases that refer back to the image correlate with successful trajectories.

DetailsMotivation: To systematically analyze how test-time thinking actually affects visual reasoning in large vision language models, since despite promising results, there's no comprehensive understanding of its mechanisms.

Method: Large-scale controlled comparison of 10 LVLM variants on MMMU-val with generous token budgets and multi-pass decoding, followed by uncertainty-guided lookback, a training-free decoding strategy that combines an uncertainty signal with adaptive lookback prompts and breadth search.

Result: More thinking is not always better; long chains often underperform standard instruct mode. Short lookback phrases correlate with better visual grounding. The proposed method improves MMMU performance, delivers largest gains where standard thinking is weak, and outperforms baselines.

Conclusion: Uncertainty-guided lookback with adaptive prompts and breadth search provides consistent improvements across multiple benchmarks, setting new state-of-the-art under fixed model families and token budgets.

Abstract: Test-time thinking (that is, generating explicit intermediate reasoning chains) is known to boost performance in large language models and has recently shown strong gains for large vision language models (LVLMs). However, despite these promising results, there is still no systematic analysis of how thinking actually affects visual reasoning. We provide the first such analysis with a large-scale, controlled comparison of thinking for LVLMs, evaluating ten variants from the InternVL3.5 and Qwen3-VL families on MMMU-val under generous token budgets and multi-pass decoding. We show that more thinking is not always better; long chains often yield long wrong trajectories that ignore the image and underperform the same models run in standard instruct mode. A deeper analysis reveals that certain short lookback phrases, which explicitly refer back to the image, are strongly enriched in successful trajectories and correlate with better visual grounding. Building on this insight, we propose uncertainty-guided lookback, a training-free decoding strategy that combines an uncertainty signal with adaptive lookback prompts and breadth search. Our method improves overall MMMU performance, delivers the largest gains in categories where standard thinking is weak, and outperforms several strong decoding baselines, setting a new state of the art under fixed model families and token budgets. We further show that this decoding strategy generalizes, yielding consistent improvements on five additional benchmarks, including two broad multimodal suites and math-focused visual reasoning datasets.
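
A minimal sketch of the trigger logic such a decoding strategy might use is shown below. The abstract does not specify the uncertainty signal, the threshold, or the wording of the lookback prompt, so next-token entropy, the value `1.0`, and the phrase are all placeholder assumptions, and the breadth search component is omitted.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def maybe_lookback(generated, probs, threshold=1.0,
                   phrase=" Let me look at the image again."):
    """Append a lookback phrase when the model looks uncertain (assumed rule)."""
    if token_entropy(probs) > threshold:
        return generated + phrase, True
    return generated, False


confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.25, 0.25, 0.25, 0.25]
print(maybe_lookback("The chart shows", confident))
print(maybe_lookback("The chart shows", uncertain))
```

Only the uncertain distribution triggers the phrase, which keeps confident trajectories short while steering uncertain ones back to the image.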

[101] DCL-SE: Dynamic Curriculum Learning for Spatiotemporal Encoding of Brain Imaging

Meihua Zhou, Xinyu Tong, Jiarui Zhao, Min Cheng, Li Yang, Lei Tian, Nan Wan

Main category: cs.CV

TL;DR: DCL-SE is an end-to-end framework that uses data-driven spatiotemporal encoding and dynamic curriculum learning to improve neuroimaging analysis, outperforming existing methods in accuracy, robustness, and interpretability across multiple clinical tasks.

DetailsMotivation: Address limitations in high-dimensional neuroimaging analysis including compromises in spatiotemporal fidelity and limited adaptability of large-scale general-purpose models.

Method: Uses Approximate Rank Pooling to encode 3D brain data into 2D dynamic representations, then employs dynamic curriculum learning with Dynamic Group Mechanism to progressively train decoder from global structures to fine pathological details.

Result: Consistently outperforms existing methods across six datasets for Alzheimer’s disease classification, brain tumor classification, cerebral artery segmentation, and brain age prediction.

Conclusion: Demonstrates the critical importance of compact, task-specific architectures in the era of large-scale pretrained networks for neuroimaging analysis.

Abstract: High-dimensional neuroimaging analyses for clinical diagnosis are often constrained by compromises in spatiotemporal fidelity and by the limited adaptability of large-scale, general-purpose models. To address these challenges, we introduce Dynamic Curriculum Learning for Spatiotemporal Encoding (DCL-SE), an end-to-end framework centered on data-driven spatiotemporal encoding (DaSE). We leverage Approximate Rank Pooling (ARP) to efficiently encode three-dimensional volumetric brain data into information-rich, two-dimensional dynamic representations, and then employ a dynamic curriculum learning strategy, guided by a Dynamic Group Mechanism (DGM), to progressively train the decoder, refining feature extraction from global anatomical structures to fine pathological details. Evaluated across six publicly available datasets, including Alzheimer’s disease and brain tumor classification, cerebral artery segmentation, and brain age prediction, DCL-SE consistently outperforms existing methods in accuracy, robustness, and interpretability. These findings underscore the critical importance of compact, task-specific architectures in the era of large-scale pretrained networks.
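
Approximate rank pooling, as introduced for dynamic images in video, collapses an ordered stack of frames into one image through a fixed weighted sum. The sketch below uses the simplified linear weights 2t − T − 1 and treats volume slices as the ordered sequence. Which axis DCL-SE pools over and which weight variant it uses are not stated in the abstract, so both are assumptions, as are the function names.

```python
def arp_weights(num_slices):
    """Linear approximate-rank-pooling weights: alpha_t = 2t - T - 1, t = 1..T."""
    T = num_slices
    return [2 * t - T - 1 for t in range(1, T + 1)]


def rank_pool(slices):
    """Collapse an ordered list of 2D slices into one 2D map by weighted sum."""
    weights = arp_weights(len(slices))
    rows, cols = len(slices[0]), len(slices[0][0])
    pooled = [[0.0] * cols for _ in range(rows)]
    for w, sl in zip(weights, slices):
        for r in range(rows):
            for c in range(cols):
                pooled[r][c] += w * sl[r][c]
    return pooled


print(arp_weights(3))                             # [-2, 0, 2]
print(rank_pool([[[1.0]], [[2.0]], [[3.0]]]))     # [[4.0]]
```

The weights sum to zero, so content that is constant along the pooled axis cancels out and the result emphasizes how intensity changes across the slices.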

[102] VisPlay: Self-Evolving Vision-Language Models from Images

Yicheng He, Chengsong Huang, Zongxia Li, Jiaxin Huang, Yonghui Yang

Main category: cs.CV

TL;DR: VisPlay is a self-evolving RL framework that enables Vision-Language Models to autonomously improve reasoning abilities using unlabeled image data through role-playing between questioner and reasoner components.

Motivation: Existing RL approaches for VLMs rely on human-annotated labels or task-specific heuristics to define verifiable rewards, both of which are costly and difficult to scale.

Method: Starts from a single base VLM and assigns it two interacting roles: an Image-Conditioned Questioner that formulates challenging yet answerable visual questions, and a Multimodal Reasoner that generates silver responses. Both roles are trained jointly with Group Relative Policy Optimization (GRPO), which incorporates diversity and difficulty rewards.

Result: Achieves consistent improvements in visual reasoning, compositional generalization, and hallucination reduction across eight benchmarks including MM-Vet and MMMU when trained on Qwen2.5-VL and MiMo-VL models.

Conclusion: Demonstrates a scalable path toward self-evolving multimodal intelligence by enabling autonomous improvement of VLMs using unlabeled data.

Abstract: Reinforcement learning (RL) provides a principled framework for improving Vision-Language Models (VLMs) on complex reasoning tasks. However, existing RL approaches often rely on human-annotated labels or task-specific heuristics to define verifiable rewards, both of which are costly and difficult to scale. We introduce VisPlay, a self-evolving RL framework that enables VLMs to autonomously improve their reasoning abilities using large amounts of unlabeled image data. Starting from a single base VLM, VisPlay assigns the model into two interacting roles: an Image-Conditioned Questioner that formulates challenging yet answerable visual questions, and a Multimodal Reasoner that generates silver responses. These roles are jointly trained with Group Relative Policy Optimization (GRPO), which incorporates diversity and difficulty rewards to balance the complexity of generated questions with the quality of the silver answers. VisPlay scales efficiently across two model families. When trained on Qwen2.5-VL and MiMo-VL, VisPlay achieves consistent improvements in visual reasoning, compositional generalization, and hallucination reduction across eight benchmarks, including MM-Vet and MMMU, demonstrating a scalable path toward self-evolving multimodal intelligence. The project page is available at https://bruno686.github.io/VisPlay/
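
The group-relative part of GRPO can be illustrated with a short sketch: each sampled response is scored against the other responses drawn for the same prompt, which removes the need for a learned value function. The reward values below are made up, and how VisPlay computes its diversity and difficulty rewards is not specified in the abstract, so this shows only the generic advantage normalization.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps). Using the
    population standard deviation here is an assumption; implementations vary.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for three responses sampled for one question.
advantages = group_relative_advantages([1.0, 2.0, 3.0])
```

Responses that score above the group mean receive a positive advantage and are reinforced; when every response in a group gets the same reward, all advantages are zero and that group contributes no learning signal.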

[103] SceneEdited: A City-Scale Benchmark for 3D HD Map Updating via Image-Guided Change Detection

Chun-Jung Lin, Tat-Jun Chin, Sourav Garg, Feras Dayoub

Main category: cs.CV

TL;DR: SceneEdited is the first city-scale dataset for HD map maintenance through 3D point cloud updating, containing over 800 scenes with 23,000+ synthesized object changes across 73 km of driving and approximately 3 km² of urban area.

Motivation: HD maps quickly become outdated as environments evolve, creating a need for methods that detect changes and update 3D representations, with a gap between 2D change detection and actual 3D map updating.

Method: Builds a dataset of over 800 up-to-date scenes covering 73 km of driving (approximately 3 km² of urban area), with 23,000+ synthesized object changes across 2000+ out-of-date versions; each scene includes calibrated RGB images, LiDAR scans, and change masks.

Result: Provides baseline methods that use an image-based structure-from-motion pipeline to update outdated scenes, plus a comprehensive toolkit supporting scalability, trackability, and portability.

Conclusion: Establishes a standardized benchmark for 3D map updating research with publicly available dataset and toolkit to support HD map maintenance.

Abstract: Accurate, up-to-date High-Definition (HD) maps are critical for urban planning, infrastructure monitoring, and autonomous navigation. However, these maps quickly become outdated as environments evolve, creating a need for robust methods that not only detect changes but also incorporate them into updated 3D representations. While change detection techniques have advanced significantly, there remains a clear gap between detecting changes and actually updating 3D maps, particularly when relying on 2D image-based change detection. To address this gap, we introduce SceneEdited, the first city-scale dataset explicitly designed to support research on HD map maintenance through 3D point cloud updating. SceneEdited contains over 800 up-to-date scenes covering 73 km of driving and approximately 3 $\text{km}^2$ of urban area, with more than 23,000 synthesized object changes created both manually and automatically across 2000+ out-of-date versions, simulating realistic urban modifications such as missing roadside infrastructure, buildings, overpasses, and utility poles. Each scene includes calibrated RGB images, LiDAR scans, and detailed change masks for training and evaluation. We also provide baseline methods using a foundational image-based structure-from-motion pipeline for updating outdated scenes, as well as a comprehensive toolkit supporting scalability, trackability, and portability for future dataset expansion and unification of out-of-date object annotations. Both the dataset and the toolkit are publicly available at https://github.com/ChadLin9596/ScenePoint-ETK, establishing a standardized benchmark for 3D map updating research.

[104] MoDES: Accelerating Mixture-of-Experts Multimodal Large Language Models via Dynamic Expert Skipping

Yushi Huang, Zining Wang, Zhihang Yuan, Yifu Ding, Ruihao Gong, Jinyang Guo, Xianglong Liu, Jun Zhang

Main category: cs.CV

TL;DR: MoDES is a training-free framework that skips redundant experts in Mixture-of-Experts MLLMs by integrating global layer importance into local routing and applying dual-modality thresholding, retaining more accuracy than prior expert-skipping methods while speeding up prefilling and decoding.

Motivation: Existing expert skipping methods, designed for unimodal LLMs, degrade performance when applied to MLLMs because they do not account for heterogeneous expert contributions across layers or for modality-specific token behaviors.

Method: Proposes MoDES with globally-modulated local gating (GMLG) to estimate per-token expert importance, dual-modality thresholding (DMT) for separate modality processing, and a frontier search algorithm for optimal threshold setting.

Result: Outperforms previous approaches across 13 benchmarks on 3 model series, with a gain of up to 10.67 percentage points (97.33% vs. 86.66%) when skipping 88% of experts for Qwen3-VL-MoE-30B-A3B-Instruct, plus a 2.16× prefilling speedup and a 1.26× decoding speedup.

Conclusion: MoDES enables efficient and accurate MoE MLLM inference by properly handling modality heterogeneity and layer-wise expert importance, significantly reducing computational overhead without training.

Abstract: Mixture-of-Experts (MoE) multimodal large language models (MLLMs) excel at vision-language tasks, but they are computationally inefficient. To reduce inference overhead, expert skipping methods have been proposed to deactivate redundant experts based on the current input tokens. However, we find that applying these methods, originally designed for unimodal large language models (LLMs), to MLLMs results in considerable performance degradation. This is primarily because such methods fail to account for the heterogeneous contributions of experts across MoE layers and modality-specific behaviors of tokens within these layers. Motivated by these findings, we propose MoDES, the first training-free framework that adaptively skips experts to enable efficient and accurate MoE MLLM inference. It incorporates a globally-modulated local gating (GMLG) mechanism that integrates global layer-wise importance into local routing probabilities to accurately estimate per-token expert importance. A dual-modality thresholding (DMT) method is then applied, which processes tokens from each modality separately, to derive the skipping schedule. To set the optimal thresholds, we introduce a frontier search algorithm that exploits monotonicity properties, cutting convergence time from several days to a few hours. Extensive experiments for 3 model series across 13 benchmarks demonstrate that MoDES far outperforms previous approaches. For instance, when skipping 88% of experts for Qwen3-VL-MoE-30B-A3B-Instruct, the performance boost is up to 10.67 percentage points (97.33% vs. 86.66%). Furthermore, MoDES significantly enhances inference speed, improving the prefilling time by 2.16$\times$ and the decoding time by 1.26$\times$.
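
As a rough illustration of how GMLG and DMT could fit together, the sketch below scales each routed expert's probability by a per-layer importance weight and then compares the result against a modality-specific threshold. The multiplicative combination, the threshold values, and the function name are all assumptions made for illustration; the abstract does not give the exact formulas.

```python
def experts_to_keep(routing_probs, layer_importance, modality, thresholds):
    """Return indices of experts kept for one token in one MoE layer.

    routing_probs:    router probabilities of the experts selected for the token
    layer_importance: global importance weight of this layer (assumed scalar)
    modality:         "text" or "vision", selects the threshold to apply
    thresholds:       per-modality skipping thresholds (hypothetical values)
    """
    tau = thresholds[modality]
    # Assumed form of globally-modulated local gating: global weight * local prob.
    scores = [layer_importance * p for p in routing_probs]
    return [i for i, s in enumerate(scores) if s >= tau]


# Hypothetical numbers: the same routing output is pruned harder for vision tokens.
thresholds = {"text": 0.1, "vision": 0.2}
kept_text = experts_to_keep([0.6, 0.3, 0.1], 0.5, "text", thresholds)
kept_vision = experts_to_keep([0.6, 0.3, 0.1], 0.5, "vision", thresholds)
```

With these made-up values the text token keeps experts 0 and 1 while the vision token keeps only expert 0, which is the kind of modality-dependent skipping schedule the abstract describes.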

[105] Multimodal Continual Instruction Tuning with Dynamic Gradient Guidance

Songze Li, Mingyu Gao, Tonghua Su, Xu-Yao Zhang, Zhongjie Wang

Main category: cs.CV

TL;DR: The paper addresses catastrophic forgetting in multimodal continual learning by treating it as missing gradients from old tasks and approximating them using geometric properties of the parameter space.

Motivation: Multimodal continual instruction tuning faces catastrophic forgetting where learning new tasks degrades performance on previous ones, limiting the practical deployment of multimodal large language models.

Method: Approximates missing gradients using directional vectors between current and previous optimal parameters, integrates with replay buffer gradients, and uses Bernoulli sampling to balance stability and plasticity.

Result: Achieves state-of-the-art performance on multimodal continual instruction tuning datasets without model expansion, effectively mitigating catastrophic forgetting while maintaining compact architecture.

Conclusion: The proposed gradient approximation approach successfully addresses catastrophic forgetting in continual learning, enabling effective sequential adaptation to new tasks while preserving previous knowledge.

Abstract: Multimodal continual instruction tuning enables multimodal large language models to sequentially adapt to new tasks while building upon previously acquired knowledge. However, this continual learning paradigm faces the significant challenge of catastrophic forgetting, where learning new tasks leads to performance degradation on previous ones. In this paper, we introduce a novel insight into catastrophic forgetting by conceptualizing it as a problem of missing gradients from old tasks during new task learning. Our approach approximates these missing gradients by leveraging the geometric properties of the parameter space, specifically using the directional vector between current parameters and previously optimal parameters as gradient guidance. This approximated gradient can be further integrated with real gradients from a limited replay buffer and regulated by a Bernoulli sampling strategy that dynamically balances model stability and plasticity. Extensive experiments on multimodal continual instruction tuning datasets demonstrate that our method achieves state-of-the-art performance without model expansion, effectively mitigating catastrophic forgetting while maintaining a compact architecture.
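
The gradient-guidance idea can be sketched as follows. The missing old-task gradient is approximated by the vector from the previously optimal parameters to the current ones, so a descent step along it pulls the model back toward the old optimum. A Bernoulli draw decides whether the guidance is applied at a given step. The additive combination with the new-task gradient, the `scale` factor, and the fixed probability `p_guide` are assumptions; the abstract says only that the sampling "dynamically balances" stability and plasticity.

```python
import random


def guided_gradient(theta, theta_old_opt, new_task_grad, p_guide,
                    scale=1.0, rng=random):
    """Combine the new-task gradient with an approximated old-task gradient.

    The approximation is (theta - theta_old_opt): subtracting it during
    gradient descent moves theta toward the previously optimal parameters.
    """
    use_guidance = rng.random() < p_guide  # Bernoulli(p_guide) draw
    if not use_guidance:
        return list(new_task_grad)
    return [
        g + scale * (t - t_old)
        for g, t, t_old in zip(new_task_grad, theta, theta_old_opt)
    ]
```

Setting `p_guide` close to 1 favors stability (the old optimum is enforced at almost every step), while values close to 0 favor plasticity.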

[106] Think Visually, Reason Textually: Vision-Language Synergy in ARC

Beichen Zhang, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, Jiaqi Wang

Main category: cs.CV

TL;DR: Vision-Language Synergy Reasoning (VLSR) and Modality-Switch Self-Correction (MSSC) improve ARC-AGI performance by up to 4.33% over text-only baselines by leveraging the complementary strengths of vision for pattern abstraction and language for rule execution.

Motivation: Current foundation models fail at abstract reasoning from minimal examples, in particular at inferring structured transformation rules from a handful of examples, which is a key hallmark of human intelligence. The ARC-AGI benchmark highlights this limitation.

Method: Two synergistic strategies: (1) Vision-Language Synergy Reasoning (VLSR) decomposes ARC-AGI into modality-aligned subtasks; (2) Modality-Switch Self-Correction (MSSC) uses vision to verify text-based reasoning for error correction.

Result: Achieves up to 4.33% improvement over text-only baselines across diverse flagship models and multiple ARC-AGI tasks.

Conclusion: Unifying visual abstraction with linguistic reasoning is crucial for achieving generalizable, human-like intelligence in future foundation models.

Abstract: Abstract reasoning from minimal examples remains a core unsolved problem for frontier foundation models such as GPT-5 and Grok 4. These models still fail to infer structured transformation rules from a handful of examples, which is a key hallmark of human intelligence. The Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI) provides a rigorous testbed for this capability, demanding conceptual rule induction and transfer to novel tasks. Most existing methods treat ARC-AGI as a purely textual reasoning task, overlooking the fact that humans rely heavily on visual abstraction when solving such puzzles. However, our pilot experiments reveal a paradox: naively rendering ARC-AGI grids as images degrades performance due to imprecise rule execution. This leads to our central hypothesis that vision and language possess complementary strengths across distinct reasoning stages: vision supports global pattern abstraction and verification, whereas language specializes in symbolic rule formulation and precise execution. Building on this insight, we introduce two synergistic strategies: (1) Vision-Language Synergy Reasoning (VLSR), which decomposes ARC-AGI into modality-aligned subtasks; and (2) Modality-Switch Self-Correction (MSSC), which leverages vision to verify text-based reasoning for intrinsic error correction. Extensive experiments demonstrate that our approach yields up to a 4.33% improvement over text-only baselines across diverse flagship models and multiple ARC-AGI tasks. Our findings suggest that unifying visual abstraction with linguistic reasoning is a crucial step toward achieving generalizable, human-like intelligence in future foundation models. Source code will be released soon.

[107] Learning Depth from Past Selves: Self-Evolution Contrast for Robust Depth Estimation

Jing Cao, Kui Jiang, Shenyi Li, Xiaocheng Feng, Yong Huang

Main category: cs.CV

TL;DR: SEC-Depth: A self-evolution contrastive learning framework for robust self-supervised depth estimation under adverse weather conditions like rain and fog.

Motivation: Existing self-supervised depth estimation methods suffer significant performance degradation under adverse weather conditions where reduced visibility impairs depth prediction.

Method: Proposes a self-evolution contrastive learning framework that leverages intermediate training parameters to construct temporally evolving latency models, with a self-evolution contrastive loss that treats historical model outputs as negative samples.

Result: The method integrates seamlessly into diverse baseline models and significantly enhances robustness in zero-shot evaluations under challenging weather conditions.

Conclusion: SEC-Depth mitigates performance loss in adverse weather and reduces the need for manual intervention by adaptively adjusting learning objectives and implicitly sensing weather degradation severity.

Abstract: Self-supervised depth estimation has gained significant attention in autonomous driving and robotics. However, existing methods exhibit substantial performance degradation under adverse weather conditions such as rain and fog, where reduced visibility critically impairs depth prediction. To address this issue, we propose a novel self-evolution contrastive learning framework called SEC-Depth for self-supervised robust depth estimation tasks. Our approach leverages intermediate parameters generated during training to construct temporally evolving latency models. Using these, we design a self-evolution contrastive scheme to mitigate performance loss under challenging conditions. Concretely, we first design a dynamic update strategy of latency models for the depth estimation task to capture optimization states across training stages. To effectively leverage latency models, we introduce a self-evolution contrastive loss (SECL) that treats outputs from historical latency models as negative samples. This mechanism adaptively adjusts learning objectives while implicitly sensing weather degradation severity, reducing the need for manual intervention. Experiments show that our method integrates seamlessly into diverse baseline models and significantly enhances robustness in zero-shot evaluations.

[108] MMCM: Multimodality-aware Metric using Clustering-based Modes for Probabilistic Human Motion Prediction

Kyotaro Tokoro, Hiromu Taketsugu, Norimichi Ukita

Main category: cs.CV

TL;DR: Proposes MMCM, a novel metric for evaluating multimodal human motion prediction that assesses both coverage across motion modes and kinematic validity, addressing limitations of existing metrics.

Motivation: Existing metrics for probabilistic human motion prediction fail to properly evaluate multimodal predictions: they reward wide distributions even for single-mode or kinematically invalid motions, and lack explicit evaluation of coverage across modes and of validity.

Method: MMCM divides motion space into clusters representing modes, uses these to explicitly evaluate coverage across multiple modes, and identifies valid modes by collecting possible future motions from motion datasets to assess kinematic validity.

Result: Experiments validate that the clustering approach yields sensible mode definitions and MMCM accurately scores multimodal predictions, providing better evaluation of both coverage and validity criteria.

Conclusion: MMCM successfully addresses the limitations of existing metrics by providing a comprehensive evaluation framework that properly assesses both multimodality coverage and kinematic validity in human motion prediction.

Abstract: This paper proposes a novel metric for Human Motion Prediction (HMP). Since a single past sequence can lead to multiple possible futures, a probabilistic HMP method predicts such multiple motions. While a single motion predicted by a deterministic method is evaluated only with the difference from its ground truth motion, multiple predicted motions should also be evaluated based on their distribution. For this evaluation, this paper focuses on the following two criteria. (a) Coverage: motions should be distributed among multiple motion modes to cover diverse possibilities. (b) Validity: motions should be kinematically valid as future motions observable from a given past motion. However, existing metrics simply reward widely distributed motions even if these motions are observed in a single mode and are kinematically invalid. To resolve these disadvantages, this paper proposes a Multimodality-aware Metric using Clustering-based Modes (MMCM). For (a) coverage, MMCM divides a motion space into several clusters, each of which is regarded as a mode. These modes are used to explicitly evaluate whether predicted motions are distributed among multiple modes. For (b) validity, MMCM identifies valid modes by collecting possible future motions from a motion dataset. Our experiments validate that our clustering yields sensible mode definitions and that MMCM accurately scores multimodal predictions. Code: https://github.com/placerkyo/MMCM
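
The two criteria can be illustrated with a toy coverage score: each predicted motion is assigned to its nearest cluster center (its mode), and the score is the fraction of valid modes that receive at least one prediction. This is a simplified sketch, not the MMCM formula, which the abstract does not spell out; the 2D "motions", the cluster centers, and the function names are hypothetical.

```python
import math


def nearest_mode(motion, centers):
    """Index of the cluster center closest to a motion vector."""
    return min(range(len(centers)), key=lambda k: math.dist(motion, centers[k]))


def mode_coverage(predictions, centers, valid_modes):
    """Fraction of valid modes hit by at least one prediction.

    valid_modes must be non-empty. Predictions falling into modes that are
    not valid for the given past motion earn no credit.
    """
    hit = {nearest_mode(p, centers) for p in predictions}
    return len(hit & set(valid_modes)) / len(valid_modes)


# Toy example: three modes, of which only modes 0 and 1 are valid futures.
centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
score = mode_coverage([(1.0, 0.0), (9.0, 1.0)], centers, valid_modes={0, 1})
```

Predictions that all land in a single mode cover only part of the valid modes no matter how spread out they are inside it, and predictions in an invalid mode add nothing. Both are behaviors the abstract says existing metrics fail to penalize.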

[109] Instruction-Guided Lesion Segmentation for Chest X-rays with Automatically Generated Large-Scale Dataset

Geon Choi, Hangyul Yoon, Hyunju Shin, Hyunki Park, Sang Hoon Seo, Eunho Yang, Edward Choi

Main category: cs.CV

TL;DR: The paper introduces the instruction-guided lesion segmentation (ILS) paradigm and the MIMIC-ILS dataset to enable chest X-ray lesion segmentation from simple user instructions, addressing the limitations of current models that require expert-level text inputs.

Motivation: Current chest X-ray lesion segmentation models are limited by a small number of target labels and by reliance on long, expert-level text inputs, creating barriers to practical clinical use.

Method: Builds the MIMIC-ILS dataset of 1.1M instruction-answer pairs using a fully automated multimodal pipeline that generates annotations from chest X-ray images and their reports. Fine-tunes ROSALIA, a vision-language model, on this dataset for instruction-guided segmentation.

Result: ROSALIA achieves high segmentation and textual accuracy, demonstrating effectiveness of the pipeline. MIMIC-ILS contains 192K images, 91K unique masks covering 7 major lesion types.

Conclusion: The ILS paradigm and MIMIC-ILS dataset provide a foundational resource for pixel-level chest X-ray lesion grounding, enabling user-friendly lesion segmentation through simple instructions.

Abstract: The applicability of current lesion segmentation models for chest X-rays (CXRs) has been limited both by a small number of target labels and the reliance on long, detailed expert-level text inputs, creating a barrier to practical use. To address these limitations, we introduce a new paradigm: instruction-guided lesion segmentation (ILS), which is designed to segment diverse lesion types based on simple, user-friendly instructions. Under this paradigm, we construct MIMIC-ILS, the first large-scale instruction-answer dataset for CXR lesion segmentation, using our fully automated multimodal pipeline that generates annotations from chest X-ray images and their corresponding reports. MIMIC-ILS contains 1.1M instruction-answer pairs derived from 192K images and 91K unique segmentation masks, covering seven major lesion types. To empirically demonstrate its utility, we introduce ROSALIA, a vision-language model fine-tuned on MIMIC-ILS. ROSALIA can segment diverse lesions and provide textual explanations in response to user instructions. The model achieves high segmentation and textual accuracy in our newly proposed task, highlighting the effectiveness of our pipeline and the value of MIMIC-ILS as a foundational resource for pixel-level CXR lesion grounding.

[110] BrainRotViT: Transformer-ResNet Hybrid for Explainable Modeling of Brain Aging from 3D sMRI

Wasif Jalal, Md Nafiu Rahman, M. Sohel Rahman

Main category: cs.CV

TL;DR: BrainRotViT is a hybrid model combining vision transformers and residual CNNs for accurate brain age estimation from MRI, achieving state-of-the-art performance across multiple datasets while providing interpretable aging biomarkers.

Motivation: Traditional brain age estimation methods face limitations like manual feature engineering, limited receptive fields, and overfitting. Pure transformers require large datasets and high computational costs, creating a need for more efficient and generalizable approaches.

Method: A hybrid architecture with ViT encoder trained on auxiliary age/sex classification, then frozen and applied to sagittal slices to generate embeddings. These are fed into a residual CNN regressor that incorporates subject sex at the final layer for continuous brain age prediction.

Result: Achieved MAE of 3.34 years (r=0.98, ρ=0.97, R²=0.95) across 11 MRI datasets from 130+ sites, outperforming baselines. Generalized well to 4 independent cohorts with MAEs 3.77-5.04 years. Identified aging patterns associated with Alzheimer’s, cognitive impairment, and autism.

Conclusion: BrainRotViT provides an efficient, interpretable, and generalizable framework for brain-age prediction, bridging CNN and transformer approaches while enabling new research avenues in aging and neurodegeneration.

Abstract: Accurate brain age estimation from structural MRI is a valuable biomarker for studying aging and neurodegeneration. Traditional regression and CNN-based methods face limitations such as manual feature engineering, limited receptive fields, and overfitting on heterogeneous data. Pure transformer models, while effective, require large datasets and high computational cost. We propose Brain ResNet over trained Vision Transformer (BrainRotViT), a hybrid architecture that combines the global context modeling of vision transformers (ViT) with the local refinement of residual CNNs. A ViT encoder is first trained on an auxiliary age and sex classification task to learn slice-level features. The frozen encoder is then applied to all sagittal slices to generate a 2D matrix of embedding vectors, which is fed into a residual CNN regressor that incorporates subject sex at the final fully-connected layer to estimate continuous brain age. Our method achieves an MAE of 3.34 years (Pearson $r=0.98$, Spearman $ρ=0.97$, $R^2=0.95$) on validation across 11 MRI datasets encompassing more than 130 acquisition sites, outperforming baseline and state-of-the-art models. It also generalizes well across 4 independent cohorts with MAEs between 3.77 and 5.04 years. Analyses on the brain age gap (the difference between the predicted age and actual age) show that aging patterns are associated with Alzheimer’s disease, cognitive impairment, and autism spectrum disorder. Model attention maps highlight aging-associated regions of the brain, notably the cerebellar vermis, precentral and postcentral gyri, temporal lobes, and medial superior frontal gyrus. Our results demonstrate that this method provides an efficient, interpretable, and generalizable framework for brain-age prediction, bridging the gap between CNN- and transformer-based approaches while opening new avenues for aging and neurodegeneration research.
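
For readers unfamiliar with the reported quantities, the sketch below computes the brain age gap as defined in the abstract (predicted age minus actual age) and the mean absolute error over a set of subjects. The ages are made-up illustrative values, not data from the paper.

```python
def brain_age_gaps(predicted, actual):
    """Brain age gap per subject: predicted age minus chronological age."""
    return [p - a for p, a in zip(predicted, actual)]


def mean_absolute_error(predicted, actual):
    """Average magnitude of the brain age gap, in years."""
    gaps = brain_age_gaps(predicted, actual)
    return sum(abs(g) for g in gaps) / len(gaps)


# Hypothetical ages in years for three subjects.
predicted_ages = [70.0, 64.0, 31.0]
actual_ages = [65.0, 66.0, 30.0]
gaps = brain_age_gaps(predicted_ages, actual_ages)
mae = mean_absolute_error(predicted_ages, actual_ages)
```

A positive gap means the brain appears older than the subject's chronological age. MAE discards the sign of the gap, so it measures prediction accuracy only, while the signed gap is the quantity the abstract relates to clinical conditions.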

[111] Insert In Style: A Zero-Shot Generative Framework for Harmonious Cross-Domain Object Composition

Raghu Vamsi Chittersu, Yuvraj Singh Rathore, Pranav Adlinge, Kunal Swami

Main category: cs.CV

TL;DR: Insert In Style is a zero-shot generative framework for high-fidelity object composition in stylized domains without requiring per-subject finetuning or text prompts.

Motivation: Existing methods fail when inserting real-world objects into stylized domains: practical "blenders" lack generative fidelity, while "generators" require impractical per-subject online finetuning.

Method: Multi-stage training protocol that disentangles identity, style, and composition representations, combined with specialized masked-attention architecture to enforce disentanglement and prevent concept interference.

Result: State-of-the-art performance on new benchmark, significantly outperforming existing methods on identity and style metrics, with strong user study validation.

Conclusion: The framework provides the first practical, high-fidelity zero-shot solution for stylized object composition without requiring text prompts or per-subject finetuning.

Abstract: Reference-based object composition methods fail when inserting real-world objects into stylized domains. This under-explored problem is currently split between practical “blenders” that lack generative fidelity and “generators” that require impractical, per-subject online finetuning. In this work, we introduce Insert In Style, the first zero-shot generative framework that is both practical and high-fidelity. Our core contribution is a unified framework with two key innovations: (i) a novel multi-stage training protocol that disentangles representations for identity, style, and composition, and (ii) a specialized masked-attention architecture that surgically enforces this disentanglement during generation. This approach prevents the concept interference common in general-purpose, unified-attention models. Our framework is trained on a new 100k sample dataset, curated from a novel data pipeline. This pipeline couples large-scale generation with a rigorous, two-stage filtering process to ensure both high-fidelity semantic identity and style coherence. Unlike prior work, our model is truly zero-shot and requires no text prompts. We also introduce a new public benchmark for stylized composition. We demonstrate state-of-the-art performance, significantly outperforming existing methods on both identity and style metrics, a result strongly corroborated by user studies.

[112] Physics-Based Benchmarking Metrics for Multimodal Synthetic Images

Kishor Datta Gupta, Marufa Kamal, Md. Mahfuzur Rahman, Fahad Rahman, Mohd Ariful Haque, Sunzida Siddique

Main category: cs.CV

TL;DR: Proposes PCMDE, a physics-constrained multimodal evaluation metric that combines LLMs with reasoning, knowledge mapping, and VLMs to address limitations of current metrics like BLEU, CIDEr, and CLIPScore in capturing semantic/structural accuracy.

Motivation: Current evaluation metrics fail to adequately capture semantic or structural accuracy, especially in domain-specific or context-dependent scenarios, necessitating a more comprehensive evaluation approach.

Method: Three-stage architecture: (1) multimodal feature extraction using object detection and VLMs; (2) Confidence-Weighted Component Fusion for adaptive validation; (3) physics-guided reasoning with LLMs for structural/relational constraint enforcement.

Result: The abstract reports no quantitative results; PCMDE is presented as a metric designed to overcome the limitations of existing metrics by incorporating physics constraints and multimodal reasoning.

Conclusion: The proposed PCMDE framework provides a more robust evaluation approach that can better assess semantic and structural accuracy through physics-constrained multimodal reasoning.

Abstract: Current state-of-the-art measures like BLEU, CIDEr, VQA score, SigLIP-2 and CLIPScore are often unable to capture semantic or structural accuracy, especially for domain-specific or context-dependent scenarios. To overcome these limitations, this paper proposes a Physics-Constrained Multimodal Data Evaluation (PCMDE) metric that combines large language models with reasoning, knowledge-based mapping, and vision-language models. The architecture comprises three main stages: (1) extraction of spatial and semantic multimodal features through object detection and VLMs; (2) Confidence-Weighted Component Fusion for adaptive component-level validation; and (3) physics-guided reasoning using large language models to enforce structural and relational constraints (e.g., alignment, position, consistency).

[113] SkinGPT-R1: Adapter-Only Dual Distillation for Efficient Dermatology Reasoning

Yuhao Shen, Jiahe Qian, Zhangtianyi Chen, Yuanhao He, Juexiao Zhou

Main category: cs.CV

TL;DR: SkinGPT-R1 is a dermatology-focused vision-language model that uses explicit chain-of-thought reasoning for skin disease diagnosis, ranking first among 14 models on DermBench with an average score about 41% higher than Vision-R1.

Motivation: To create a dermatology AI system that makes diagnostic reasoning transparent, verifiable, and clinically aligned by developing standardized chain-of-thought narratives and evaluation metrics.

Method: Built DermCoT corpus with 13,000 dermatology cases (10,000 filtered + 3,000 certified), developed DermEval 6-dimensional evaluator and DermBench benchmark, and implemented dermatology-aware visual distillation on top of chain-of-thought supervision.

Result: SkinGPT-R1 achieved 4.031/5 average score on DermBench, ranking 1st among 14 models with 41% improvement over Vision-R1, plus stable accuracy gains on three dermatology classification benchmarks.

Conclusion: DermCoT-based chain-of-thought supervision significantly improves dermatology AI performance, and adding dermatology-aware visual distillation provides consistent gains in both reasoning quality and disease recognition.

Abstract: We present SkinGPT-R1, a dermatology-focused vision-language model that makes diagnostic chain-of-thought reasoning explicit, step by step, and verifiable. To support skin-specific reasoning, we build DermCoT, a corpus of standardized dermatologic chain-of-thought narratives that combines 10,000 DermEval-filtered training cases with 3,000 dermatologist-scored certified cases, and we define DermEval as a physician-aligned six-dimensional evaluator and DermBench as the corresponding benchmark for dermatologic chain-of-thought quality. On DermBench, across 14 general, reasoning, and medical vision-language models, SkinGPT-R1 achieves an average score of 4.031 out of 5 over the six clinician-defined dimensions, ranks 1st among all systems, and improves the average score over Vision-R1 by about 41%. On three dermatology classification benchmarks, SkinGPT-R1 delivers stable accuracy gains over Vision-R1 and remains competitive among strong vision-language models. Ablation results further show that DermCoT-based chain-of-thought supervision provides substantial improvements over the base model and that adding dermatology-aware visual distillation yields consistent additional gains in both narrative quality and recognition.

[114] SplitFlux: Learning to Decouple Content and Style from a Single Image

Yitong Yang, Yinglin Wang, Changshuo Wang, Yongjun Zhang, Ziyang Chen, Shuting He

Main category: cs.CV

TL;DR: SplitFlux is a novel method that effectively disentangles image content and style in Flux models by fine-tuning single stream blocks via LoRA, enabling high-quality customized image generation with superior content preservation and stylization.

DetailsMotivation: Existing SDXL-based methods struggle with high-quality results, and the recent Flux model fails to achieve effective content-style separation due to its underexplored characteristics, creating a need for better disentanglement approaches.

Method: SplitFlux uses two key components: (1) Rank-Constrained Adaptation that compresses rank and amplifies magnitude updates in specific blocks to prevent content leakage, and (2) Visual-Gated LoRA that splits content LoRA into high-rank (primary subject) and low-rank (residual details) branches guided by image saliency.

Result: Extensive experiments show SplitFlux consistently outperforms state-of-the-art methods, achieving superior content preservation and stylization quality across diverse scenarios.

Conclusion: SplitFlux successfully addresses content-style disentanglement challenges in Flux models through systematic analysis and targeted fine-tuning of single stream blocks, enabling effective re-embedding of disentangled content into new contexts.

Abstract: Disentangling image content and style is essential for customized image generation. Existing SDXL-based methods struggle to achieve high-quality results, while the recently proposed Flux model fails to achieve effective content-style separation due to its underexplored characteristics. To address these challenges, we conduct a systematic analysis of Flux and make two key observations: (1) single stream blocks are essential for image generation; and (2) early single stream blocks mainly control content, whereas later blocks govern style. Based on these insights, we propose SplitFlux, which disentangles content and style by fine-tuning the single stream blocks via LoRA, enabling the disentangled content to be re-embedded into new contexts. It includes two key components: (1) Rank-Constrained Adaptation. To preserve content identity and structure, we compress the rank and amplify the magnitude of updates within specific blocks, preventing content leakage into style blocks. (2) Visual-Gated LoRA. We split the content LoRA into two branches with different ranks, guided by image saliency. The high-rank branch preserves primary subject information, while the low-rank branch encodes residual details, mitigating content overfitting and enabling seamless re-embedding. Extensive experiments demonstrate that SplitFlux consistently outperforms state-of-the-art methods, achieving superior content preservation and stylization quality across diverse scenarios.

[115] Graph Query Networks for Object Detection with Automotive Radar

Loveneet Saini, Hasan Tercan, Tobias Meisen

Main category: cs.CV

TL;DR: Graph Query Networks (GQN) is a novel attention-based framework for 3D radar object detection that models objects as graphs, achieving significant performance improvements over prior methods while reducing computational overhead.

DetailsMotivation: Traditional grid and sequence-based detectors struggle with radar's sparse and irregular reflections, necessitating a new approach that can better handle radar's unique data characteristics for 360-degree automotive perception.

Method: GQN uses graph queries to dynamically attend over BEV space, constructing object-specific graphs processed by EdgeFocus for relational reasoning and DeepContext Pooling for contextual aggregation.

Result: On NuScenes dataset, GQN improves relative mAP by up to +53%, including +8.2% gain over strongest prior radar method, while reducing peak graph construction overhead by 80% with moderate FLOPs cost.

Conclusion: GQN effectively addresses radar detection challenges through graph-based modeling and attention mechanisms, demonstrating superior performance and efficiency for automotive radar perception.

Abstract: Object detection with 3D radar is essential for 360-degree automotive perception, but radar’s long wavelengths produce sparse and irregular reflections that challenge traditional grid and sequence-based convolutional and transformer detectors. This paper introduces Graph Query Networks (GQN), an attention-based framework that models objects sensed by radar as graphs, to extract individualized relational and contextual features. GQN employs a novel concept of graph queries to dynamically attend over the bird’s-eye view (BEV) space, constructing object-specific graphs processed by two novel modules: EdgeFocus for relational reasoning and DeepContext Pooling for contextual aggregation. On the NuScenes dataset, GQN improves relative mAP by up to +53%, including a +8.2% gain over the strongest prior radar method, while reducing peak graph construction overhead by 80% with moderate FLOPs cost.

[116] Edge-Centric Relational Reasoning for 3D Scene Graph Prediction

Yanni Ma, Hao Liu, Yulan Guo, Theo Gevers, Martin R. Oswald

Main category: cs.CV

TL;DR: LEO is a framework that transforms scene graphs into line graphs to enable edge-centric relational reasoning, capturing high-order dependencies for improved 3D scene graph prediction.

DetailsMotivation: Existing object-centric graph neural networks restrict relation representations to pairwise object context, making it difficult to capture essential high-order relational dependencies for accurate relation prediction.

Method: LEO first predicts potential links between object pairs to suppress irrelevant edges, transforms the scene graph into a line graph where relations become nodes, applies line graph neural networks for edge-centric reasoning, and integrates enriched relation features back into the object-centric graph.

Result: Experiments on the 3DSSG dataset with two competitive baselines show consistent improvements, demonstrating the effectiveness of the edge-to-object reasoning paradigm.

Conclusion: The proposed LEO framework enables progressive reasoning from relation-level context to object-level understanding, is model-agnostic, and significantly improves 3D scene graph prediction by capturing high-order relational dependencies.

Abstract: 3D scene graph prediction aims to abstract complex 3D environments into structured graphs consisting of objects and their pairwise relationships. Existing approaches typically adopt object-centric graph neural networks, where relation edge features are iteratively updated by aggregating messages from connected object nodes. However, this design inherently restricts relation representations to pairwise object context, making it difficult to capture high-order relational dependencies that are essential for accurate relation prediction. To address this limitation, we propose a Link-guided Edge-centric relational reasoning framework with Object-aware fusion, namely LEO, which enables progressive reasoning from relation-level context to object-level understanding. Specifically, LEO first predicts potential links between object pairs to suppress irrelevant edges, and then transforms the original scene graph into a line graph where each relation is treated as a node. A line graph neural network is applied to perform edge-centric relational reasoning to capture inter-relation context. The enriched relation features are subsequently integrated into the original object-centric graph to enhance object-level reasoning and improve relation prediction. Our framework is model-agnostic and can be integrated with any existing object-centric method. Experiments on the 3DSSG dataset with two competitive baselines show consistent improvements, highlighting the effectiveness of our edge-to-object reasoning paradigm.
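
The link-pruning and line-graph steps described above can be illustrated with a small sketch. This is not the LEO implementation: the function names, the link-score dictionary, and the 0.5 threshold are assumptions made for illustration, and connecting two relations when they share an object is the standard line-graph construction rather than a detail confirmed by the abstract.

```python
from itertools import combinations


def prune_edges(link_scores, threshold=0.5):
    """Keep only object pairs whose (hypothetical) link score passes the threshold."""
    return [pair for pair, score in link_scores.items() if score >= threshold]


def to_line_graph(edges):
    """Turn each relation (subject, object) into a node of the line graph.

    Two relation nodes are connected when their relations share an object,
    which is the standard line-graph construction.
    """
    adjacency = {edge: set() for edge in edges}
    for a, b in combinations(edges, 2):
        if set(a) & set(b):
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


# Hypothetical link scores for candidate object pairs in one scene.
link_scores = {
    ("chair", "floor"): 0.9,
    ("table", "floor"): 0.8,
    ("lamp", "table"): 0.7,
    ("book", "shelf"): 0.6,
    ("lamp", "chair"): 0.1,  # falls below the threshold and is dropped
}
kept_edges = prune_edges(link_scores)
line_graph = to_line_graph(kept_edges)
```

In this toy scene the relation ("table", "floor") ends up adjacent to both ("chair", "floor") and ("lamp", "table"), so a message-passing layer on the line graph could exchange context between relations directly instead of routing it through object nodes.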

[117] Taming Generative Synthetic Data for X-ray Prohibited Item Detection

Jialong Sun, Hongguang Zhu, Weizhe Liu, Yunda Sun, Renshuai Tao, Yunchao Wei

Main category: cs.CV

TL;DR: Xsyn is a one-stage X-ray security image synthesis method using text-to-image generation that eliminates labor-intensive foreground extraction, achieving better detection performance than previous two-stage methods.

DetailsMotivation: Traditional X-ray security image synthesis requires time-consuming foreground extraction and annotation, making data collection expensive and inefficient for prohibited item detection models.

Method: Proposes a one-stage pipeline using text-to-image generation with Cross-Attention Refinement (CAR) for bounding box annotation and Background Occlusion Modeling (BOM) to enhance imaging complexity in latent space.

Result: Achieves 1.2% mAP improvement over previous methods and demonstrates improved prohibited item detection performance across various X-ray security datasets and detectors.

Conclusion: Xsyn is the first method to achieve high-quality X-ray security image synthesis without extra labor cost, providing an efficient solution for scaling up training datasets.

Abstract: Training prohibited item detection models requires a large amount of X-ray security images, but collecting and annotating these images is time-consuming and laborious. To address data insufficiency, X-ray security image synthesis methods composite images to scale up datasets. However, previous methods primarily follow a two-stage pipeline, where they implement labor-intensive foreground extraction in the first stage and then composite images in the second stage. Such a pipeline introduces inevitable extra labor cost and is not efficient. In this paper, we propose a one-stage X-ray security image synthesis pipeline (Xsyn) based on text-to-image generation, which incorporates two effective strategies to improve the usability of synthetic images. The Cross-Attention Refinement (CAR) strategy leverages the cross-attention map from the diffusion model to refine the bounding box annotation. The Background Occlusion Modeling (BOM) strategy explicitly models background occlusion in the latent space to enhance imaging complexity. To the best of our knowledge, compared with previous methods, Xsyn is the first to achieve high-quality X-ray security image synthesis without extra labor cost. Experiments demonstrate that our method outperforms all previous methods with 1.2% mAP improvement, and the synthetic images generated by our method are beneficial to improve prohibited item detection performance across various X-ray security datasets and detectors. Code is available at https://github.com/pILLOW-1/Xsyn/.
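
The abstract does not say how CAR turns a cross-attention map into a refined box, so the following is only a rough sketch of one plausible approach: keep the cells whose attention is at least a fraction of the peak and take their tight bounding box. The function name, the relative threshold of 0.5, and the toy attention map are all assumptions.

```python
def bbox_from_attention(attn, rel_threshold=0.5):
    """Return (x_min, y_min, x_max, y_max) around high-attention cells.

    attn is a 2D list of non-negative attention weights (rows are y, columns are x).
    A cell counts as foreground when its weight is at least rel_threshold * peak.
    """
    peak = max(max(row) for row in attn)
    cutoff = rel_threshold * peak
    rows = [y for y, row in enumerate(attn) if any(v >= cutoff for v in row)]
    cols = [x for x in range(len(attn[0])) if any(row[x] >= cutoff for row in attn)]
    return (min(cols), min(rows), max(cols), max(rows))


# Hypothetical 4x5 cross-attention map for the token naming the prohibited item.
attention_map = [
    [0.0, 0.1, 0.0, 0.0, 0.0],
    [0.0, 0.6, 0.9, 0.1, 0.0],
    [0.1, 0.7, 0.8, 0.2, 0.0],
    [0.0, 0.0, 0.1, 0.0, 0.0],
]
box = bbox_from_attention(attention_map)
```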

[118] Text2Loc++: Generalizing 3D Point Cloud Localization from Natural Language

Yan Xia, Letian Shi, Yilin Di, Joao F. Henriques, Daniel Cremers

Main category: cs.CV

TL;DR: Text2Loc++ is a neural network for 3D point cloud localization using natural language descriptions, featuring a coarse-to-fine pipeline with novel training techniques and achieving 15% improvement over existing methods.

DetailsMotivation: To address the challenge of localizing 3D point cloud submaps using complex and diverse natural language descriptions in urban environments, requiring effective cross-modal alignment between language and point clouds.

Method: Uses a coarse-to-fine pipeline: global place recognition with Hierarchical Transformer with Max pooling (HTM) and attention-based point cloud encoder, Masked Instance Training (MIT) for robustness, Modality-aware Hierarchical Contrastive Learning (MHCL) for embedding enhancement, and fine localization with Prototype-based Map Cloning (PMC) and Cascaded Cross-Attention Transformer (CCAT).

Result: Outperforms existing methods by up to 15% on KITTI360Pose dataset and shows robust generalization on new city-scale dataset with diverse urban scenes and linguistic complexity levels.

Conclusion: Text2Loc++ effectively handles complex linguistic expressions and diverse urban environments through its novel architecture and training techniques, demonstrating superior performance in 3D point cloud localization from natural language descriptions.

Abstract: We tackle the problem of localizing 3D point cloud submaps using complex and diverse natural language descriptions, and present Text2Loc++, a novel neural network designed for effective cross-modal alignment between language and point clouds in a coarse-to-fine localization pipeline. To support benchmarking, we introduce a new city-scale dataset covering both color and non-color point clouds from diverse urban scenes, and organize location descriptions into three levels of linguistic complexity. In the global place recognition stage, Text2Loc++ combines a pretrained language model with a Hierarchical Transformer with Max pooling (HTM) for sentence-level semantics, and employs an attention-based point cloud encoder for spatial understanding. We further propose Masked Instance Training (MIT) to filter out non-aligned objects and improve multimodal robustness. To enhance the embedding space, we introduce Modality-aware Hierarchical Contrastive Learning (MHCL), incorporating cross-modal, submap-, text-, and instance-level losses. In the fine localization stage, we completely remove explicit text-instance matching and design a lightweight yet powerful framework based on Prototype-based Map Cloning (PMC) and a Cascaded Cross-Attention Transformer (CCAT). Extensive experiments on the KITTI360Pose dataset show that Text2Loc++ outperforms existing methods by up to 15%. In addition, the proposed model exhibits robust generalization when evaluated on the new dataset, effectively handling complex linguistic expressions and a wide variety of urban environments. The code and dataset will be made publicly available.
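
As a rough illustration of the cross-modal term in a contrastive objective such as MHCL, the sketch below computes a symmetric InfoNCE loss between text and point cloud embeddings. The abstract does not give the loss formula, so the InfoNCE form, the temperature of 0.1, and all names here are assumptions; the submap-, text-, and instance-level terms are not shown.

```python
import math


def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def contrastive_loss(text_emb, cloud_emb, temperature=0.1):
    """Average of text-to-cloud and cloud-to-text InfoNCE losses.

    Row i of text_emb is assumed to describe row i of cloud_emb.
    """
    text = [normalize(t) for t in text_emb]
    cloud = [normalize(c) for c in cloud_emb]
    n = len(text)
    sims = [[sum(a * b for a, b in zip(t, c)) / temperature for c in cloud] for t in text]

    def cross_entropy(rows):
        # Cross-entropy with the matching pair (the diagonal) as the target.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_sum = m + math.log(sum(math.exp(s - m) for s in row))
            total += log_sum - row[i]
        return total / n

    transposed = [list(col) for col in zip(*sims)]
    return 0.5 * (cross_entropy(sims) + cross_entropy(transposed))


# Toy 2D embeddings: matched pairs point the same way, shuffled pairs do not.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 3.0]])
shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 3.0], [2.0, 0.0]])
```

The loss is close to zero when each description aligns with its own submap and large when the pairing is shuffled, which is the behavior a retrieval-oriented embedding space needs.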

[119] Adapt-As-You-Walk Through the Clouds: Training-Free Online Test-Time Adaptation of 3D Vision-Language Foundation Models

Mehran Tamjidi, Hamidreza Dastmalchi, Mohammadreza Alimoradijazi, Ali Cheraghian, Aijun An, Morteza Saberi

Main category: cs.CV

TL;DR: Uni-Adapter is a training-free online test-time adaptation method for 3D vision-language foundation models that uses dynamic prototype learning to handle noisy, incomplete, or distribution-shifted data.

DetailsMotivation: 3D vision-language foundation models underperform in practical scenarios with noisy, incomplete, or distribution-shifted data, requiring adaptation without retraining.

Method: Uses dynamic prototype learning with a 3D cache storing class-specific cluster centers, graph-based label smoothing for inter-prototype consistency, and entropy-weighted aggregation to unify predictions.

Result: Achieves state-of-the-art performance: 10.55% improvement on ModelNet-40C, 8.26% on ScanObjectNN-C, and 4.49% on ShapeNet-C over source models.

Conclusion: Uni-Adapter effectively mitigates distribution shifts in 3D vision-language foundation models without retraining, demonstrating strong generalization across diverse 3D benchmarks.

Abstract: 3D Vision-Language Foundation Models (VLFMs) have shown strong generalization and zero-shot recognition capabilities in open-world point cloud processing tasks. However, these models often underperform in practical scenarios where data are noisy, incomplete, or drawn from a different distribution than the training data. To address this, we propose Uni-Adapter, a novel training-free online test-time adaptation (TTA) strategy for 3D VLFMs based on dynamic prototype learning. We define a 3D cache to store class-specific cluster centers as prototypes, which are continuously updated to capture intra-class variability in heterogeneous data distributions. These dynamic prototypes serve as anchors for cache-based logit computation via similarity scoring. Simultaneously, a graph-based label smoothing module captures inter-prototype similarities to enforce label consistency among similar prototypes. Finally, we unify predictions from the original 3D VLFM and the refined 3D cache using entropy-weighted aggregation for reliable adaptation. Without retraining, Uni-Adapter effectively mitigates distribution shifts, achieving state-of-the-art performance on diverse 3D benchmarks over different 3D VLFMs, improving ModelNet-40C by 10.55%, ScanObjectNN-C by 8.26%, and ShapeNet-C by 4.49% over the source 3D VLFMs.
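
The final fusion step can be sketched as follows. The abstract names entropy-weighted aggregation but gives no formula, so weighting each branch by exp(-entropy) of its own prediction is an assumption, as are the function names and the toy logits.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def entropy_weighted_fusion(model_logits, cache_logits):
    """Blend two predictions, trusting the lower-entropy one more."""
    p_model, p_cache = softmax(model_logits), softmax(cache_logits)
    w_model = math.exp(-entropy(p_model))  # assumed weighting rule
    w_cache = math.exp(-entropy(p_cache))
    total = w_model + w_cache
    return [(w_model * a + w_cache * b) / total for a, b in zip(p_model, p_cache)]


# Hypothetical case: the source model is undecided, the cache is confident in class 1.
fused = entropy_weighted_fusion([0.0, 0.0, 0.0], [0.0, 5.0, 0.0])
```

Because the undecided branch has maximal entropy, its weight is small and the fused prediction follows the confident cache branch.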

[120] A Multimodal Transformer Approach for UAV Detection and Aerial Object Recognition Using Radar, Audio, and Video Data

Mauro Larrat, Claudomiro Sales

Main category: cs.CV

TL;DR: A multimodal Transformer model integrating radar, RGB video, IR video, and audio achieves state-of-the-art UAV detection with 98.12% accuracy and real-time performance (41.11 FPS).

DetailsMotivation: Overcome limitations of single-modality approaches for UAV detection and aerial object recognition in surveillance and security applications.

Method: Design a novel multimodal Transformer model that fuses diverse data streams (radar, RGB video, IR video, audio) using self-attention mechanisms to learn complementary representations.

Result: Achieved exceptional performance: 0.9812 accuracy, 0.9873 recall, 0.9787 precision, 0.9826 F1-score, 0.9954 specificity. High efficiency with 1.09 GFLOPs, 1.22M parameters, and 41.11 FPS inference speed.

Conclusion: Multimodal data fusion via Transformer architecture provides a highly accurate, resilient solution for real-time UAV detection and monitoring in complex airspace.

Abstract: Unmanned aerial vehicle (UAV) detection and aerial object recognition are critical for modern surveillance and security, prompting a need for robust systems that overcome limitations of single-modality approaches. This research addresses these challenges by designing and rigorously evaluating a novel multimodal Transformer model that integrates diverse data streams: radar, visual band video (RGB), infrared (IR) video, and audio. The architecture effectively fuses distinct features from each modality, leveraging the Transformer’s self-attention mechanisms to learn comprehensive, complementary, and highly discriminative representations for classification. The model demonstrated exceptional performance on an independent test set, achieving macro-averaged metrics of 0.9812 accuracy, 0.9873 recall, 0.9787 precision, 0.9826 F1-score, and 0.9954 specificity. Notably, it exhibited particularly high precision and recall in distinguishing drones from other aerial objects. Furthermore, computational analysis confirmed its efficiency, with 1.09 GFLOPs, 1.22 million parameters, and an inference speed of 41.11 FPS, highlighting its suitability for real-time applications. This study presents a significant advancement in aerial object classification, validating the efficacy of multimodal data fusion via a Transformer architecture for achieving state-of-the-art performance, thereby offering a highly accurate and resilient solution for UAV detection and monitoring in complex airspace.
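
For readers who want to see how macro-averaged figures such as those above are usually derived, here is a minimal sketch that computes them from a multiclass confusion matrix using the standard one-vs-rest definitions. The function name and the 2x2 toy matrix are hypothetical and are not data from the paper.

```python
def macro_metrics(confusion):
    """Macro-averaged metrics from a square confusion matrix.

    confusion[i][j] counts samples of true class i predicted as class j.
    Each class is scored one-vs-rest, then the per-class scores are averaged.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    sums = {"accuracy": 0.0, "precision": 0.0, "recall": 0.0, "specificity": 0.0, "f1": 0.0}
    for k in range(n):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp
        fp = sum(confusion[i][k] for i in range(n)) - tp
        tn = total - tp - fn - fp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        specificity = tn / (tn + fp) if tn + fp else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        sums["accuracy"] += (tp + tn) / total
        sums["precision"] += precision
        sums["recall"] += recall
        sums["specificity"] += specificity
        sums["f1"] += f1
    return {name: value / n for name, value in sums.items()}


# Hypothetical two-class example (rows: true class, columns: predicted class).
metrics = macro_metrics([[5, 1], [2, 4]])
```

Note that a macro-averaged F1 is the mean of per-class F1 scores, so it generally differs from the harmonic mean of the macro-averaged precision and recall.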

[121] What Your Features Reveal: Data-Efficient Black-Box Feature Inversion Attack for Split DNNs

Zhihan Ren, Lijun He, Jiaxi Liang, Xinzhu Fu, Haixia Bi, Fan Li

Main category: cs.CV

TL;DR: FIA-Flow is a black-box feature inversion attack framework that achieves high-fidelity image reconstruction from intermediate features in Split DNNs, revealing more severe privacy risks than previously recognized.

DetailsMotivation: Split DNNs expose privacy vulnerabilities as intermediate features can be exploited to reconstruct private inputs via Feature Inversion Attack (FIA), but existing methods produce limited reconstruction quality, making it difficult to assess true privacy leakage.

Method: FIA-Flow uses Latent Feature Space Alignment Module (LFSAM) to bridge semantic gap between intermediate features and latent space, and Deterministic Inversion Flow Matching (DIFM) to project off-manifold features onto target manifold with one-step inference, enabling effective training with few image-feature pairs.

Result: Experiments show FIA-Flow achieves more faithful and semantically aligned feature inversion across various models (AlexNet, ResNet, Swin Transformer, DINO, YOLO11) and layers, outperforming existing methods.

Conclusion: FIA-Flow reveals a more severe privacy threat in Split DNNs than previously recognized, demonstrating high-fidelity image reconstruction from intermediate features and highlighting significant privacy risks.

Abstract: Split DNNs enable edge devices by offloading intensive computation to a cloud server, but this paradigm exposes privacy vulnerabilities, as the intermediate features can be exploited to reconstruct the private inputs via Feature Inversion Attack (FIA). Existing FIA methods often produce limited reconstruction quality, making it difficult to assess the true extent of privacy leakage. To reveal the privacy risk of the leaked features, we introduce FIA-Flow, a black-box FIA framework that achieves high-fidelity image reconstruction from intermediate features. To exploit the semantic information within intermediate features, we design a Latent Feature Space Alignment Module (LFSAM) to bridge the semantic gap between the intermediate feature space and the latent space. Furthermore, to rectify distributional mismatch, we develop Deterministic Inversion Flow Matching (DIFM), which projects off-manifold features onto the target manifold with one-step inference. This decoupled design simplifies learning and enables effective training with few image-feature pairs. To quantify privacy leakage from a human perspective, we also propose two metrics based on a large vision-language model. Experiments show that FIA-Flow achieves more faithful and semantically aligned feature inversion across various models (AlexNet, ResNet, Swin Transformer, DINO, and YOLO11) and layers, revealing a more severe privacy threat in Split DNNs than previously recognized.

[122] Adaptive thresholding pattern for fingerprint forgery detection

Zahra Farzadpour, Masoumeh Azghani

Main category: cs.CV

TL;DR: A fingerprint liveness detection method using adaptive thresholding and wavelet transform to distinguish fake fingerprints from real ones, with enhanced resistance to various distortions.

DetailsMotivation: To address the threat of fingerprint spoofing in biometric systems by developing a robust detection technique that can withstand various distortions like noise, pixel missing, and block missing.

Method: Uses anisotropic diffusion, three-level wavelet transform, adaptive thresholding of coefficients, feature vector concatenation, and SVM classification for fingerprint forgery detection.

Result: Outperforms existing methods by approximately 8% in accuracy for 90% pixel missing scenarios and 5% for 70x70 block missing scenarios.

Conclusion: The proposed approach demonstrates superior performance in detecting fake fingerprints while maintaining robustness against various distortions, making it effective for secure biometric authentication systems.

Abstract: Fingerprint liveness detection systems have been affected by spoofing, which is a severe threat to fingerprint-based biometric systems. Therefore, it is crucial to develop techniques to distinguish fake fingerprints from real ones. Software-based techniques can detect fingerprint forgery automatically. The scheme should also be resistant to various distortions such as noise contamination, pixel missing, and block missing, so that forgers cannot deceive the detector by adding distortions to the faked fingerprint. In this paper, we propose a fingerprint forgery detection algorithm based on a suggested adaptive thresholding pattern. The anisotropic diffusion of the input image is passed through three levels of the wavelet transform. The coefficients of different layers are adaptively thresholded and concatenated to produce the feature vector, which is classified using an SVM classifier. Another contribution of the paper is to investigate the effect of various distortions such as pixel missing, block missing, and noise contamination. Our suggested approach includes a novel method that exhibits improved resistance to a range of distortions caused by environmental phenomena or manipulations by malicious users. In quantitative comparisons, our proposed method outperforms its counterparts by approximately 8% and 5% in accuracy for missing pixel scenarios of 90% and block missing scenarios of size 70x70, respectively. This highlights the novelty of the approach in addressing such challenges.

[123] Fast Post-Hoc Confidence Fusion for 3-Class Open-Set Aerial Object Detection

Spyridon Loukovitis, Vasileios Karampinis, Athanasios Voulodimos

Main category: cs.CV

TL;DR: A lightweight post-processing framework for UAV navigation that enables real-time three-way classification between in-domain targets, out-of-distribution objects, and background, improving both open-set detection performance and closed-set accuracy.

DetailsMotivation: Existing UAV navigation systems lack robust open-set detection capabilities, typically using single uncertainty scores that conflate OOD objects with background clutter, limiting safety and reliability in real-world scenarios.

Method: Proposes a model-agnostic post-processing framework using a compact MLP that fuses multiple confidence estimates and per-detection features, extending binary ID/OOD classification to three-way classification (ID targets, OOD objects, background).

Result: Outperforms threshold-based baselines by 2.7% AUROC in binary classification, improves open-set mAP, and achieves up to 9-point (18%) improvement in closed-set mAP while maintaining real-time throughput.

Conclusion: The framework enables robust three-class classification critical for safe UAV navigation, where OOD objects must be avoided and background safely ignored, demonstrating superior performance across datasets without compromising efficiency.

Abstract: Developing reliable UAV navigation systems requires robust air-to-air object detectors capable of distinguishing between objects seen during training and previously unseen objects. While many methods address closed-set detection and achieve high-confidence recognition of in-domain (ID) targets, they generally do not tackle open-set detection, which requires simultaneous handling of both ID and out-of-distribution (OOD) objects. Existing open-set approaches typically rely on a single uncertainty score with thresholding, limiting flexibility and often conflating OOD objects with background clutter. In contrast, we propose a lightweight, model-agnostic post-processing framework that explicitly separates background from unknown objects while preserving the base detector’s performance. Our approach extends open-set detection beyond binary ID/OOD classification to real-time three-way classification among ID targets, OOD objects, and background. To this end, we employ a fusion scheme that aggregates multiple confidence estimates and per-detection features using a compact multilayer perceptron (MLP). Incorporating different logit variants into the MLP consistently enhances performance across both binary and three-class classification without compromising throughput. Extensive ablation and comparative experiments confirm that our method surpasses threshold-based baselines in two-class classification by an average of 2.7% AUROC, while retaining or improving open-set mAP. Furthermore, our study uniquely enables robust three-class classification, a critical capability for safe UAV navigation, where OOD objects must be actively avoided and background regions safely ignored. Comparative analysis highlights that our method surpasses competitive techniques in AUROC across datasets, while improving closed-set mAP by up to 9 points, an 18% relative gain.
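
A minimal sketch of the fusion idea follows: several confidence estimates are derived from a detection's class logits and passed to a small MLP that scores three outcomes. The particular features (maximum softmax probability, entropy, energy, maximum logit), the one-hidden-layer shape, and all names are assumptions; the abstract only states that multiple confidence estimates and per-detection features are fused by a compact MLP, and trained weights would have to be supplied by the caller.

```python
import math

LABELS = ["ID target", "OOD object", "background"]


def confidence_features(logits):
    """Return [max softmax prob, entropy, energy score, max logit] for one detection."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    probs = [math.exp(x - log_sum_exp) for x in logits]
    ent = -sum(p * math.log(p) for p in probs if p > 0)
    return [max(probs), ent, -log_sum_exp, m]


def mlp_forward(x, w1, b1, w2, b2):
    """One hidden ReLU layer followed by a linear output layer."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, x)) + b) for row, b in zip(w1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(w2, b2)]


def classify(logits, w1, b1, w2, b2):
    """Three-way decision from fused confidence features (weights come from training)."""
    scores = mlp_forward(confidence_features(logits), w1, b1, w2, b2)
    return LABELS[scores.index(max(scores))]


# Features for a maximally uncertain four-class detection.
uniform_features = confidence_features([0.0, 0.0, 0.0, 0.0])
```

Feeding several complementary scores to a learned classifier, rather than thresholding one of them, is what lets the third outcome (background) be separated from unknown objects.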

[124] IPTQ-ViT: Post-Training Quantization of Non-linear Functions for Integer-only Vision Transformers

Gihwan Kim, Jemin Lee, Hyungshin Kim

Main category: cs.CV

TL;DR: IPTQ-ViT is a novel Post-Training Quantization framework that enables fully integer-only inference for vision transformers without retraining, using optimized approximation functions for GELU and Softmax layers.

DetailsMotivation: Existing QAT methods require expensive retraining for non-linear layer quantization, while PTQ methods either partially quantize non-linear functions or fail to achieve fully integer-only inference, limiting deployment in resource-constrained environments.

Method: Proposes polynomial-based GELU approximation optimized for vision data and bit-shifting-based Softmax approximation. Uses unified metric considering quantization sensitivity, perturbation, and computational cost to select optimal approximation function per activation layer.

Result: Achieves up to 6.44%p (avg. 1.78%p) top-1 accuracy improvement for image classification, 1.0 mAP for object detection. Outperforms partial floating-point PTQ methods under W8A8 and W4A8, with accuracy and latency comparable to integer-only QAT methods.

Conclusion: IPTQ-ViT enables efficient fully integer-only inference for vision transformers without retraining, making it suitable for resource-constrained deployment while maintaining competitive performance.

Abstract: Previous Quantization-Aware Training (QAT) methods for vision transformers rely on expensive retraining to recover accuracy loss in non-linear layer quantization, limiting their use in resource-constrained environments. In contrast, existing Post-Training Quantization (PTQ) methods either partially quantize non-linear functions or adjust activation distributions to maintain accuracy but fail to achieve fully integer-only inference. In this paper, we introduce IPTQ-ViT, a novel PTQ framework for fully integer-only vision transformers without retraining. We present approximation functions: a polynomial-based GELU optimized for vision data and a bit-shifting-based Softmax designed to improve approximation accuracy in PTQ. In addition, we propose a unified metric integrating quantization sensitivity, perturbation, and computational cost to select the optimal approximation function per activation layer. IPTQ-ViT outperforms previous PTQ methods, achieving up to 6.44%p (avg. 1.78%p) top-1 accuracy improvement for image classification and 1.0 mAP for object detection. IPTQ-ViT outperforms partial floating-point PTQ methods under W8A8 and W4A8, and achieves accuracy and latency comparable to integer-only QAT methods. We plan to release our code at https://github.com/gihwan-kim/IPTQ-ViT.git.
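
To give a feel for what a bit-shifting Softmax can look like, here is a generic integer-only sketch. It is not the IPTQ-ViT approximation, whose details the abstract does not give: the fixed-point width, the linear approximation of the fractional power of two, and all names are assumptions. The idea is to rewrite exp(x) as 2^(x * log2 e), handle the integer part of the exponent with a right shift, and approximate the fractional part linearly.

```python
import math

FRAC_BITS = 10                      # fixed-point fractional bits (illustrative choice)
ONE = 1 << FRAC_BITS                # fixed-point representation of 1.0
LOG2E_FIXED = round(math.log2(math.e) * ONE)


def to_fixed(value):
    """Quantize a float to fixed point (used here only to prepare test inputs)."""
    return round(value * ONE)


def shift_exp(x_fixed):
    """Approximate exp(x) for x <= 0 using integer multiplies, masks, and shifts."""
    t = (-x_fixed * LOG2E_FIXED) >> FRAC_BITS   # t = -x * log2(e), non-negative
    q, r = t >> FRAC_BITS, t & (ONE - 1)        # integer and fractional parts of t
    mantissa = ONE - (r >> 1)                   # 2**(-r) ~ 1 - r/2 for r in [0, 1)
    return mantissa >> q                        # multiply by 2**(-q) via right shift


def shift_softmax(x_fixed):
    """Integer-only softmax over fixed-point logits; outputs are fixed point too."""
    m = max(x_fixed)
    exps = [shift_exp(v - m) for v in x_fixed]  # subtract the max so inputs are <= 0
    total = sum(exps)                           # at least ONE, so never zero
    return [(e << FRAC_BITS) // total for e in exps]


approx = [p / ONE for p in shift_softmax([to_fixed(v) for v in (1.0, 2.0, 3.0)])]
```

With these settings the approximation stays within a few hundredths of the floating-point softmax on this input; a production kernel would choose bit widths and the fractional approximation to match its accuracy budget.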

[125] Zero-Shot Open-Vocabulary Human Motion Grounding with Test-Time Training

Yunjiao Zhou, Xinyan Chen, Junlang Qian, Lihua Xie, Jianfei Yang

Main category: cs.CV

TL;DR: ZOMG is a zero-shot, open-vocabulary framework that segments motion sequences into semantic sub-actions without annotations or fine-tuning, using language semantic partition and soft masking optimization.

DetailsMotivation: Existing methods require dense supervision with predefined action classes, which is infeasible for open-vocabulary, real-world settings where motion needs to be decomposed into fine-grained, semantic-aligned sub-actions.

Method: Integrates language semantic partition (using LLMs to decompose instructions into sub-actions) and soft masking optimization (learning instance-specific temporal masks to focus on critical frames while maintaining segment continuity and separation), without altering pretrained encoders.

Result: Achieves state-of-the-art performance on three motion-language datasets, outperforming prior methods by +8.7% mAP on HumanML3D benchmark, with significant improvements in downstream retrieval tasks.

Conclusion: Establishes a new paradigm for annotation-free motion understanding, demonstrating effective zero-shot motion grounding without requiring annotations or fine-tuning.

Abstract: Understanding complex human activities demands the ability to decompose motion into fine-grained, semantically aligned sub-actions. This motion grounding process is crucial for behavior analysis, embodied AI, and virtual reality. Yet, most existing methods rely on dense supervision with predefined action classes, which is infeasible in open-vocabulary, real-world settings. In this paper, we propose ZOMG, a zero-shot, open-vocabulary framework that segments motion sequences into semantically meaningful sub-actions without requiring any annotations or fine-tuning. Technically, ZOMG integrates (1) language semantic partition, which leverages large language models to decompose instructions into ordered sub-action units, and (2) soft masking optimization, which learns instance-specific temporal masks to focus on frames critical to sub-actions, while maintaining intra-segment continuity and enforcing inter-segment separation, all without altering the pretrained encoder. Experiments on three motion-language datasets demonstrate state-of-the-art effectiveness and efficiency in motion grounding, outperforming prior methods by +8.7% mAP on the HumanML3D benchmark. Significant improvements also appear in downstream retrieval, establishing a new paradigm for annotation-free motion understanding.

[126] Breaking Expert Knowledge Limits: Self-Pruning for Large Language Models

Haidong Kang, Lihong Lin, Enneng Yang, Hongning Dai, Hao Wang

Main category: cs.CV

TL;DR: AutoPrune enables LLMs to automatically design their own pruning algorithms without expert knowledge, using Graph-driven Chain-of-Thought to optimize prompts and Skew-aware Dynamic Sparsity Allocation to address outlier value issues under high pruning ratios.

DetailsMotivation: Existing pruning methods for LLMs require manual algorithm design with huge labor costs and expert knowledge, and suffer from performance degradation under high pruning ratios due to outlier value issues caused by uniform sparsity.

Method: AutoPrune uses Graph-driven Chain-of-Thought (GCoT) to optimize prompts and enhance reasoning for learning pruning algorithms, and introduces Skew-aware Dynamic Sparsity Allocation (SDSA) to address outlier value issues by allocating sparsity adaptively.

Result: Extensive experiments on mainstream LLM benchmarks show AutoPrune consistently outperforms state-of-the-art competitors, demonstrating superior performance and interpretability.

Conclusion: AutoPrune successfully enables LLMs to prune themselves automatically without expert knowledge, overcoming limitations of manual pruning methods and addressing critical outlier value issues under high pruning ratios.

Abstract: Large language models (LLMs) have achieved remarkable performance on a wide range of tasks, yet their massive size hinders real-world deployment. Existing pruning methods tailored for LLMs (e.g., Wanda) rely heavily on manually designed pruning algorithms, thereby incurring huge labor costs and requiring expert knowledge. Furthermore, we are the first to identify the serious outlier value issue behind the dramatic performance degradation under high pruning ratios, which is caused by uniform sparsity, raising an additional concern about how to design adaptive pruning sparsity suited to LLMs. Can LLMs prune by themselves? In this work, we give an affirmative answer by proposing a novel pruning method called AutoPrune, which first overcomes expert knowledge limits by leveraging LLMs to design optimal pruning algorithms for themselves automatically without any expert knowledge. Specifically, to mitigate the black-box nature of LLMs, we propose a Graph-driven Chain-of-Thought (GCoT) to optimize prompts, significantly enhancing the reasoning process in learning the pruning algorithm and enabling us to generate pruning algorithms with superior performance and interpretability in the next generation. Finally, grounded in insights into the outlier value issue, we introduce Skew-aware Dynamic Sparsity Allocation (SDSA) to overcome it, mitigating performance degradation under high pruning ratios. We conduct extensive experiments on mainstream LLM benchmarks, demonstrating the superiority of AutoPrune, which consistently outperforms state-of-the-art competitors. The code is available at: https://anonymous.4open.science/r/AutoPrune.
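For context, the hand-designed baseline the abstract names (Wanda) can be sketched in a few lines. The block below is a minimal illustration of a Wanda-style importance score (weight magnitude times the L2 norm of the matching input feature) applied with the same sparsity in every row, which is the uniform allocation the abstract identifies as problematic. It is not AutoPrune, GCoT, or SDSA; the function names and toy numbers are assumptions made for illustration.

```python
import math

def wanda_scores(weight, activations):
    """Score each weight as |W[i][j]| * ||X[:, j]||_2 (a Wanda-style metric)."""
    n_in = len(weight[0])
    # L2 norm of each input feature over the calibration samples.
    norms = [math.sqrt(sum(row[j] ** 2 for row in activations)) for j in range(n_in)]
    return [[abs(w) * norms[j] for j, w in enumerate(row)] for row in weight]

def prune_uniform(weight, activations, sparsity):
    """Zero the lowest-scoring fraction of weights in every output row."""
    scores = wanda_scores(weight, activations)
    pruned = []
    for row, row_scores in zip(weight, scores):
        k = int(len(row) * sparsity)  # number of weights removed in this row
        drop = set(sorted(range(len(row)), key=lambda j: row_scores[j])[:k])
        pruned.append([0.0 if j in drop else w for j, w in enumerate(row)])
    return pruned

# Toy layer: 2 outputs, 4 inputs; input feature 1 has outlier-sized activations.
weight = [[0.9, -0.1, 0.4, 0.05],
          [0.2, 0.8, -0.3, 0.6]]
activations = [[1.0, 10.0, 0.5, 1.0],
               [1.0, 12.0, 0.5, 1.0]]
pruned = prune_uniform(weight, activations, sparsity=0.5)
```

In the toy layer, the small weight -0.1 survives in the first row because its input feature carries large activations. Every row still loses exactly half of its weights, regardless of how its scores are distributed.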

[127] ShelfOcc: Native 3D Supervision beyond LiDAR for Vision-Based Occupancy Estimation

Simon Boeder, Fabian Gigengack, Simon Roesler, Holger Caesar, Benjamin Risse

Main category: cs.CV

TL;DR: ShelfOcc introduces a vision-only method for 3D occupancy estimation that generates metrically consistent semantic voxel labels from video, overcoming limitations of 2D projection-based supervision without requiring LiDAR.

DetailsMotivation: To address geometric inconsistencies and depth bleeding in existing self- and weakly supervised occupancy estimation methods that rely on 2D projection or rendering-based supervision, without depending on LiDAR sensors.

Method: Generates metrically consistent semantic voxel labels from video by filtering and accumulating static geometry across frames, handling dynamic content, and propagating semantic information into stable voxel representations using vision-based 3D geometry foundation models.

Result: Substantially outperforms all previous weakly/shelf-supervised methods on the Occ3D-nuScenes benchmark with up to 34% relative improvement, establishing a new state of the art for LiDAR-free 3D scene understanding.

Conclusion: High-quality 3D supervision is essential for robust occupancy learning and represents a complementary approach to architectural innovation, enabling use of any SOTA occupancy model architecture without LiDAR data.

Abstract: Recent progress in self- and weakly supervised occupancy estimation has largely relied on 2D projection or rendering-based supervision, which suffers from geometric inconsistencies and severe depth bleeding. We thus introduce ShelfOcc, a vision-only method that overcomes these limitations without relying on LiDAR. ShelfOcc brings supervision into native 3D space by generating metrically consistent semantic voxel labels from video, enabling true 3D supervision without any additional sensors or manual 3D annotations. While recent vision-based 3D geometry foundation models provide a promising source of prior knowledge, they do not work out of the box as a prediction due to sparse or noisy and inconsistent geometry, especially in dynamic driving scenes. Our method introduces a dedicated framework that mitigates these issues by filtering and accumulating static geometry consistently across frames, handling dynamic content and propagating semantic information into a stable voxel representation. This data-centric shift in supervision for weakly/shelf-supervised occupancy estimation allows the use of essentially any SOTA occupancy model architecture without relying on LiDAR data. We argue that such high-quality supervision is essential for robust occupancy learning and constitutes an important complementary avenue to architectural innovation. On the Occ3D-nuScenes benchmark, ShelfOcc substantially outperforms all previous weakly/shelf-supervised methods (up to a 34% relative improvement), establishing a new data-driven direction for LiDAR-free 3D scene understanding.
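The accumulate-then-label idea can be illustrated with a toy voxelizer. The sketch below assumes points are already expressed in a common world frame and carry a per-point semantic label. It keeps a voxel only if enough frames observed it (a crude stand-in for the static-geometry filter) and labels it by majority vote. The function name, the thresholds, and the filtering rule are assumptions; the paper's actual pipeline is not specified at this level in the abstract.

```python
from collections import Counter, defaultdict

def accumulate_voxels(frames, voxel_size=0.5, min_frames=2):
    """Merge labeled 3D points from several frames into a sparse voxel map."""
    votes = defaultdict(Counter)   # voxel index -> label counts
    frames_seen = Counter()        # voxel index -> number of frames that hit it
    for points in frames:
        hit = set()
        for x, y, z, label in points:
            key = (int(x // voxel_size), int(y // voxel_size), int(z // voxel_size))
            votes[key][label] += 1
            hit.add(key)
        frames_seen.update(hit)    # count each voxel once per frame
    # Keep voxels observed in enough frames; label them by majority vote.
    return {key: counts.most_common(1)[0][0]
            for key, counts in votes.items() if frames_seen[key] >= min_frames}

frames = [
    [(0.1, 0.1, 0.1, "road"), (3.0, 0.0, 0.0, "car")],  # the car is seen once only
    [(0.2, 0.2, 0.2, "road")],
    [(0.3, 0.1, 0.4, "sidewalk")],
]
voxels = accumulate_voxels(frames)
```

Here the voxel at the origin is observed in three frames and is labeled "road" by two votes to one, while the voxel hit only by the single "car" point is dropped.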

[128] Controlling False Positives in Image Segmentation via Conformal Prediction

Luca Mossina, Corentin Friedrich

Main category: cs.CV

TL;DR: A post-hoc framework for reliable semantic segmentation that provides statistical guarantees on false-positive predictions using conformal prediction.

DetailsMotivation: Deep segmentation models lack explicit statistical guarantees on errors, which is critical for clinical decision making where over-segmentation can have serious consequences.

Method: Constructs a nested family of shrunken masks via score thresholding or morphological erosion, then uses conformal prediction on a calibration set to select a shrink parameter that controls the false-positive rate.

Result: Achieves target-level empirical validity on polyp-segmentation benchmark, providing finite-sample guarantees without model retraining.

Conclusion: Enables practical, risk-aware segmentation in clinical settings with guaranteed control over false positives.

Abstract: Reliable semantic segmentation is essential for clinical decision making, yet deep models rarely provide explicit statistical guarantees on their errors. We introduce a simple post-hoc framework that constructs confidence masks with distribution-free, image-level control of false-positive predictions. Given any pretrained segmentation model, we define a nested family of shrunken masks obtained either by increasing the score threshold or by applying morphological erosion. A labeled calibration set is used to select a single shrink parameter via conformal prediction, ensuring that, for new images that are exchangeable with the calibration data, the proportion of false positives retained in the confidence mask stays below a user-specified tolerance with high probability. The method is model-agnostic, requires no retraining, and provides finite-sample guarantees regardless of the underlying predictor. Experiments on a polyp-segmentation benchmark demonstrate target-level empirical validity. Our framework enables practical, risk-aware segmentation in settings where over-segmentation can have clinical consequences. Code at https://github.com/deel-ai-papers/conseco.
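The calibration step can be sketched with plain split conformal prediction over score thresholds. The sketch makes several assumptions: images are flattened to lists of per-pixel scores with binary ground truth, the nested family is built by thresholding over a finite grid, and each calibration image contributes the smallest threshold above which its false-positive proportion always stays within the tolerance. All function names are hypothetical, and the paper's exact procedure may differ.

```python
import math

def fp_proportion(scores, truth, threshold):
    """Fraction of pixels kept at this threshold that are not in the ground truth."""
    kept = [t for s, t in zip(scores, truth) if s >= threshold]
    if not kept:
        return 0.0  # an empty mask contains no false positives
    return sum(1 for t in kept if t == 0) / len(kept)

def min_safe_threshold(scores, truth, grid, tolerance):
    """Smallest grid threshold such that it and every larger one meet the tolerance."""
    safe = math.inf  # inf means only the empty mask is safe
    for thr in sorted(grid, reverse=True):
        if fp_proportion(scores, truth, thr) > tolerance:
            break
        safe = thr
    return safe

def calibrate(calibration, grid, tolerance, alpha):
    """Conformal quantile of the per-image safe thresholds."""
    lambdas = sorted(min_safe_threshold(s, t, grid, tolerance) for s, t in calibration)
    n = len(lambdas)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed order statistic
    return lambdas[rank - 1] if rank <= n else math.inf

grid = [0.1, 0.3, 0.5, 0.7]
calibration = [
    ([0.9, 0.8, 0.4, 0.2], [1, 1, 0, 0]),
    ([0.95, 0.6, 0.55, 0.1], [1, 0, 1, 0]),
    ([0.8, 0.75, 0.35, 0.2], [1, 1, 1, 0]),
]
lam = calibrate(calibration, grid, tolerance=0.25, alpha=0.25)
```

Under exchangeability, a new image's own safe threshold is at most `lam` with probability at least 1 - alpha, so thresholding at `lam` keeps its false-positive proportion within the tolerance at that confidence level. When there are too few calibration images for the requested alpha, the sketch returns infinity, which corresponds to the empty mask.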

[129] D4C: Data-free Quantization for Contrastive Language-Image Pre-training Models

Wenlun Zhang, Yunshan Zhong, Zihao Ding, Xinyu Li, Kentaro Yoshioka

Main category: cs.CV

TL;DR: D4C is a novel Data-Free Quantization framework for CLIP models that addresses performance degradation by generating semantically rich and structurally diverse pseudo images through prompt-guided semantic injection, structural contrastive generation, and perturbation-aware enhancement.

DetailsMotivation: Existing Data-Free Quantization techniques perform poorly on Vision-Language Models like CLIP due to insufficient semantic content and low intra-image diversity in synthesized samples, creating a need for specialized DFQ methods for multimodal models.

Method: D4C uses three key components: (1) Prompt-Guided Semantic Injection for aligning images with real-world semantics, (2) Structural Contrastive Generation for reproducing natural image composition using foreground-background contrast, and (3) Perturbation-Aware Enhancement for improving diversity and robustness through controlled perturbations.

Result: D4C achieves significant performance improvements across various bit-widths and models. For the W4A8 setting with CLIP ResNet-50 and ViT-B/32: 12.4% and 18.9% Top-1 accuracy improvement on CIFAR-10, 6.8% and 19.7% on CIFAR-100, and 1.4% and 5.7% on ImageNet-1K in zero-shot classification.

Conclusion: D4C effectively bridges the performance gap of Data-Free Quantization on CLIP models by synthesizing images that are both semantically informative and structurally diverse, making it a practical solution for model compression in privacy-sensitive scenarios without requiring real data access.

Abstract: Data-Free Quantization (DFQ) offers a practical solution for model compression without requiring access to real data, making it particularly attractive in privacy-sensitive scenarios. While DFQ has shown promise for unimodal models, its extension to Vision-Language Models such as Contrastive Language-Image Pre-training (CLIP) models remains underexplored. In this work, we reveal that directly applying existing DFQ techniques to CLIP results in substantial performance degradation due to two key limitations: insufficient semantic content and low intra-image diversity in synthesized samples. To tackle these challenges, we propose D4C, the first DFQ framework tailored for CLIP. D4C synthesizes semantically rich and structurally diverse pseudo images through three key components: (1) Prompt-Guided Semantic Injection aligns generated images with real-world semantics using text prompts; (2) Structural Contrastive Generation reproduces compositional structures of natural images by leveraging foreground-background contrastive synthesis; and (3) Perturbation-Aware Enhancement applies controlled perturbations to improve sample diversity and robustness. These components jointly empower D4C to synthesize images that are both semantically informative and structurally diverse, effectively bridging the performance gap of DFQ on CLIP. Extensive experiments validate the effectiveness of D4C, showing significant performance improvements on various bit-widths and models. For example, under the W4A8 setting with CLIP ResNet-50 and ViT-B/32, D4C achieves Top-1 accuracy improvement of 12.4% and 18.9% on CIFAR-10, 6.8% and 19.7% on CIFAR-100, and 1.4% and 5.7% on ImageNet-1K in zero-shot classification, respectively.
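The W4A8 notation denotes 4-bit weights and 8-bit activations. The sketch below shows one simple way to simulate such bit-widths, namely symmetric uniform quantization with a per-tensor scale. It only illustrates what the setting means: D4C's contribution is the synthetic calibration images, and the quantizer used in the paper may differ. The function name and the sample values are assumptions.

```python
def quantize_symmetric(values, bits):
    """Round values to a signed uniform grid with 2**bits levels, then dequantize."""
    qmax = 2 ** (bits - 1) - 1                 # 7 for 4 bits, 127 for 8 bits
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak > 0 else 1.0   # one scale for the whole tensor
    # Clamp to the signed integer range, then map back to floats.
    return [max(-qmax - 1, min(qmax, round(v / scale))) * scale for v in values]

def max_error(original, quantized):
    return max(abs(a - b) for a, b in zip(original, quantized))

tensor = [0.7, -0.33, 0.12, 0.0]
w4 = quantize_symmetric(tensor, bits=4)   # weight-style precision
a8 = quantize_symmetric(tensor, bits=8)   # activation-style precision
err4, err8 = max_error(tensor, w4), max_error(tensor, a8)
```

The 4-bit grid has at most 16 levels, so its rounding error is much larger than the 8-bit one. This is why calibration data quality matters most at low weight precision.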

[130] WarNav: An Autonomous Driving Benchmark for Segmentation of Navigable Zones in War Scenes

Marc-Emmanuel Coupvent des Graviers, Hejer Ammar, Christophe Guettier, Yann Dumortier, Romaric Audigier

Main category: cs.CV

TL;DR: WarNav is a real-world dataset for semantic segmentation in conflict zones, addressing gaps between urban driving datasets and war-zone navigation needs for autonomous vehicles.

DetailsMotivation: To bridge the gap between conventional urban driving datasets and the unique challenges of autonomous navigation in unstructured, conflict-affected environments where unmanned systems operate.

Method: Created dataset from DATTALION repository images, established baseline results using state-of-the-art semantic segmentation models, and analyzed training data environment impacts with focus on annotation-free approaches.

Result: Provides baseline performance references for semantic segmentation in war-zone environments and demonstrates the challenges of transferring models from structured urban scenes to unstructured conflict zones.

Conclusion: WarNav enables development of robust autonomous navigation systems for high-risk scenarios while being data-efficient, fostering research that enhances safety in conflict environments.

Abstract: We introduce WarNav, a novel real-world dataset constructed from images of the open-source DATTALION repository, specifically tailored to enable the development and benchmarking of semantic segmentation models for autonomous ground vehicle navigation in unstructured, conflict-affected environments. This dataset addresses a critical gap between conventional urban driving resources and the unique operational scenarios encountered by unmanned systems in hazardous and damaged war-zones. We detail the methodological challenges encountered, ranging from data heterogeneity to ethical considerations, providing guidance for future efforts that target extreme operational contexts. To establish performance references, we report baseline results on WarNav using several state-of-the-art semantic segmentation models trained on structured urban scenes. We further analyse the impact of training data environments and propose a first step towards effective navigability in challenging environments with the constraint of having no annotation of the targeted images. Our goal is to foster impactful research that enhances the robustness and safety of autonomous vehicles in high-risk scenarios while being frugal in annotated data.

[131] Representation Space Constrained Learning with Modality Decoupling for Multimodal Object Detection

YiKang Shao, Tao Shi

Main category: cs.CV

TL;DR: This paper provides a theoretical analysis of fusion degradation in multimodal object detection and proposes the RSC-MD method to address gradient suppression and modality imbalance issues.

DetailsMotivation: Most multimodal detection studies focus on fusion strategies but neglect fusion degradation and lack theoretical analysis of its causes, creating a research gap.

Method: Proposes the Representation Space Constrained Learning with Modality Decoupling (RSC-MD) method, which has two modules: RSC to amplify suppressed gradients and MD to eliminate inter-modality coupling and imbalance.

Result: Extensive experiments on FLIR, LLVIP, M3FD, and MFAD datasets show the method effectively alleviates fusion degradation and achieves state-of-the-art performance across multiple benchmarks.

Conclusion: The proposed RSC-MD method successfully addresses fusion degradation by tackling gradient suppression and modality imbalance, enabling comprehensive optimization of each modality-specific backbone.

Abstract: Multimodal object detection has attracted significant attention in both academia and industry for its enhanced robustness. Although numerous studies have focused on improving modality fusion strategies, most neglect fusion degradation, and none provide a theoretical analysis of its underlying causes. To fill this gap, this paper presents a systematic theoretical investigation of fusion degradation in multimodal detection and identifies two key optimization deficiencies: (1) the gradients of unimodal branch backbones are severely suppressed under multimodal architectures, resulting in under-optimization of the unimodal branches; (2) disparities in modality quality cause weaker modalities to experience stronger gradient suppression, which in turn results in imbalanced modality learning. To address these issues, this paper proposes a Representation Space Constrained Learning with Modality Decoupling (RSC-MD) method, which consists of two modules. The RSC module and the MD module are designed to respectively amplify the suppressed gradients and eliminate inter-modality coupling interference as well as modality imbalance, thereby enabling the comprehensive optimization of each modality-specific backbone. Extensive experiments conducted on the FLIR, LLVIP, M3FD, and MFAD datasets demonstrate that the proposed method effectively alleviates fusion degradation and achieves state-of-the-art performance across multiple benchmarks. The code and training procedures will be released at https://github.com/yikangshao/RSC-MD.

[132] HV-Attack: Hierarchical Visual Attack for Multimodal Retrieval Augmented Generation

Linyin Luo, Yujuan Ding, Yunshan Ma, Wenqi Fan, Hanjiang Lai

Main category: cs.CV

TL;DR: Proposes a hierarchical visual attack method that disrupts MRAG systems by adding imperceptible perturbations to image inputs, causing misalignment between multimodal queries and augmented knowledge to confuse generation.

DetailsMotivation: Existing research focuses on knowledge poisoning attacks in MRAG systems, but this work explores a different vulnerability: visual attacks through imperceptible image perturbations without manipulating other components, addressing the challenge of attacking robust retrievers and generators.

Method: Hierarchical Visual Attack that misaligns multimodal query and augmented knowledge inputs to the generator. Uses a two-stage strategy: first breaks cross-modal alignment, then disrupts multimodal semantic alignment to make the retriever recall irrelevant knowledge from the original database.

Result: Extensive experiments on OK-VQA and InfoSeek datasets with CLIP-based retrievers and BLIP-2/LLaVA generators show significant decrease in both retrieval and generation performance, demonstrating attack effectiveness.

Conclusion: MRAG systems are vulnerable to visual attacks through imperceptible image perturbations, which can effectively disrupt both retrieval and generation performance by creating misalignment between multimodal inputs.

Abstract: Advanced multimodal Retrieval-Augmented Generation (MRAG) techniques have been widely applied to enhance the capabilities of Large Multimodal Models (LMMs), but they also bring along novel safety issues. Existing adversarial research has revealed the vulnerability of MRAG systems to knowledge poisoning attacks, which fool the retriever into recalling injected poisoned contents. However, our work considers a different setting: visual attack of MRAG by solely adding imperceptible perturbations at the image inputs of users, without manipulating any other components. This is challenging due to the robustness of fine-tuned retrievers and large-scale generators, and the effect of visual perturbation may be further weakened by propagation through the RAG chain. We propose a novel Hierarchical Visual Attack that misaligns and disrupts the two inputs (the multimodal query and the augmented knowledge) of MRAG’s generator to confuse its generation. We further design a hierarchical two-stage strategy to obtain misaligned augmented knowledge. We disrupt the image input of the retriever to make it recall irrelevant knowledge from the original database, by optimizing the perturbation which first breaks the cross-modal alignment and then disrupts the multimodal semantic alignment. We conduct extensive experiments on two widely-used MRAG datasets: OK-VQA and InfoSeek. We use CLIP-based retrievers and two LMMs BLIP-2 and LLaVA as generators. Results demonstrate the effectiveness of our visual attack on MRAG through the significant decrease in both retrieval and generation performance.

[133] A Dataset and Baseline for Deep Learning-Based Visual Quality Inspection in Remanufacturing

Johannes C. Bauer, Paul Geng, Stephan Trattnig, Petr Dokládal, Rüdiger Daub

Main category: cs.CV

TL;DR: Proposes a new image dataset for gearbox component inspection and a contrastive regularization loss to improve generalization in remanufacturing quality control.

DetailsMotivation: Manual quality inspection in remanufacturing is inefficient due to high part variety and defect patterns; deep learning struggles with generalization to new components and defects.

Method: Created a novel image dataset of gearbox components from two automotive transmissions, established different train-test splits to simulate distribution shifts, and proposed contrastive regularization loss for better generalization.

Result: The contrastive regularization loss demonstrated improved model generalization to unseen component types in the evaluation.

Conclusion: The proposed approach and dataset effectively address generalization challenges in automated visual inspection for remanufacturing processes.

Abstract: Remanufacturing describes a process where worn products are restored to like-new condition and it offers vast ecological and economic potentials. A key step is the quality inspection of disassembled components, which is mostly done manually due to the high variety of parts and defect patterns. Deep neural networks show great potential to automate such visual inspection tasks but struggle to generalize to new product variants, components, or defect patterns. To tackle this challenge, we propose a novel image dataset depicting typical gearbox components in good and defective condition from two automotive transmissions. Depending on the train-test split of the data, different distribution shifts are generated to benchmark the generalization ability of a classification model. We evaluate different models using the dataset and propose a contrastive regularization loss to enhance model robustness. The results obtained demonstrate the ability of the loss to improve generalisation to unseen types of components.

[134] Driving in Spikes: An Entropy-Guided Object Detector for Spike Cameras

Ziyan Liu, Qi Su, Lulu Tang, Zhaofei Yu, Tiejun Huang

Main category: cs.CV

TL;DR: EASD is an end-to-end spike camera detector for autonomous driving that handles motion blur and lighting extremes using dual branches for global semantics and object details, with a new benchmark DSEC Spike.

DetailsMotivation: Object detection in autonomous driving faces challenges from motion blur and saturation under fast motion and extreme lighting conditions. Spike cameras offer microsecond latency and ultra high dynamic range but their sparse, discrete output cannot be processed by standard image-based detectors.

Method: Proposed EASD with dual branch design: Temporal Based Texture plus Feature Fusion branch for global cross slice semantics, and Entropy Selective Attention branch for object-centric details. Also introduced DSEC Spike benchmark to address data gap.

Result: The abstract reports no quantitative results; the stated contributions are the EASD detector for end-to-end spike stream detection and the DSEC Spike benchmark.

Conclusion: EASD enables effective object detection using spike cameras in autonomous driving scenarios with motion blur and extreme lighting conditions through its dual branch architecture and the new DSEC Spike benchmark.

Abstract: Object detection in autonomous driving suffers from motion blur and saturation under fast motion and extreme lighting. Spike cameras offer microsecond latency and ultra-high dynamic range for object detection by using per-pixel asynchronous integrate-and-fire. However, their sparse, discrete output cannot be processed by standard image-based detectors, posing a critical challenge for end-to-end spike stream detection. We propose EASD, an end-to-end spike camera detector with a dual-branch design: a Temporal Based Texture plus Feature Fusion branch for global cross-slice semantics, and an Entropy Selective Attention branch for object-centric details. To close the data gap, we introduce DSEC Spike, the first driving-oriented simulated spike detection benchmark.

[135] SIGMMA: Hierarchical Graph-Based Multi-Scale Multi-modal Contrastive Alignment of Histopathology Image and Spatial Transcriptome

Dabin Jeong, Amirhossein Vahidi, Ciro Ramírez-Suástegui, Marie Moullet, Kevin Ly, Mohammad Vali Sanian, Sebastian Birk, Yinshui Chang, Adam Boxall, Daniyal Jafree, Lloyd Steele, Vijaya Baskar MS, Muzlifah Haniffa, Mohammad Lotfollahi

Main category: cs.CV

TL;DR: Sigmma is a multi-modal contrastive alignment framework that learns hierarchical representations of HE images and spatial transcriptome profiles across multiple scales, improving gene-expression prediction and cross-modal retrieval.

DetailsMotivation: Existing approaches align HE tiles with ST profiles at a single scale, overlooking fine-grained cellular structures and their spatial organization, which limits the ability to capture detailed cell-cell interactions in tissue microenvironments.

Method: Proposes multi-scale contrastive alignment to ensure coherent representations across modalities at different scales, and represents cell interactions as a graph integrating inter- and intra-subgraph relationships to capture cell-cell interactions from fine to coarse levels.

Result: Achieves avg. 9.78% improvement in gene-expression prediction task and avg. 26.93% improvement in cross-modal retrieval task across datasets, and learns meaningful multi-tissue organization in downstream analyses.

Conclusion: Sigmma effectively captures hierarchical tissue structures and cell interactions across multiple scales, demonstrating superior performance in computational pathology tasks compared to single-scale alignment approaches.

Abstract: Recent advances in computational pathology have leveraged vision-language models to learn joint representations of Hematoxylin and Eosin (HE) images with spatial transcriptomic (ST) profiles. However, existing approaches typically align HE tiles with their corresponding ST profiles at a single scale, overlooking fine-grained cellular structures and their spatial organization. To address this, we propose Sigmma, a multi-modal contrastive alignment framework for learning hierarchical representations of HE images and spatial transcriptome profiles across multiple scales. Sigmma introduces multi-scale contrastive alignment, ensuring that representations learned at different scales remain coherent across modalities. Furthermore, by representing cell interactions as a graph and integrating inter- and intra-subgraph relationships, our approach effectively captures cell-cell interactions, ranging from fine to coarse, within the tissue microenvironment. We demonstrate that Sigmma learns representations that better capture cross-modal correspondences, leading to an improvement of avg. 9.78% in the gene-expression prediction task and avg. 26.93% in the cross-modal retrieval task across datasets. We further show that it learns meaningful multi-tissue organization in downstream analyses.
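A multi-scale contrastive objective of this general kind can be sketched as an average of symmetric InfoNCE losses, one per scale. The abstract does not give the exact loss, so InfoNCE is assumed here as a standard choice. The scale names, the temperature, and the function names are hypothetical, and the graph component is omitted.

```python
import math

def info_nce(image_emb, gene_emb, temperature=0.1):
    """Symmetric InfoNCE over paired embeddings (row i of each list is a pair)."""
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / norm for x in v]

    a = [normalize(v) for v in image_emb]
    b = [normalize(v) for v in gene_emb]
    sim = [[sum(x * y for x, y in zip(u, v)) / temperature for v in b] for u in a]

    def cross_entropy(rows):
        # Row i should put most of its probability on column i, its true pair.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(s - m) for s in row))
            total += log_z - row[i]
        return total / len(rows)

    transposed = [list(col) for col in zip(*sim)]
    return 0.5 * (cross_entropy(sim) + cross_entropy(transposed))

def multi_scale_loss(pairs_by_scale):
    """Average the contrastive loss over all scales (e.g. cell level, tile level)."""
    losses = [info_nce(img, gene) for img, gene in pairs_by_scale.values()]
    return sum(losses) / len(losses)

aligned = {"cell": ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]),
           "tile": ([[2.0, 0.0], [0.0, 3.0]], [[1.0, 0.0], [0.0, 1.0]])}
shuffled = {"cell": ([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]),
            "tile": ([[2.0, 0.0], [0.0, 3.0]], [[0.0, 1.0], [1.0, 0.0]])}
loss_aligned = multi_scale_loss(aligned)
loss_shuffled = multi_scale_loss(shuffled)
```

Correctly paired embeddings give a loss near zero at every scale, while mismatched pairs are penalized heavily. That gap is the signal such an alignment objective trains on.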

[136] Deep Learning for Accurate Vision-based Catch Composition in Tropical Tuna Purse Seiners

Xabier Lekunberri, Ahmad Kamal, Izaro Goienetxea, Jon Ruiz, Iñaki Quincoces, Jaime Valls Miro, Ignacio Arganda-Carreras, Jose A. Fernandes-Salvador

Main category: cs.CV

TL;DR: AI system for tuna species identification from electronic monitoring videos, using segmentation and hierarchical classification to distinguish bigeye vs yellowfin tuna with 84.8% success rate and 4.5% error.

DetailsMotivation: Purse seiners catch 69% of tropical tuna, creating massive video data from electronic monitoring that requires AI assistance for species identification, particularly challenging for distinguishing bigeye from yellowfin tuna.

Method: Multi-stage pipeline with three segmentation approaches (Mask R-CNN, DINOv2+SAM2, YOLOv9+SAM2), ByteTrack for tracking, and hierarchical classification vs standard multiclass, validated on known catch composition data.

Result: YOLOv9-SAM2 performed best with 0.66 mAP and 0.88 recall. Hierarchical classification showed superior generalization. Combined approach achieved 84.8% segmentation/classification success with 4.5% mean average error.

Conclusion: The integrated YOLOv9-SAM2 segmentation with hierarchical classification provides effective automated species identification for tuna fisheries monitoring, addressing expert identification challenges and reducing human workload.

Abstract: Purse seiners play a crucial role in tuna fishing, as approximately 69% of the world’s tropical tuna is caught using this gear. All tuna Regional Fisheries Management Organizations have established minimum standards to use electronic monitoring (EM) in fisheries in addition to traditional observers. The EM systems produce a massive amount of video data that human analysts must process. Integrating artificial intelligence (AI) into their workflow can decrease that workload and improve the accuracy of the reports. However, species identification still poses significant challenges for AI, as achieving balanced performance across all species requires appropriate training data. Here, we quantify the difficulty experts face in distinguishing bigeye tuna (BET, Thunnus obesus) from yellowfin tuna (YFT, Thunnus albacares) using images captured by EM systems. We found inter-expert agreements of 42.9% ± 35.6% for BET and 57.1% ± 35.6% for YFT. We then present a multi-stage pipeline to estimate the species composition of the catches using a reliable ground-truth dataset based on identifications made by observers on board. Three segmentation approaches are compared: Mask R-CNN, a combination of DINOv2 with SAM2, and an integration of YOLOv9 with SAM2. We found that the last performs best, with a validation mean average precision of 0.66 ± 0.03 and a recall of 0.88 ± 0.03. Segmented individuals are tracked using ByteTrack. For classification, we evaluate a standard multiclass classification model and a hierarchical approach, finding that the hierarchical approach generalizes better. All our models were cross-validated during training and tested on fishing operations with fully known catch composition. Combining YOLOv9-SAM2 with the hierarchical classification produced the best estimations, with 84.8% of the individuals being segmented and classified with a mean average error of 4.5%.

[137] RS-CA-HSICT: A Residual and Spatial Channel Augmented CNN Transformer Framework for Monkeypox Detection

Rashid Iqbal, Saddam Hussain Khan

Main category: cs.CV

TL;DR: A hybrid CNN-Transformer architecture called RS-CA-HSICT that combines CNN and Transformer strengths for enhanced MPox detection, achieving 98.30% accuracy and 98.13% F1-score.

DetailsMotivation: To leverage the complementary strengths of CNNs (local feature extraction) and Transformers (global contextual modeling) for improved MPox detection by capturing both detailed lesion information and long-range dependencies.

Method: Proposes RS-CA-HSICT framework with HSICT block integrating CNN and Transformer, residual CNN module, spatial CNN block, and channel augmentation. Uses multihead attention, structured CNN layers, inverse residual learning, and spatial attention mechanisms for feature refinement.

Result: Achieved state-of-the-art performance with 98.30% classification accuracy and a 98.13% F1-score on both the Kaggle benchmark and a diverse MPox dataset, outperforming existing CNNs and Vision Transformers.

Conclusion: The hybrid CNN-Transformer approach effectively captures multi-scale features, global-local dependencies, and subtle texture variations for superior MPox detection compared to standalone CNN or Transformer models.

Abstract: This work proposes a hybrid deep learning approach, namely Residual and Spatial Learning based Channel Augmented Integrated CNN-Transformer architecture, that leverages the strengths of CNN and Transformer towards enhanced MPox detection. The proposed RS-CA-HSICT framework is composed of an HSICT block, a residual CNN module, a spatial CNN block, and channel augmentation (CA), which together enhance the diverse feature space, detailed lesion information, and long-range dependencies. The new HSICT module first integrates an abstract representation of the stem CNN and customized ICT blocks for efficient multihead attention and structured CNN layers with homogeneous (H) and structural (S) operations. The customized ICT blocks learn global contextual interactions and local texture extraction. Additionally, H and S layers learn spatial homogeneity and fine structural details by reducing noise and modeling complex morphological variations. Moreover, inverse residual learning mitigates the vanishing-gradient problem, and stage-wise resolution reduction ensures scale invariance. Furthermore, the RS-CA-HSICT framework augments the learned HSICT channels with the TL-driven Residual and Spatial CNN maps for an enhanced multiscale feature space capturing global and localized structural cues, subtle texture, and contrast variations. These channels, preceding augmentation, are refined through the Channel-Fusion-and-Attention block, which preserves discriminative channels while suppressing redundant ones, thereby enabling efficient computation. Finally, the spatial attention mechanism refines pixel selection to detect subtle patterns and intra-class contrast variations in MPox. Experimental results on both the Kaggle benchmark and a diverse MPox dataset show classification accuracy as high as 98.30% and an F1-score of 98.13%, outperforming existing CNNs and ViTs.

[138] FunnyNodules: A Customizable Medical Dataset Tailored for Evaluating Explainable AI

Luisa Gallée, Yiheng Xiong, Meinrad Beer, Michael Götz

Main category: cs.CV

TL;DR: FunnyNodules is a synthetic dataset for evaluating explainable AI in medical imaging, featuring controllable lung nodule attributes with full ground truth for systematic analysis of attribute-based reasoning.

Motivation: There is a scarcity of densely annotated medical image datasets that capture diagnostic reasoning, which is essential for developing AI models that reason like radiologists and make correct predictions for the right reasons.

Method: Created a fully parameterized synthetic dataset generating abstract lung nodule-like shapes with controllable visual attributes (roundness, margin sharpness, spiculation), where target class is derived from predefined attribute combinations, allowing full control over decision rules.

Result: The dataset enables model-agnostic evaluations to assess whether models learn correct attribute-target relations, interpret attribute prediction performance, and analyze attention alignment with attribute-specific regions of interest.

Conclusion: FunnyNodules provides a versatile foundation with complete ground truth for developing, benchmarking, and conducting in-depth analyses of explainable AI methods in medical image analysis.

Abstract: Densely annotated medical image datasets that capture not only diagnostic labels but also the underlying reasoning behind these diagnoses are scarce. Such reasoning-related annotations are essential for developing and evaluating explainable AI (xAI) models that reason similarly to radiologists: making correct predictions for the right reasons. To address this gap, we introduce FunnyNodules, a fully parameterized synthetic dataset designed for systematic analysis of attribute-based reasoning in medical AI models. The dataset generates abstract, lung nodule-like shapes with controllable visual attributes such as roundness, margin sharpness, and spiculation. Target class is derived from a predefined attribute combination, allowing full control over the decision rule that links attributes to the diagnostic class. We demonstrate how FunnyNodules can be used in model-agnostic evaluations to assess whether models learn correct attribute-target relations, to interpret over- or underperformance in attribute prediction, and to analyze attention alignment with attribute-specific regions of interest. The framework is fully customizable, supporting variations in dataset complexity, target definitions, class balance, and beyond. With complete ground truth information, FunnyNodules provides a versatile foundation for developing, benchmarking, and conducting in-depth analyses of explainable AI methods in medical image analysis.
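As a rough illustration of how a target class can follow from a predefined attribute combination, the sketch below samples three attributes and applies a hand-written decision rule. The `Nodule` class, the [0, 1] attribute scales, the thresholds, and the rule itself are hypothetical and are not taken from the FunnyNodules code.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Nodule:
    # Hypothetical attribute scales in [0, 1]; the real dataset may differ.
    roundness: float
    margin_sharpness: float
    spiculation: float


def sample_nodule(rng: random.Random) -> Nodule:
    """Draw each attribute independently and uniformly."""
    return Nodule(rng.random(), rng.random(), rng.random())


def target_class(n: Nodule) -> int:
    """Illustrative rule: class 1 if strongly spiculated, or if the shape is
    both irregular and blurry; class 0 otherwise."""
    if n.spiculation > 0.7:
        return 1
    if n.roundness < 0.4 and n.margin_sharpness < 0.4:
        return 1
    return 0


rng = random.Random(0)  # fixed seed so the sample is reproducible
dataset = [(n, target_class(n)) for n in (sample_nodule(rng) for _ in range(5))]
for nodule, label in dataset:
    print(nodule, label)
```

Because the label is a known function of the attributes, a model's learned attribute-target relation can be checked against the rule directly.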

[139] Evaluating Low-Light Image Enhancement Across Multiple Intensity Levels

Maria Pilligua, David Serrano-Lozano, Pai Peng, Ramon Baldrich, Michael S. Brown, Javier Vazquez-Corral

Main category: cs.CV

TL;DR: The paper introduces the Multi-Illumination Low-Light (MILL) dataset to address limitations in current low-light enhancement methods that rely on single-condition training data, enabling comprehensive evaluation across varying illumination intensities.

Motivation: Current learning-based low-light enhancement methods lack radiance diversity as they rely on paired training data captured under single low-light conditions, limiting understanding of performance across varying illumination intensities.

Method: Created the MILL dataset with images captured at diverse light intensities under controlled conditions with fixed camera settings and precise illuminance measurements. Proposed improvements leveraging the multi-illumination structure to enhance robustness across illumination scenarios.

Result: Benchmarking state-of-the-art methods revealed significant performance variations across intensity levels. The proposed modifications achieved up to 10 dB PSNR improvement for DSLR and 2 dB for smartphone on Full HD images.

Conclusion: The MILL dataset enables comprehensive evaluation of enhancement algorithms across variable lighting conditions, and the proposed multi-illumination approach significantly improves robustness and performance across diverse illumination scenarios.

Abstract: Imaging in low-light environments is challenging due to reduced scene radiance, which leads to elevated sensor noise and reduced color saturation. Most learning-based low-light enhancement methods rely on paired training data captured under a single low-light condition and a well-lit reference. The lack of radiance diversity limits our understanding of how enhancement techniques perform across varying illumination intensities. We introduce the Multi-Illumination Low-Light (MILL) dataset, containing images captured at diverse light intensities under controlled conditions with fixed camera settings and precise illuminance measurements. MILL enables comprehensive evaluation of enhancement algorithms across variable lighting conditions. We benchmark several state-of-the-art methods and reveal significant performance variations across intensity levels. Leveraging the unique multi-illumination structure of our dataset, we propose improvements that enhance robustness across diverse illumination scenarios. Our modifications achieve up to 10 dB PSNR improvement for DSLR and 2 dB for the smartphone on Full HD images.
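For readers unfamiliar with the metric behind the reported gains, here is a minimal sketch of PSNR computed from the mean squared error between a reference and an enhanced image. The function name, the flattened-list image representation, and the default peak value of 255 are assumptions made for illustration; the paper's evaluation code may differ.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], enhanced: Sequence[float], peak: float = 255.0) -> float:
    """PSNR in dB between two equally sized, flattened images."""
    if len(reference) != len(enhanced) or len(reference) == 0:
        raise ValueError("images must be non-empty and the same size")
    # Mean squared error over all pixel values.
    mse = sum((r - e) ** 2 for r, e in zip(reference, enhanced)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)


reference = [100.0, 120.0, 140.0, 160.0]
enhanced = [102.0, 118.0, 143.0, 159.0]
print(f"{psnr(reference, enhanced):.2f} dB")
```

Because PSNR is logarithmic in the MSE, a 10 dB gain corresponds to a tenfold reduction in MSE, and a 2 dB gain to a reduction of roughly 1.6 times.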

[140] WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation

Yuwei Niu, Munan Ning, Mengren Zheng, Weiyang Jin, Bin Lin, Peng Jin, Jiaqi Liao, Chaoran Feng, Kunpeng Ning, Bin Zhu, Li Yuan

Main category: cs.CV

TL;DR: WISE is a benchmark for evaluating world knowledge integration in text-to-image models, featuring 1000 prompts across 25 subdomains and introducing WiScore metric to assess knowledge-image alignment.

Motivation: Existing T2I evaluation focuses on realism and basic text-image alignment, lacking comprehensive assessment of complex semantic understanding and world knowledge integration.

Method: Created WISE benchmark with 1000 structured prompts across 25 subdomains in cultural common sense, spatio-temporal reasoning, and natural science. Introduced WiScore metric to overcome CLIP limitations.

Result: Comprehensive testing of 20 models revealed significant limitations in their ability to effectively integrate and apply world knowledge during image generation.

Conclusion: The findings highlight critical pathways for enhancing knowledge incorporation and application in next-generation T2I models.

Abstract: Text-to-Image (T2I) models are capable of generating high-quality artistic creations and visual content. However, existing research and evaluation standards predominantly focus on image realism and shallow text-image alignment, lacking a comprehensive assessment of complex semantic understanding and world knowledge integration in text-to-image generation. To address this challenge, we propose WISE, the first benchmark specifically designed for World Knowledge-Informed Semantic Evaluation. WISE moves beyond simple word-pixel mapping by challenging models with 1000 meticulously crafted prompts across 25 subdomains in cultural common sense, spatio-temporal reasoning, and natural science. To overcome the limitations of traditional CLIP metric, we introduce WiScore, a novel quantitative metric for assessing knowledge-image alignment. Through comprehensive testing of 20 models (10 dedicated T2I models and 10 unified multimodal models) using 1,000 structured prompts spanning 25 subdomains, our findings reveal significant limitations in their ability to effectively integrate and apply world knowledge during image generation, highlighting critical pathways for enhancing knowledge incorporation and application in next-generation T2I models. Code and data are available at https://github.com/PKU-YuanGroup/WISE.

[141] Learning to Expand Images for Efficient Visual Autoregressive Modeling

Ruiqing Yang, Kaixin Zhang, Zheng Zhang, Shan You, Tao Huang

Main category: cs.CV

TL;DR: EAR introduces a biologically inspired autoregressive image generation method that expands tokens in a spiral pattern from center outward, enabling efficient parallel decoding and achieving state-of-the-art fidelity-efficiency trade-offs.

Motivation: Existing autoregressive visual generation models suffer from inefficiency due to token-by-token decoding or complex multi-scale representations, motivating a more efficient approach inspired by human visual perception.

Method: Proposes Expanding Autoregressive Representation (EAR) that unfolds image tokens in spiral order from center outward, with length-adaptive decoding that dynamically adjusts tokens per step for parallel processing.

Result: Extensive experiments on ImageNet show EAR achieves state-of-the-art trade-offs between fidelity and efficiency for single-scale autoregressive models.

Conclusion: EAR sets a new direction for scalable and cognitively aligned autoregressive image generation by aligning generation order with perceptual relevance and reducing computational costs.

Abstract: Autoregressive models have recently shown great promise in visual generation by leveraging discrete token sequences akin to language modeling. However, existing approaches often suffer from inefficiency, either due to token-by-token decoding or the complexity of multi-scale representations. In this work, we introduce Expanding Autoregressive Representation (EAR), a novel generation paradigm that emulates the human visual system’s center-outward perception pattern. EAR unfolds image tokens in a spiral order from the center and progressively expands outward, preserving spatial continuity and enabling efficient parallel decoding. To further enhance flexibility and speed, we propose a length-adaptive decoding strategy that dynamically adjusts the number of tokens predicted at each step. This biologically inspired design not only reduces computational cost but also improves generation quality by aligning the generation order with perceptual relevance. Extensive experiments on ImageNet demonstrate that EAR achieves state-of-the-art trade-offs between fidelity and efficiency on single-scale autoregressive models, setting a new direction for scalable and cognitively aligned autoregressive image generation.
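The center-outward ordering described above can be sketched as a plain spiral walk over the token grid. This is only an illustration of the ordering idea: the function name, the choice of starting cell for even-sized grids, and the clockwise direction are assumptions, and the length-adaptive decoding schedule is not modeled.

```python
from typing import List, Tuple


def spiral_order(height: int, width: int) -> List[Tuple[int, int]]:
    """Return every (row, col) cell of a grid, spiralling outward from the center."""
    if height < 1 or width < 1:
        raise ValueError("grid must have at least one cell")
    row, col = (height - 1) // 2, (width - 1) // 2
    order = [(row, col)]
    moves = [(0, 1), (1, 0), (0, -1), (-1, 0)]  # right, down, left, up
    step, direction = 1, 0
    while len(order) < height * width:
        for _ in range(2):  # each step length is used for two directions
            d_row, d_col = moves[direction % 4]
            for _ in range(step):
                row, col = row + d_row, col + d_col
                # Cells outside the grid are walked over but not emitted.
                if 0 <= row < height and 0 <= col < width:
                    order.append((row, col))
            direction += 1
        step += 1
    return order


print(spiral_order(3, 3))
```

Consecutive tokens in this order are always spatial neighbours on a square grid, which is the continuity property the abstract refers to.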

[142] Multi-Text Guided Few-Shot Semantic Segmentation

Qiang Jiao, Bin Yan, Yi Yang, Mengrui Shi, Qiang Zhang

Main category: cs.CV

TL;DR: MTGNet improves few-shot semantic segmentation by using multiple textual prompts instead of a single prompt, addressing incomplete target activation and semantic diversity issues through cross-modal optimization.

Motivation: Single textual prompts in CLIP-based methods fail to capture semantic diversity of complex categories, leading to incomplete target activation and vulnerability to noisy support features.

Method: Proposes MTGNet with three modules: Multi-Textual Prior Refinement (MTPR) for enhanced foreground activation, Text Anchor Feature Fusion (TAFF) for semantic consistency, and Foreground Confidence-Weighted Attention (FCWA) for visual prior robustness.

Result: Achieves 76.8% mIoU on PASCAL-5i and 57.4% on COCO-20i in 1-shot setting, with notable improvements in folds with high intra-class variations.

Conclusion: MTGNet effectively addresses limitations of single-prompt approaches by leveraging multiple textual prompts and cross-modal optimization, significantly improving few-shot semantic segmentation performance.

Abstract: Recent CLIP-based few-shot semantic segmentation methods introduce class-level textual priors to assist segmentation by typically using a single prompt (e.g., a photo of class). However, these approaches often result in incomplete activation of target regions, as a single textual description cannot fully capture the semantic diversity of complex categories. Moreover, they lack explicit cross-modal interaction and are vulnerable to noisy support features, further degrading visual prior quality. To address these issues, we propose the Multi-Text Guided Few-Shot Semantic Segmentation Network (MTGNet), a dual-branch framework that enhances segmentation performance by fusing diverse textual prompts to refine textual priors and guide the cross-modal optimization of visual priors. Specifically, we design a Multi-Textual Prior Refinement (MTPR) module that suppresses interference and aggregates complementary semantic cues to enhance foreground activation and expand semantic coverage for structurally complex objects. We introduce a Text Anchor Feature Fusion (TAFF) module, which leverages multi-text embeddings as semantic anchors to facilitate the transfer of discriminative local prototypes from support images to query images, thereby improving semantic consistency and alleviating intra-class variations. Furthermore, a Foreground Confidence-Weighted Attention (FCWA) module is presented to enhance visual prior robustness by leveraging internal self-similarity within support foreground features. It adaptively down-weights inconsistent regions and effectively suppresses interference in the query segmentation process. Extensive experiments on standard FSS benchmarks validate the effectiveness of MTGNet. In the 1-shot setting, it achieves 76.8% mIoU on PASCAL-5i and 57.4% on COCO-20i, with notable improvements in folds exhibiting high intra-class variations.

[143] A Hybrid CNN-ViT-GNN Framework with GAN-Based Augmentation for Intelligent Weed Detection in Precision Agriculture

Pandiyaraju V, Abishek Karthik, Sreya Mynampati, Poovarasan L, D. Saraswathi

Main category: cs.CV

TL;DR: Hybrid deep learning framework combining CNNs, ViTs, and GNNs achieves 99.33% accuracy for weed detection, enabling sustainable precision agriculture through selective herbicide application.

Motivation: Accurate weed species identification is essential for precision agriculture to enable selective herbicide application and support sustainable crop management practices.

Method: Hybrid framework using CNNs, Vision Transformers, and Graph Neural Networks with GAN-based augmentation and self-supervised contrastive pre-training for robust feature learning from limited data.

Result: Achieved 99.33% accuracy, precision, recall, and F1-score on multi-benchmark datasets, demonstrating superior performance in weed detection.

Conclusion: The framework enables real-time deployment to edge devices, reduces herbicide over-reliance, and provides scalable sustainable precision farming solutions with high interpretability.

Abstract: The task of weed detection is an essential element of precision agriculture since accurate species identification allows a farmer to selectively apply herbicides and supports sustainable crop management. This paper proposes a hybrid deep learning framework for weed detection that utilizes Convolutional Neural Networks (CNNs), Vision Transformers (ViTs), and Graph Neural Networks (GNNs) to build robustness to multiple field conditions. A Generative Adversarial Network (GAN)-based augmentation method was applied to balance class distributions and better generalize the model. Further, a self-supervised contrastive pre-training method helps to learn more features from limited annotated data. Experiments yield superior results with 99.33% accuracy, precision, recall, and F1-score on multi-benchmark datasets. The proposed model architecture enables local, global, and relational feature representations and offers high interpretability and adaptability. Practically, the framework allows real-time, efficient deployment to edge devices for automated weed detection, reducing over-reliance on herbicides and providing scalable, sustainable precision-farming options.

[144] CompTrack: Information Bottleneck-Guided Low-Rank Dynamic Token Compression for Point Cloud Tracking

Sifan Zhou, Yichao Cao, Jiahao Nie, Yuqian Fu, Ziyu Zhao, Xiaobo Lu, Shuo Wang

Main category: cs.CV

TL;DR: CompTrack is a 3D single object tracking framework that eliminates spatial redundancy from background noise and informational redundancy within foreground points using entropy-based filtering and dynamic token compression.

Motivation: Existing 3D trackers are limited by dual-redundancy in LiDAR point clouds: spatial redundancy from background noise impairs accuracy, and informational redundancy within foreground hinders efficiency.

Method: Uses Spatial Foreground Predictor (SFP) to filter background noise based on information entropy, and Information Bottleneck-guided Dynamic Token Compression (IB-DTC) that employs online SVD analysis to compress redundant foreground into compact proxy tokens.

Result: Achieves top tracking performance on the KITTI, nuScenes, and Waymo datasets with superior efficiency, running at 90 FPS on a single RTX 3090 GPU.

Conclusion: CompTrack effectively addresses both spatial and informational redundancy in 3D point cloud tracking, enabling real-time performance while maintaining high accuracy.

Abstract: 3D single object tracking (SOT) in LiDAR point clouds is a critical task in computer vision and autonomous driving. Despite great success having been achieved, the inherent sparsity of point clouds introduces a dual-redundancy challenge that limits existing trackers: (1) vast spatial redundancy from background noise impairs accuracy, and (2) informational redundancy within the foreground hinders efficiency. To tackle these issues, we propose CompTrack, a novel end-to-end framework that systematically eliminates both forms of redundancy in point clouds. First, CompTrack incorporates a Spatial Foreground Predictor (SFP) module to filter out irrelevant background noise based on information entropy, addressing spatial redundancy. Subsequently, its core is an Information Bottleneck-guided Dynamic Token Compression (IB-DTC) module that eliminates the informational redundancy within the foreground. Theoretically grounded in low-rank approximation, this module leverages an online SVD analysis to adaptively compress the redundant foreground into a compact and highly informative set of proxy tokens. Extensive experiments on KITTI, nuScenes and Waymo datasets demonstrate that CompTrack achieves top-performing tracking performance with superior efficiency, running at a real-time 90 FPS on a single RTX 3090 GPU.

[145] Scriboora: Rethinking Human Pose Forecasting

Daniel Bermuth, Alexander Poeppel, Wolfgang Reif

Main category: cs.CV

TL;DR: This paper addresses human pose forecasting by evaluating various algorithms, revealing reproducibility issues, and proposing a unified pipeline. It adapts speech models to improve state-of-the-art performance and assesses robustness using noisy pose estimates.

Motivation: Human pose forecasting has important applications in action recognition, autonomous driving, and human-robot interaction. The paper aims to address reproducibility issues and improve performance by leveraging insights from speech understanding models.

Method: The authors evaluate a range of pose forecasting algorithms, create a unified training and evaluation pipeline, adapt recent speech models to pose forecasting, and test robustness using noisy joint coordinates from pose estimators.

Result: Speech models adapted to pose forecasting improve state-of-the-art performance. However, estimated poses cause substantial performance degradation, though some of this can be recovered through unsupervised finetuning.

Conclusion: The paper demonstrates that speech models can be effectively adapted for pose forecasting, improving performance. It also highlights the impact of realistic noise from pose estimators and shows that unsupervised finetuning can partially mitigate performance degradation.

Abstract: Human pose forecasting predicts future poses based on past observations, and has many significant applications in areas such as action recognition, autonomous driving or human-robot interaction. This paper evaluates a wide range of pose forecasting algorithms in the task of absolute pose forecasting, revealing many reproducibility issues, and provides a unified training and evaluation pipeline. After drawing a high-level analogy to the task of speech understanding, it is shown that recent speech models can be efficiently adapted to the task of pose forecasting, and improve current state-of-the-art performance. Finally, the robustness of the models is evaluated using noisy joint coordinates obtained from a pose estimator model, to reflect a realistic type of noise that is closer to real-world applications. For this, a new dataset variation is introduced, and it is shown that estimated poses result in a substantial performance degradation, and how much of it can be recovered by unsupervised finetuning.

[146] Transferable Dual-Domain Feature Importance Attack against AI-Generated Image Detector

Weiheng Zhu, Gang Cao, Jing Liu, Lifang Yu, Shaowei Weng

Main category: cs.CV

TL;DR: Proposes DuFIA, a dual-domain adversarial attack method that combines spatial and frequency features to fool AI-generated image detectors with high transferability.

Motivation: To develop advanced adversarial attacks for evaluating the security of AI-generated image detectors, which remain insufficiently explored in antiforensics.

Method: Uses spatially interpolated gradient and frequency-aware perturbation to capture forensically important features, then fuses spatial and frequency-domain feature importances to guide adversarial example generation.

Result: Extensive experiments show DuFIA achieves cross-model transferability, transparency and robustness across various AIGI detectors.

Conclusion: DuFIA successfully invalidates AIGI detectors to some extent by leveraging dual-domain feature importance modeling.

Abstract: Recent AI-generated image (AIGI) detectors achieve impressive accuracy under clean conditions. From an antiforensics perspective, it is important to develop advanced adversarial attacks for evaluating the security of such detectors, a topic that remains insufficiently explored. This letter proposes a Dual-domain Feature Importance Attack (DuFIA) scheme to invalidate AIGI detectors to some extent. Forensically important features are captured by the spatially interpolated gradient and frequency-aware perturbation. The adversarial transferability is enhanced by jointly modeling spatial and frequency-domain feature importances, which are fused to guide the optimization-based adversarial example generation. Extensive experiments across various AIGI detectors verify the cross-model transferability, transparency and robustness of DuFIA.

[147] The SA-FARI Dataset: Segment Anything in Footage of Animals for Recognition and Identification

Dante Francisco Wasmuht, Otto Brookes, Maximillian Schall, Pablo Palencia, Chris Beirne, Tilo Burghardt, Majid Mirmehdi, Hjalmar Kühl, Mimi Arandjelovic, Sam Pottie, Peter Bermant, Brandon Asheim, Yi Jin Toh, Adam Elzinga, Jason Holmberg, Andrew Whitworth, Eleanor Flatt, Laura Gustafson, Chaitanya Ryali, Yuan-Ting Hu, Baishan Guo, Andrew Westbury, Kate Saenko, Didac Suris

Main category: cs.CV

TL;DR: SA-FARI is the largest open-source multi-animal tracking dataset for wildlife conservation, featuring 11,609 camera trap videos from 741 locations across 4 continents, spanning 99 species with comprehensive spatio-temporal annotations.

Motivation: Existing datasets for multi-animal tracking are limited in scale, species diversity, and geographical coverage, lacking suitable benchmarks for training general-purpose models applicable across wild animal populations.

Method: Collected 11,609 camera trap videos over 10 years (2014-2024) from 741 locations across 4 continents, exhaustively annotated with 16,224 masklet identities, 942,702 bounding boxes, segmentation masks, and species labels.

Result: Created the largest open-source MAT dataset with ~46 hours of densely annotated footage, providing comprehensive benchmarks using state-of-the-art vision-language models including SAM 3, evaluated with species-specific and generic animal prompts.

Conclusion: SA-FARI is the first large-scale dataset combining high species diversity, multi-region coverage, and high-quality spatio-temporal annotations, offering a new foundation for advancing generalizable multi-animal tracking in the wild.

Abstract: Automated video analysis is critical for wildlife conservation. A foundational task in this domain is multi-animal tracking (MAT), which underpins applications such as individual re-identification and behavior recognition. However, existing datasets are limited in scale, constrained to a few species, or lack sufficient temporal and geographical diversity - leaving no suitable benchmark for training general-purpose MAT models applicable across wild animal populations. To address this, we introduce SA-FARI, the largest open-source MAT dataset for wild animals. It comprises 11,609 camera trap videos collected over approximately 10 years (2014-2024) from 741 locations across 4 continents, spanning 99 species categories. Each video is exhaustively annotated culminating in ~46 hours of densely annotated footage containing 16,224 masklet identities and 942,702 individual bounding boxes, segmentation masks, and species labels. Alongside the task-specific annotations, we publish anonymized camera trap locations for each video. Finally, we present comprehensive benchmarks on SA-FARI using state-of-the-art vision-language models for detection and tracking, including SAM 3, evaluated with both species-specific and generic animal prompts. We also compare against vision-only methods developed specifically for wildlife analysis. SA-FARI is the first large-scale dataset to combine high species diversity, multi-region coverage, and high-quality spatio-temporal annotations, offering a new foundation for advancing generalizable multi-animal tracking in the wild. The dataset is available at https://www.conservationxlabs.com/sa-fari.

[148] Euclid’s Gift: Enhancing Spatial Perception and Reasoning in Vision-Language Models via Geometric Surrogate Tasks

Shijie Lian, Changti Wu, Laurence Tianruo Yang, Hang Yuan, Bin Yu, Lei Zhang, Kai Chen

Main category: cs.CV

TL;DR: Fine-tuning MLLMs on Euclidean geometry problems (Euclid30K dataset) enables transferable spatial reasoning skills across multiple benchmarks without task-specific adaptations.

Motivation: Spatial intelligence remains a critical unresolved challenge for Multimodal Large Language Models (MLLMs), including abilities like visualizing shapes, mental rotation, judging positions, and estimating numerosity.

Method: Created Euclid30K dataset with ~30K plane and solid geometry problems, then fine-tuned 7 model variants (3-72B parameters) from Qwen2.5VL, Qwen3VL, and RoboBrain2.0 families using Group Relative Policy Optimization (GRPO) to learn Euclidean principles.

Result: Models achieved substantial zero-shot gains across four spatial reasoning benchmarks without task-specific adaptations. Mean VSI-Bench accuracy rose from 36.6% to 41.8% (+5.2%), and mean MindCube accuracy rose from 31.4% to 38.1% (+6.7%).

Conclusion: First systematic study showing that geometry-centric fine-tuning can give vision-language models broadly transferable spatial skills, demonstrating the effectiveness of Euclidean geometry as a surrogate task for spatial reasoning.

Abstract: Spatial intelligence spans a rich suite of abilities, including visualising and transforming shapes, mentally rotating objects, judging relational positions and containment, and estimating numerosity. However, it still remains a critical unresolved challenge for Multimodal Large Language Models (MLLMs). To fill this gap, we propose to treat Euclidean geometry problem-solving as a surrogate task. Specifically, we meticulously constructed a curated multimodal dataset, called Euclid30K, comprising approximately 30K plane and solid geometry problems. Furthermore, to enable the model to learn and apply Euclidean principles from these geometry problems, we fine-tuned seven model variants (spanning 3–72B parameters) from the Qwen2.5VL, Qwen3VL, and RoboBrain2.0 families using Group Relative Policy Optimization (GRPO), inspiring the models to identify shapes, count, and relate entities, and perform multi-step deductive reasoning using Euclidean principles. Our experiments demonstrate that the resulting models achieve substantial zero-shot gains across four spatial reasoning benchmarks (Super-CLEVR, Omni3DBench, VSI-Bench, and MindCube) without any task-specific adaptations. Notably, after training on Euclid30K, the mean VSI-Bench accuracy rose from 36.6% to 41.8% (+5.2%), and the mean MindCube accuracy rose from 31.4% to 38.1% (+6.7%). To our knowledge, this is the first systematic study showing that geometry-centric fine-tuning can give vision-language models broadly transferable spatial skills. Code and the Euclid30K dataset can be found at https://zgca-ai4edu.github.io/Euclids_Gift.
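For context on the training method, GRPO scores each sampled answer relative to the other answers drawn for the same prompt. The sketch below shows only that group-relative normalization step. The function name, the use of the population standard deviation, the `eps` term, and the example rewards are assumptions for illustration; it is not the authors' training code and omits the policy update entirely.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of answers sampled for the same prompt."""
    if not rewards:
        raise ValueError("need at least one reward")
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    # eps avoids division by zero when every answer in the group gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group: 1.0 where the final answer was judged correct, else 0.0.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Answers that beat the group average receive positive advantages and the rest negative ones, so no separate value network is needed to provide a baseline.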

[149] From Low-Rank Features to Encoding Mismatch: Rethinking Feature Distillation in Vision Transformers

Huiyuan Tian, Bonan Xu, Shijian Li, Xin Jin

Main category: cs.CV

TL;DR: Feature-map knowledge distillation fails for Vision Transformers due to encoding mismatch despite global low-rank structure. Token-level analysis reveals high-bandwidth encoding patterns, leading to proposed strategies that reactivate effective distillation.

Motivation: To understand why feature-map knowledge distillation works well for convolutional networks but fails for Vision Transformers, and to develop effective distillation methods for ViTs based on representation analysis.

Method: Conducted two-view representation analysis: layer-wise SVD showing global low-rank structure, and token-level Spectral Energy Pattern analysis revealing high-bandwidth encoding. Proposed two strategies: post-hoc feature lifting with lightweight projector, and native width alignment that widens only student’s last block.

Result: Successfully reactivated feature-map distillation for ViTs, raising DeiT-Tiny accuracy from 74.86% to 77.53% and 78.23% when distilling from CaiT-S24. Also improved standalone students trained without teachers.

Conclusion: The analysis explains ViT feature distillation failure and shows how exploiting low-rank structure yields effective remedies and design guidance for compact ViTs.

Abstract: Feature-map knowledge distillation (KD) is highly effective for convolutional networks but often fails for Vision Transformers (ViTs). To understand this failure and guide method design, we conduct a two-view representation analysis of ViTs. First, a layer-wise Singular Value Decomposition (SVD) of full feature matrices shows that final-layer representations are globally low-rank: for CaiT-S24, only $121/61/34/14$ dimensions suffice to capture $99\%/95\%/90\%/80\%$ of the energy. In principle, this suggests that a compact student plus a simple linear projector should be enough for feature alignment, contradicting the weak empirical performance of standard feature KD. To resolve this paradox, we introduce a token-level Spectral Energy Pattern (SEP) analysis that measures how each token uses channel capacity. SEP reveals that, despite the global low-rank structure, individual tokens distribute energy over most channels, forming a high-bandwidth encoding pattern. This results in an encoding mismatch between wide teachers and narrow students. Motivated by this insight, we propose two minimal, mismatch-driven strategies: (1) post-hoc feature lifting with a lightweight projector retained during inference, or (2) native width alignment that widens only the student’s last block to the teacher’s width. On ImageNet-1K, these strategies reactivate simple feature-map distillation in ViTs, raising DeiT-Tiny accuracy from $74.86\%$ to $77.53\%$ and $78.23\%$ when distilling from CaiT-S24, while also improving standalone students trained without any teacher. Our analysis thus explains why ViT feature distillation fails and shows how exploiting low-rank structure yields effective, interpretable remedies and concrete design guidance for compact ViTs.
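The "dimensions needed to capture a given share of the energy" figure can be read off a singular-value spectrum as sketched below. The sketch assumes that energy means the sum of squared singular values, which the abstract does not state, and the function name and toy spectrum are hypothetical. Computing the singular values themselves (for example with an SVD routine from a numerical library) is left out.

```python
from typing import Dict, Sequence


def dims_for_energy(singular_values: Sequence[float],
                    thresholds: Sequence[float]) -> Dict[float, int]:
    """For each threshold in (0, 1], count the leading singular directions
    needed to reach that fraction of the total squared-singular-value energy."""
    energies = sorted((s * s for s in singular_values), reverse=True)
    total = sum(energies)
    if total <= 0:
        raise ValueError("spectrum must contain a non-zero singular value")
    result: Dict[float, int] = {}
    for t in thresholds:
        running = 0.0
        for k, e in enumerate(energies, start=1):
            running += e
            if running >= t * total:
                result[t] = k
                break
        else:
            # Guard against floating-point round-off when t is at or near 1.0.
            result[t] = len(energies)
    return result


toy_spectrum = [3.0, 2.0, 1.0, 0.5]  # made-up values, not from the paper
print(dims_for_energy(toy_spectrum, [0.99, 0.95, 0.90, 0.60]))
```

A spectrum is globally low-rank in this sense when the counts are small relative to the feature width, even though, as the abstract notes, individual tokens may still spread their energy over most channels.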

[150] Conflict Adaptation in Vision-Language Models

Xiaoyang Hu

Main category: cs.CV

TL;DR: Vision-language models exhibit human-like conflict adaptation in Stroop tasks, with neural analysis revealing specialized supernodes that handle text/color processing and conflict modulation.

Motivation: To investigate whether AI models demonstrate human-like cognitive control mechanisms like conflict adaptation, and understand the neural basis of this behavior in vision-language models.

Method: Used sequential Stroop task to test 13 VLMs, then employed sparse autoencoders (SAEs) to analyze task-relevant supernodes in InternVL 3.5 4B model, including ablation studies.

Result: 12 of 13 VLMs showed conflict adaptation behavior; SAEs revealed partially overlapping text/color supernodes in early/late layers, and identified a conflict-modulated supernode in layers 24-25 whose ablation increased Stroop errors.

Conclusion: VLMs can exhibit human-like cognitive control patterns, with specialized neural representations that mirror human automaticity asymmetries and conflict processing mechanisms.

Abstract: A signature of human cognitive control is conflict adaptation: improved performance on a high-conflict trial following another high-conflict trial. This phenomenon offers an account for how cognitive control, a scarce resource, is recruited. Using a sequential Stroop task, we find that 12 of 13 vision-language models (VLMs) tested exhibit behavior consistent with conflict adaptation, with the lone exception likely reflecting a ceiling effect. To understand the representational basis of this behavior, we use sparse autoencoders (SAEs) to identify task-relevant supernodes in InternVL 3.5 4B. Partially overlapping supernodes emerge for text and color in both early and late layers, and their relative sizes mirror the automaticity asymmetry between reading and color naming in humans. We further isolate a conflict-modulated supernode in layers 24-25 whose ablation significantly increases Stroop errors while minimally affecting congruent trials.
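
Conflict adaptation, as defined in the abstract, compares performance on high-conflict trials that follow a high-conflict trial against those that follow a low-conflict trial. The sketch below shows one way to compute that difference from an ordered trial log. The trial encoding and the function name are assumptions for illustration, and the paper's own scoring procedure may differ.

```python
def conflict_adaptation_effect(trials):
    """Accuracy on incongruent trials after incongruent minus after congruent.

    trials: ordered list of (is_incongruent, is_correct) pairs (hypothetical format).
    Returns None when either comparison bin is empty.
    """
    after_incongruent, after_congruent = [], []
    for (prev_incongruent, _), (cur_incongruent, cur_correct) in zip(trials, trials[1:]):
        if not cur_incongruent:
            continue  # only high-conflict current trials are scored
        if prev_incongruent:
            after_incongruent.append(cur_correct)
        else:
            after_congruent.append(cur_correct)
    if not after_incongruent or not after_congruent:
        return None

    def accuracy(outcomes):
        return sum(outcomes) / len(outcomes)

    return accuracy(after_incongruent) - accuracy(after_congruent)
```

A positive value is the pattern the abstract describes: better performance on a high-conflict trial when the preceding trial was also high-conflict.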

[151] AVATAAR: Agentic Video Answering via Temporal Adaptive Alignment and Reasoning

Urjitkumar Patel, Fang-Chun Yeh, Chinmay Gondhalekar

Main category: cs.CV

TL;DR: AVATAAR is a modular framework for long-form video QA that combines global and local context with iterative reasoning through a feedback loop between a Pre Retrieval Thinking Agent and Rethink Module, achieving significant performance gains on the CinePile benchmark.

Motivation: Current large vision language models struggle with nuanced queries requiring comprehensive understanding and detailed analysis of long-form videos, necessitating a more sophisticated approach to video question answering.

Method: AVATAAR uses a modular framework with persistent global video summaries, local context analysis, a Pre Retrieval Thinking Agent, and a Rethink Module that creates a feedback loop for iterative reasoning and strategy refinement.

Result: On the CinePile benchmark, relative gains over a baseline of +5.6% in temporal reasoning, +5% in technical queries, +8% in theme-based questions, and +8.2% in narrative comprehension. Each module contributes positively, with the feedback loop being crucial for adaptability.

Conclusion: AVATAAR provides a scalable solution for long-form Video QA that effectively combines accuracy, interpretability, and extensibility, demonstrating significant improvements in video understanding capabilities.

Abstract: With the increasing prevalence of video content, effectively understanding and answering questions about long form videos has become essential for numerous applications. Although large vision language models (LVLMs) have enhanced performance, they often face challenges with nuanced queries that demand both a comprehensive understanding and detailed analysis. To overcome these obstacles, we introduce AVATAAR, a modular and interpretable framework that combines global and local video context, along with a Pre Retrieval Thinking Agent and a Rethink Module. AVATAAR creates a persistent global summary and establishes a feedback loop between the Rethink Module and the Pre Retrieval Thinking Agent, allowing the system to refine its retrieval strategies based on partial answers and replicate human-like iterative reasoning. On the CinePile benchmark, AVATAAR demonstrates significant improvements over a baseline, achieving relative gains of +5.6% in temporal reasoning, +5% in technical queries, +8% in theme-based questions, and +8.2% in narrative comprehension. Our experiments confirm that each module contributes positively to the overall performance, with the feedback loop being crucial for adaptability. These findings highlight AVATAAR’s effectiveness in enhancing video understanding capabilities. Ultimately, AVATAAR presents a scalable solution for long-form Video Question Answering (QA), merging accuracy, interpretability, and extensibility.

[152] GEO-Bench-2: From Performance to Capability, Rethinking Evaluation in Geospatial AI

Naomi Simumba, Nils Lehmann, Paolo Fraccaro, Hamed Alemohammad, Geeth De Mel, Salman Khan, Manil Maskey, Nicolas Longepe, Xiao Xiang Zhu, Hannah Kerner, Juan Bernabe-Moreno, Alexander Lacoste

Main category: cs.CV

TL;DR: GEO-Bench-2 introduces a standardized evaluation framework for Geospatial Foundation Models across 19 datasets and multiple tasks, showing no single model dominates all capabilities.

Motivation: Address the lack of standardized evaluation protocols for Geospatial Foundation Models in Earth Observation to enable fair comparison and identify areas needing improvement.

Method: Comprehensive framework spanning classification, segmentation, regression, object detection, and instance segmentation across 19 datasets, with capability groups to rank models based on shared characteristics.

Result: No single model dominates across all tasks; natural image models excel on high-resolution tasks while EO-specific models outperform on multispectral applications like agriculture and disaster response.

Conclusion: Optimal model choice depends on task requirements and data modalities, and the goal of a single GeoFM that performs well across all tasks remains open for future research.

Abstract: Geospatial Foundation Models (GeoFMs) are transforming Earth Observation (EO), but evaluation lacks standardized protocols. GEO-Bench-2 addresses this with a comprehensive framework spanning classification, segmentation, regression, object detection, and instance segmentation across 19 permissively-licensed datasets. We introduce “capability” groups to rank models on datasets that share common characteristics (e.g., resolution, bands, temporality). This enables users to identify which models excel in each capability and determine which areas need improvement in future work. To support both fair comparison and methodological innovation, we define a prescriptive yet flexible evaluation protocol. This not only ensures consistency in benchmarking but also facilitates research into model adaptation strategies, a key and open challenge in advancing GeoFMs for downstream tasks. Our experiments show that no single model dominates across all tasks, confirming the specificity of the choices made during architecture design and pretraining. While models pretrained on natural images (ConvNext ImageNet, DINO V3) excel on high-resolution tasks, EO-specific models (TerraMind, Prithvi, and Clay) outperform them on multispectral applications such as agriculture and disaster response. These findings demonstrate that optimal model choice depends on task requirements, data modalities, and constraints. This shows that the goal of a single GeoFM model that performs well across all tasks remains open for future research. GEO-Bench-2 enables informed, reproducible GeoFM evaluation tailored to specific use cases. Code, data, and leaderboard for GEO-Bench-2 are publicly released under a permissive license.

[153] Learning from Mistakes: Loss-Aware Memory Enhanced Continual Learning for LiDAR Place Recognition

Xufei Wang, Junqiao Zhao, Siyue Tao, Qiwen Gu, Wonbong Kim, Tiantian Feng

Main category: cs.CV

TL;DR: KDF+ is a continual learning framework for LiDAR place recognition that addresses catastrophic forgetting through loss-aware sampling and rehearsal enhancement mechanisms.

Motivation: Existing LiDAR place recognition methods struggle with catastrophic forgetting when adapting to new environments while retaining previously learned knowledge.

Method: Extends KDF paradigm with loss-aware sampling (estimates learning difficulty via loss values) and rehearsal enhancement (refines memory samples during new-task training).

Result: Outperforms existing continual learning methods across multiple benchmarks and integrates seamlessly into state-of-the-art frameworks for stable performance gains.

Conclusion: KDF+ effectively mitigates catastrophic forgetting in LiDAR place recognition through intelligent sampling and rehearsal techniques, enabling better adaptation to new environments while preserving learned knowledge.

Abstract: LiDAR place recognition plays a crucial role in SLAM, robot navigation, and autonomous driving. However, existing LiDAR place recognition methods often struggle to adapt to new environments without forgetting previously learned knowledge, a challenge widely known as catastrophic forgetting. To address this issue, we propose KDF+, a novel continual learning framework for LiDAR place recognition that extends the KDF paradigm with a loss-aware sampling strategy and a rehearsal enhancement mechanism. The proposed sampling strategy estimates the learning difficulty of each sample via its loss value and selects samples for replay according to their estimated difficulty. Harder samples, which tend to encode more discriminative information, are sampled with higher probability while maintaining distributional coverage across the dataset. In addition, the rehearsal enhancement mechanism encourages memory samples to be further refined during new-task training by slightly reducing their loss relative to previous tasks, thereby reinforcing long-term knowledge retention. Extensive experiments across multiple benchmarks demonstrate that KDF+ consistently outperforms existing continual learning methods and can be seamlessly integrated into state-of-the-art continual learning frameworks for LiDAR place recognition to yield significant and stable performance gains. The code will be available at https://github.com/repo/KDF-plus.
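
The loss-aware sampling idea (harder samples drawn with higher probability, while coverage is kept) can be sketched as weighted sampling without replacement. This is a minimal illustration under stated assumptions: the weights blend a loss-proportional term with a uniform term, and `uniform_mix` is a hypothetical knob. The abstract does not give the actual KDF+ weighting, so this should not be read as the authors' implementation.

```python
import random


def loss_aware_sample(losses, k, uniform_mix=0.2, seed=0):
    """Pick k distinct indices; higher-loss samples are more likely to be chosen.

    losses: non-negative per-sample loss values.
    uniform_mix: share of probability spread uniformly so that easy samples
                 keep a nonzero chance (hypothetical stand-in for coverage).
    """
    n = len(losses)
    if k > n:
        raise ValueError("cannot pick more samples than are available")
    total = sum(losses)
    weights = [
        (1.0 - uniform_mix) * (loss / total if total > 0 else 1.0 / n) + uniform_mix / n
        for loss in losses
    ]
    rng = random.Random(seed)
    candidates = list(range(n))
    chosen = []
    for _ in range(k):
        pick = rng.choices(candidates, weights=[weights[i] for i in candidates], k=1)[0]
        chosen.append(pick)
        candidates.remove(pick)  # sample without replacement
    return chosen
```

Setting `uniform_mix` to 1.0 reduces this to uniform replay selection, which makes the effect of the loss term easy to compare in isolation.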

[154] MF-GCN: A Multi-Frequency Graph Convolutional Network for Tri-Modal Depression Detection Using Eye-Tracking, Facial, and Acoustic Features

Sejuti Rahman, Swakshar Deb, MD. Sameer Iqbal Chowdhury, MD. Jubair Ahmed Sourov, Mohammad Shamsuddin

Main category: cs.CV

TL;DR: Proposes MF-GCN framework using multi-frequency graph convolutional networks for depression detection from eye tracking, audio, and video data, achieving high performance in binary and multi-class classification across datasets.

Motivation: Address limitations of existing graph-based models that focus only on low-frequency information, and leverage multiple modalities (eye tracking, audio, video) to capture depression-related features like attentional bias and affective flattening.

Method: Multi-Frequency Graph Convolutional Network (MF-GCN) with Multi-Frequency Filter Bank Module (MFFBM) that processes both low and high frequency signals from trimodal data.

Result: Binary classification: sensitivity 0.96, F2 score 0.94; 3-class classification: sensitivity 0.79, specificity 0.87; Generalization on CMDC dataset: sensitivity 0.95, F2 score 0.96 - consistently outperforms baselines.

Conclusion: The trimodal, multi-frequency framework effectively captures cross-modal interactions for accurate depression detection and demonstrates strong generalizability across datasets.

Abstract: Eye tracking data quantifies the attentional bias towards negative stimuli that is frequently observed in depressed groups. Audio and video data capture the affective flattening and psychomotor retardation characteristic of depression. Statistical validation confirmed their significant discriminative power in distinguishing depressed from non-depressed groups. We address a critical limitation of existing graph-based models that focus on low-frequency information and propose a Multi-Frequency Graph Convolutional Network (MF-GCN). This framework consists of a novel Multi-Frequency Filter Bank Module (MFFBM), which can leverage both low- and high-frequency signals. Extensive evaluation against traditional machine learning algorithms and deep learning frameworks demonstrates that MF-GCN consistently outperforms baselines. In binary (depressed and non-depressed) classification, the model achieved a sensitivity of 0.96 and an F2 score of 0.94. For the 3-class (no depression, mild to moderate depression, and severe depression) classification task, the proposed method achieved a sensitivity of 0.79 and a specificity of 0.87 and significantly surpassed other models. To validate generalizability, the model was also evaluated on the Chinese Multimodal Depression Corpus (CMDC) dataset and achieved a sensitivity of 0.95 and an F2 score of 0.96. These results confirm that our trimodal, multi-frequency framework effectively captures cross-modal interaction for accurate depression detection.
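
For readers less familiar with the reported metrics, the sketch below computes sensitivity, specificity, and the F-beta score using their standard definitions. With beta = 2, recall (sensitivity) is weighted more heavily than precision. The function names and any counts passed to them are illustrative and are not taken from the paper.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)


def f_beta(precision, recall, beta=2.0):
    """F-beta score; beta=2 gives the F2 score, which favours recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Illustrative values only: perfect recall with 50% precision.
print(f_beta(0.5, 1.0))
```

Because F2 leans toward recall, a screening model that misses few depressed cases can score well on F2 even with moderate precision.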

[155] US-X Complete: A Multi-Modal Approach to Anatomical 3D Shape Recovery

Miruna-Alexandra Gafencu, Yordanka Velikova, Nassir Navab, Mohammad Farid Azampour

Main category: cs.CV

TL;DR: A multi-modal deep learning method that completes occluded anatomical structures in 3D ultrasound by leveraging information from a single X-ray image, addressing ultrasound’s limitations in visualizing vertebral bodies due to bone shadowing.

Motivation: Ultrasound has limitations in visualizing complete vertebral anatomy due to acoustic shadowing effects from bone, despite being radiation-free and cost-effective for spinal procedures.

Method: Generates paired training data with 2D lateral vertebral views (simulating X-ray) and 3D partial vertebrae representations (simulating ultrasound limitations). Uses multi-modal deep learning to integrate morphological information from both ultrasound and X-ray.

Result: Significant improvements in vertebral reconstruction (p < 0.001) compared to state-of-the-art methods. Achieves accurate, complete volumetric lumbar spine visualization overlaid on ultrasound without needing registration with CT.

Conclusion: Integrating a single X-ray projection mitigates ultrasound’s key limitation while preserving its strengths as the primary imaging modality, enabling more complete spinal visualization.

Abstract: Ultrasound offers a radiation-free, cost-effective solution for real-time visualization of spinal landmarks, paraspinal soft tissues and neurovascular structures, making it valuable for intraoperative guidance during spinal procedures. However, ultrasound suffers from inherent limitations in visualizing complete vertebral anatomy, in particular vertebral bodies, due to acoustic shadowing effects caused by bone. In this work, we present a novel multi-modal deep learning method for completing occluded anatomical structures in 3D ultrasound by leveraging complementary information from a single X-ray image. To enable training, we generate paired training data consisting of: (1) 2D lateral vertebral views that simulate X-ray scans, and (2) 3D partial vertebrae representations that mimic the limited visibility and occlusions encountered during ultrasound spine imaging. Our method integrates morphological information from both imaging modalities and demonstrates significant improvements in vertebral reconstruction (p < 0.001) compared to the state of the art in 3D ultrasound vertebral completion. We perform phantom studies as an initial step toward future clinical translation, and achieve a more accurate, complete volumetric lumbar spine visualization overlaid on the ultrasound scan without the need for registration with preoperative modalities such as computed tomography. This demonstrates that integrating a single X-ray projection mitigates ultrasound’s key limitation while preserving its strengths as the primary imaging modality. Code and data can be found at https://github.com/miruna20/US-X-Complete

[156] MaskMed: Decoupled Mask and Class Prediction for Medical Image Segmentation

Bin Xie, Gady Agam

Main category: cs.CV

TL;DR: MaskMed introduces a unified decoupled segmentation head and full-scale aware deformable transformer for medical image segmentation, achieving state-of-the-art performance.

Motivation: Traditional point-wise convolutional segmentation heads with rigid class-channel binding limit feature sharing and semantic generalization in medical image segmentation.

Method: Proposes a decoupled segmentation head separating mask prediction from class prediction using shared object queries, and a Full-Scale Aware Deformable Transformer for memory-efficient full-scale feature fusion.

Result: Achieves +2.0% Dice improvement on AMOS 2022 and +6.9% Dice improvement on BTCV compared to nnUNet.

Conclusion: The proposed MaskMed method with decoupled segmentation and full-scale fusion significantly advances medical image segmentation performance.

Abstract: Medical image segmentation typically adopts a point-wise convolutional segmentation head to predict dense labels, where each output channel is heuristically tied to a specific class. This rigid design limits both feature sharing and semantic generalization. In this work, we propose a unified decoupled segmentation head that separates multi-class prediction into class-agnostic mask prediction and class label prediction using shared object queries. Furthermore, we introduce a Full-Scale Aware Deformable Transformer module that enables low-resolution encoder features to attend across full-resolution encoder features via deformable attention, achieving memory-efficient and spatially aligned full-scale fusion. Our proposed method, named MaskMed, achieves state-of-the-art performance, surpassing nnUNet by +2.0% Dice on AMOS 2022 and +6.9% Dice on BTCV.
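
To make the decoupling concrete, the sketch below turns per-query class-agnostic masks and per-query class probabilities into a per-pixel label map. It assumes a MaskFormer-style combination (sum over queries of class probability times mask probability). The abstract does not state how MaskMed merges the two outputs, so the rule and all names here are illustrative assumptions.

```python
def combine_queries(mask_probs, class_probs):
    """Per-pixel labels from decoupled mask and class predictions.

    mask_probs[q][p]:  probability that pixel p belongs to query q's mask.
    class_probs[q][c]: probability that query q carries class c.
    Assumed rule: label(p) = argmax_c sum_q class_probs[q][c] * mask_probs[q][p].
    """
    num_queries = len(mask_probs)
    num_pixels = len(mask_probs[0])
    num_classes = len(class_probs[0])
    labels = []
    for p in range(num_pixels):
        scores = [
            sum(class_probs[q][c] * mask_probs[q][p] for q in range(num_queries))
            for c in range(num_classes)
        ]
        labels.append(max(range(num_classes), key=scores.__getitem__))
    return labels


# Two queries, four pixels, two classes (toy numbers).
masks = [[0.9, 0.8, 0.1, 0.0], [0.1, 0.2, 0.9, 1.0]]
classes = [[0.1, 0.9], [0.8, 0.2]]
print(combine_queries(masks, classes))
```

Unlike a point-wise convolutional head, no output channel is tied to a class here: any query can carry any class, which is the feature-sharing benefit the abstract describes.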

[157] FlashMesh: Faster and Better Autoregressive Mesh Synthesis via Structured Speculation

Tingrui Shen, Yiheng Zhang, Chen Tang, Chuan Ping, Zixing Zhao, Le Wan, Yuwang Wang, Ronggang Wang, Shengfeng He

Main category: cs.CV

TL;DR: FlashMesh accelerates 3D mesh generation using speculative decoding that exploits structural correlations in mesh data, achieving up to 2x speedup while improving quality.

Motivation: Autoregressive models generate high-quality 3D meshes but suffer from slow token-by-token decoding, limiting practical use in interactive and large-scale applications.

Method: Introduces FlashMesh with a predict-correct-verify paradigm using speculative decoding tailored to hourglass transformer architecture, enabling parallel prediction across face, point, and coordinate levels.

Result: Achieves up to 2x speedup over standard autoregressive models while also improving generation fidelity.

Conclusion: Structural priors in mesh data can be systematically harnessed to accelerate and enhance autoregressive generation.

Abstract: Autoregressive models can generate high-quality 3D meshes by sequentially producing vertices and faces, but their token-by-token decoding results in slow inference, limiting practical use in interactive and large-scale applications. We present FlashMesh, a fast and high-fidelity mesh generation framework that rethinks autoregressive decoding through a predict-correct-verify paradigm. The key insight is that mesh tokens exhibit strong structural and geometric correlations that enable confident multi-token speculation. FlashMesh leverages this by introducing a speculative decoding scheme tailored to the commonly used hourglass transformer architecture, enabling parallel prediction across face, point, and coordinate levels. Extensive experiments show that FlashMesh achieves up to a 2x speedup over standard autoregressive models while also improving generation fidelity. Our results demonstrate that structural priors in mesh data can be systematically harnessed to accelerate and enhance autoregressive generation.
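
The general idea behind speculative decoding can be sketched as a draft-then-verify step: several tokens are proposed at once, and the longest prefix that agrees with the base model is kept. The code below is a generic greedy variant with a toy verifier. It is not FlashMesh's predict-correct-verify scheme, whose details are not given in the abstract, and `verifier_next` and `toy_verifier` are hypothetical stand-ins.

```python
def speculative_step(prefix, draft_tokens, verifier_next):
    """Accept drafted tokens while they match the verifier's greedy choice.

    verifier_next(seq) returns the token the base model would emit after seq.
    A real system scores all drafted positions in one parallel pass; the
    sequential calls here are only for readability.
    Returns the tokens appended in this step (always at least one).
    """
    accepted = []
    seq = list(prefix)
    for token in draft_tokens:
        target = verifier_next(seq)
        if token != target:
            accepted.append(target)  # first mismatch: keep the verifier's token
            return accepted
        accepted.append(token)
        seq.append(token)
    accepted.append(verifier_next(seq))  # whole draft accepted: one bonus token
    return accepted


def toy_verifier(seq):
    """Toy base model: counts upward modulo 10 (requires a non-empty seq)."""
    return (seq[-1] + 1) % 10


print(speculative_step([1, 2], [3, 4, 9], toy_verifier))
```

The speedup comes from accepting several tokens per verification pass, so it depends on how often the draft agrees with the base model. Strong structural correlations between mesh tokens, as the abstract notes, are what make such agreement likely.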

[158] Hierarchical Semantic Tree Anchoring for CLIP-Based Class-Incremental Learning

Tao Hu, Lan Li, Zhen-Hao Xie, Da-Wei Zhou

Main category: cs.CV

TL;DR: HASTEN introduces hierarchical semantic tree anchoring to reduce catastrophic forgetting in class-incremental learning by embedding features in hyperbolic space using external knowledge graphs and projecting gradients to prevent interference.

Motivation: Existing CLIP-based CIL methods fail to capture inherent hierarchical relationships in visual and linguistic concepts, leading to fine-grained class feature drift and catastrophic forgetting during incremental updates.

Method: 1) Use external knowledge graph as supervision to embed visual/textual features in hyperbolic space to preserve hierarchical structure; 2) Project gradients onto null space of shared hyperbolic mapper to prevent interference with prior tasks.

Result: Extensive experiments show HASTEN consistently outperforms existing methods while providing unified structured representation.

Conclusion: HASTEN effectively reduces catastrophic forgetting in CIL by maintaining hierarchical relationships through hyperbolic embeddings and gradient projection techniques.

Abstract: Class-Incremental Learning (CIL) enables models to learn new classes continually while preserving past knowledge. Recently, vision-language models like CLIP offer transferable features via multi-modal pre-training, making them well-suited for CIL. However, real-world visual and linguistic concepts are inherently hierarchical: a textual concept like “dog” subsumes fine-grained categories such as “Labrador” and “Golden Retriever,” and each category entails its images. But existing CLIP-based CIL methods fail to explicitly capture this inherent hierarchy, leading to fine-grained class feature drift during incremental updates and ultimately to catastrophic forgetting. To address this challenge, we propose HASTEN (Hierarchical Semantic Tree Anchoring), which anchors hierarchical information into CIL to reduce catastrophic forgetting. First, we employ an external knowledge graph as supervision to embed visual and textual features in hyperbolic space, effectively preserving hierarchical structure as data evolves. Second, to mitigate catastrophic forgetting, we project gradients onto the null space of the shared hyperbolic mapper, preventing interference with prior tasks. These two steps work synergistically to enable the model to resist forgetting by maintaining hierarchical relationships. Extensive experiments show that HASTEN consistently outperforms existing methods while providing a unified structured representation.
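
Null-space gradient projection can be illustrated with plain vectors: components of the gradient along directions that matter for earlier tasks are removed, so the update leaves those directions untouched. The sketch assumes an orthonormal basis for the protected subspace is already available. How HASTEN derives that subspace from the hyperbolic mapper is not described in the abstract, and the names below are hypothetical.

```python
def project_to_null_space(grad, protected_basis):
    """Return grad minus its components along each protected direction.

    protected_basis: orthonormal vectors spanning directions used by prior
    tasks (assumed to be given; computing them is outside this sketch).
    """
    projected = list(grad)
    for direction in protected_basis:
        coeff = sum(g * d for g, d in zip(grad, direction))
        projected = [p - coeff * d for p, d in zip(projected, direction)]
    return projected


# Protect the first two axes; only the third component of the update survives.
print(project_to_null_space([3.0, 4.0, 5.0], [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]))
```

After projection the update is orthogonal to every protected direction, which is what prevents interference with prior tasks to first order.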

[159] Multi-Stage Residual-Aware Unsupervised Deep Learning Framework for Consistent Ultrasound Strain Elastography

Shourov Joarder, Tushar Talukder Showrav, Md. Kamrul Hasan

Main category: cs.CV

TL;DR: MUSSE-Net is a multi-stage unsupervised deep learning framework for robust ultrasound strain elastography that addresses tissue decorrelation noise and inconsistent strain estimation through a novel architecture with residual refinement.

Motivation: Ultrasound Strain Elastography faces limitations due to tissue decorrelation noise, scarcity of ground truth data, and inconsistent strain estimation under varying deformation conditions, which hinder its clinical adoption.

Method: Proposes the MUSSE-Net framework with a USSE-Net backbone, a multi-stream encoder-decoder architecture that processes pre- and post-deformation RF sequences in parallel using a Context-Aware Complementary Feature Fusion encoder, a Tri-Cross Attention bottleneck, and a Cross-Attentive Fusion sequential decoder, trained with a consistency loss and followed by a residual refinement stage.

Result: State-of-the-art performance with target SNR of 24.54, background SNR of 132.76, CNR of 59.81, and elastographic SNR of 9.73 on simulation data. Enhanced lesion-to-background contrast and significant noise suppression on clinical datasets, producing clinically interpretable strain patterns.

Conclusion: MUSSE-Net effectively overcomes key limitations in ultrasound strain elastography, demonstrating superior performance and clinical utility through its unsupervised multi-stage framework with residual refinement.

Abstract: Ultrasound Strain Elastography (USE) is a powerful non-invasive imaging technique for assessing tissue mechanical properties, offering crucial diagnostic value across diverse clinical applications. However, its clinical application remains limited by tissue decorrelation noise, scarcity of ground truth, and inconsistent strain estimation under different deformation conditions. To overcome these barriers, we propose MUSSE-Net, a residual-aware, multi-stage unsupervised sequential deep learning framework designed for robust and consistent strain estimation. At its backbone lies our proposed USSE-Net, an end-to-end multi-stream encoder-decoder architecture that processes pre- and post-deformation RF sequences in parallel to estimate displacement fields and axial strains. The novel architecture incorporates a Context-Aware Complementary Feature Fusion (CACFF)-based encoder and a Tri-Cross Attention (TCA) bottleneck with a Cross-Attentive Fusion (CAF)-based sequential decoder. To ensure temporal coherence and strain stability across varying deformation levels, this architecture leverages a tailored consistency loss. Finally, with the MUSSE-Net framework, a secondary residual refinement stage further enhances accuracy and suppresses noise. Extensive validation on simulation, in vivo, and private clinical datasets from the Bangladesh University of Engineering and Technology (BUET) medical center demonstrates that MUSSE-Net outperforms existing unsupervised approaches. MUSSE-Net achieves state-of-the-art performance with a target SNR of 24.54, background SNR of 132.76, CNR of 59.81, and elastographic SNR of 9.73 on simulation data. In particular, on the BUET dataset, MUSSE-Net produces strain maps with enhanced lesion-to-background contrast and significant noise suppression, yielding clinically interpretable strain patterns.
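
Two quantities in this abstract have simple textbook forms: axial strain is the spatial derivative of axial displacement along depth, and elastographic SNR is commonly computed as the mean strain divided by its standard deviation over a nominally uniform region. The sketch below uses those common definitions on toy numbers. The paper's exact metric definitions are not given in the abstract, so treat this as an assumption, and note that it does not reproduce the reported values.

```python
import statistics


def axial_strain(displacement, dz):
    """Forward-difference derivative of axial displacement along depth.

    displacement: axial displacement samples along one scan line.
    dz: spacing between samples (same length unit as the displacement).
    """
    return [(b - a) / dz for a, b in zip(displacement, displacement[1:])]


def elastographic_snr(strain):
    """Mean over standard deviation of strain in a region (common definition).

    Raises ZeroDivisionError for perfectly uniform strain.
    """
    return statistics.mean(strain) / statistics.pstdev(strain)


# Toy displacement that grows linearly with depth, giving constant strain.
print(axial_strain([0.0, 0.01, 0.02, 0.03], dz=1.0))
```

Finite differencing amplifies displacement noise, which is one reason learned estimators and consistency constraints are attractive for this problem.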

[160] MambaIO: Global-Coordinate Inertial Odometry for Pedestrians via Multi-Scale Frequency-Decoupled Modeling

Shanshan Zhang

Main category: cs.CV

TL;DR: MambaIO is a pedestrian inertial odometry method in the global coordinate frame that splits IMU measurements with a Laplacian pyramid, processing the low-frequency component with a Mamba architecture and the high-frequency component with convolutions, and achieves state-of-the-art localization performance.

Motivation: Recent studies in drone scenarios showed that body coordinate frame significantly improves localization accuracy compared to the widely adopted global frame, prompting re-evaluation for pedestrian inertial odometry.

Method: Decomposes IMU measurements into high-frequency and low-frequency components using Laplacian pyramid. Low-frequency component processed by Mamba architecture for contextual motion cues, high-frequency component handled by convolutional structure for fine-grained local motion details.

Result: Experiments on multiple public datasets show MambaIO substantially reduces localization error and achieves state-of-the-art performance.

Conclusion: After systematically evaluating the global coordinate frame for pedestrian inertial odometry, the paper shows that MambaIO's frequency-decoupled design substantially reduces localization error; the authors describe it as the first application of the Mamba architecture to the inertial odometry task.

Abstract: Inertial Odometry (IO) enables real-time localization using only acceleration and angular velocity measurements from an Inertial Measurement Unit (IMU), making it a promising solution for localization in consumer-grade applications. Traditionally, IMU measurements in IO have been processed under two coordinate system paradigms: the body coordinate frame and the global coordinate frame, with the latter being widely adopted. However, recent studies in drone scenarios have demonstrated that the body frame can significantly improve localization accuracy, prompting a re-evaluation of the suitability of the global frame for pedestrian IO. To address this issue, this paper systematically evaluates the effectiveness of the global coordinate frame in pedestrian IO through theoretical analysis, qualitative inspection, and quantitative experiments. Building upon these findings, we further propose MambaIO, which decomposes IMU measurements into high-frequency and low-frequency components using a Laplacian pyramid. The low-frequency component is processed by a Mamba architecture to extract implicit contextual motion cues, while the high-frequency component is handled by a convolutional structure to capture fine-grained local motion details. Experiments on multiple public datasets show that MambaIO substantially reduces localization error and achieves state-of-the-art (SOTA) performance. To the best of our knowledge, this is the first application of the Mamba architecture to the inertial odometry task.
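
The frequency split can be illustrated with a single pyramid level on a 1D signal: a smoothed copy serves as the low-frequency component and the residual as the high-frequency component, so the two always sum back to the input. This is a simplified sketch that uses a moving average and omits the downsampling of a full Laplacian pyramid. The filter, the radius, and the function names are assumptions rather than MambaIO's actual configuration.

```python
def smooth(signal, radius=2):
    """Moving average with the window truncated at the signal borders."""
    n = len(signal)
    out = []
    for i in range(n):
        window = signal[max(0, i - radius):min(n, i + radius + 1)]
        out.append(sum(window) / len(window))
    return out


def split_frequencies(signal, radius=2):
    """Return (low, high) where low is the smoothed signal and high = signal - low."""
    low = smooth(signal, radius)
    high = [x - l for x, l in zip(signal, low)]
    return low, high


# Toy accelerometer-like trace: slow ramp plus alternating jitter.
trace = [0.1 * i + (0.05 if i % 2 else -0.05) for i in range(8)]
low, high = split_frequencies(trace)
print([round(v, 3) for v in low])
print([round(v, 3) for v in high])
```

In the design described above, the `low` stream would feed the sequence model that captures long-range motion context, and the `high` stream would feed the convolutional branch that captures local detail.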

[161] INQUIRE-Search: A Framework for Interactive Discovery in Large-Scale Biodiversity Databases

Edward Vendrow, Julia Chae, Rupa Kurinchi-Vendhan, Isaac Eckert, Jazlynn Hall, Marta Jarzyna, Reymond Miyajima, Ruth Oliver, Laura Pollock, Lauren Schrack, Scott Yanco, Oisin Mac Aodha, Sara Beery

Main category: cs.CV

TL;DR: INQUIRE-Search is an open-source system that enables interactive natural language search within large biodiversity image databases, allowing scientists to rapidly discover and analyze ecological context from millions of images.

Motivation: Large biodiversity platforms like iNaturalist contain hundreds of millions of images with valuable ecological context (behaviors, interactions, phenology, habitat), but current methods rely on metadata filtering or manual inspection, making this information inaccessible at scale.

Method: Developed INQUIRE-Search - an open-source system that uses natural language processing to enable interactive searching within ecological image databases, allowing verification and export of relevant observations for scientific analysis.

Result: INQUIRE-Search takes a fraction of the time compared to traditional methods, enabling new scientific questions. Five case studies demonstrated diverse applications including seasonal behavior variation and forest regrowth after wildfires.

Conclusion: The tool represents a new paradigm for interactive, efficient, and scalable scientific discovery that unlocks previously inaccessible value in biodiversity datasets, requiring experts to reframe scientific priorities and develop novel methods for experiment design and uncertainty analysis.

Abstract: Large community science platforms such as iNaturalist contain hundreds of millions of biodiversity images that often capture ecological context on behaviors, interactions, phenology, and habitat. Yet most ecological workflows rely on metadata filtering or manual inspection, leaving this secondary information inaccessible at scale. We introduce INQUIRE-Search, an open-source system that enables scientists to rapidly and interactively search within an ecological image database for specific concepts using natural language, verify and export relevant observations, and utilize this discovered data for novel scientific analysis. Compared to traditional methods, INQUIRE-Search takes a fraction of the time, opening up new possibilities for scientific questions that can be explored. Through five case studies, we show the diversity of scientific applications that a tool like INQUIRE-Search can support, from seasonal variation in behavior across species to forest regrowth after wildfires. These examples demonstrate a new paradigm for interactive, efficient, and scalable scientific discovery that can begin to unlock previously inaccessible scientific value in large-scale biodiversity datasets. Finally, we emphasize that using such AI-enabled discovery tools for science calls for experts to reframe the priorities of the scientific process and develop novel methods for experiment design, data collection, survey effort, and uncertainty analysis.

[162] Hyperspectral Image Classification using Spectral-Spatial Mixer Network

Mohammed Q. Alkhatib

Main category: cs.CV

TL;DR: SS-MixNet is a lightweight deep learning model for hyperspectral image classification that combines 3D convolution with MLP-style mixer blocks and an attention mechanism, achieving the highest accuracy among the compared methods while using only 1% of labeled data for training and validation.

Motivation: To develop an effective hyperspectral image classification model that can work with limited labeled data while maintaining computational efficiency and capturing both local and long-range spectral-spatial dependencies.

Method: Integrates 3D convolutional layers for local spectral-spatial feature extraction with parallel MLP-style mixer blocks for long-range dependencies, plus depthwise convolution-based attention mechanism for enhanced discriminative capability.

Result: Achieved 95.68% and 93.86% overall accuracy on QUH-Tangdaowan and QUH-Qingyun datasets respectively, outperforming 2D-CNN, 3D-CNN, IP-SWIN, SimPoolFormer, and HybridKAN methods.

Conclusion: SS-MixNet effectively delivers accurate and robust hyperspectral image classification predictions with minimal supervision, demonstrating superior performance compared to existing methods.

Abstract: This paper introduces SS-MixNet, a lightweight and effective deep learning model for hyperspectral image (HSI) classification. The architecture integrates 3D convolutional layers for local spectral-spatial feature extraction with two parallel MLP-style mixer blocks that capture long-range dependencies in spectral and spatial dimensions. A depthwise convolution-based attention mechanism is employed to enhance discriminative capability with minimal computational overhead. The model is evaluated on the QUH-Tangdaowan and QUH-Qingyun datasets using only 1% of labeled data for training and validation. SS-MixNet achieves the highest performance among compared methods, including 2D-CNN, 3D-CNN, IP-SWIN, SimPoolFormer, and HybridKAN, reaching 95.68% and 93.86% overall accuracy on the Tangdaowan and Qingyun datasets, respectively. The results, supported by quantitative metrics and classification maps, confirm the model’s effectiveness in delivering accurate and robust predictions with limited supervision. The code will be made publicly available at: https://github.com/mqalkhatib/SS-MixNet

[163] First Frame Is the Place to Go for Video Content Customization

Jingxi Chen, Zongxia Li, Zhichao Liu, Guangyao Shi, Xiyang Wu, Fuxiao Liu, Cornelia Fermuller, Brandon Y. Feng, Yiannis Aloimonos

Main category: cs.CV

TL;DR: Video models treat the first frame as a conceptual memory buffer for storing visual entities that can be reused during generation, enabling robust video content customization with minimal training.

Motivation: To challenge the traditional view of the first frame as merely a spatial-temporal starting point and reveal its role as a conceptual memory buffer in video generation models.

Method: Leverage the insight that video models implicitly use the first frame as a memory buffer to achieve video content customization using only 20-50 training examples without architectural modifications or large-scale finetuning.

Result: Demonstrated robust and generalized video content customization across diverse scenarios using minimal training data.

Conclusion: Video generation models possess a powerful, overlooked capability for reference-based video customization through their implicit treatment of the first frame as a conceptual memory buffer.

Abstract: What role does the first frame play in video generation models? Traditionally, it’s viewed as the spatial-temporal starting point of a video, merely a seed for subsequent animation. In this work, we reveal a fundamentally different perspective: video models implicitly treat the first frame as a conceptual memory buffer that stores visual entities for later reuse during generation. Leveraging this insight, we show that it’s possible to achieve robust and generalized video content customization in diverse scenarios, using only 20-50 training examples without architectural changes or large-scale finetuning. This unveils a powerful, overlooked capability of video generation models for reference-based video customization.

[164] GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization

Yikun Wang, Zuyan Liu, Ziyi Wang, Pengfei Liu, Han Hu, Yongming Rao

Main category: cs.CV

TL;DR: GeoVista is an agentic model for geolocalization that integrates image zoom and web search tools within reasoning loops, trained with SFT and RL, achieving performance comparable to closed-source models.

Motivation: Current agentic visual reasoning focuses on image manipulation but lacks general-purpose capabilities for tasks requiring both visual grounding and web search, like geolocalization. Existing benchmarks don't support high-resolution imagery needed for deep agentic reasoning.

Method: Propose GeoVista agentic model with integrated tool invocation (image zoom and web search) in reasoning loops. Training pipeline includes SFT for reasoning patterns and tool-use priors, followed by RL with hierarchical rewards using multi-level geographical information.

Result: GeoVista greatly surpasses other open-source agentic models on geolocalization and achieves performance comparable to closed-source models like Gemini-2.5-flash and GPT-5 on most metrics.

Conclusion: The proposed GeoVista model with integrated tool use and hierarchical training effectively addresses geolocalization challenges, demonstrating strong performance against both open-source and closed-source models.

Abstract: Current research on agentic visual reasoning enables deep multimodal understanding but primarily focuses on image manipulation tools, leaving a gap toward more general-purpose agentic models. In this work, we revisit the geolocalization task, which requires not only nuanced visual grounding but also web search to confirm or refine hypotheses during reasoning. Since existing geolocalization benchmarks fail to meet the need for high-resolution imagery and the localization challenge for deep agentic reasoning, we curate GeoBench, a benchmark that includes photos and panoramas from around the world, along with a subset of satellite images of different cities to rigorously evaluate the geolocalization ability of agentic models. We also propose GeoVista, an agentic model that seamlessly integrates tool invocation within the reasoning loop, including an image-zoom-in tool to magnify regions of interest and a web-search tool to retrieve related web information. We develop a complete training pipeline for it, including a cold-start supervised fine-tuning (SFT) stage to learn reasoning patterns and tool-use priors, followed by a reinforcement learning (RL) stage to further enhance reasoning ability. We adopt a hierarchical reward to leverage multi-level geographical information and improve overall geolocalization performance. Experimental results show that GeoVista greatly surpasses other open-source agentic models on the geolocalization task and achieves performance comparable to closed-source models such as Gemini-2.5-flash and GPT-5 on most metrics.
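
As a rough illustration of what a hierarchical reward over multi-level geographical information could look like, the sketch below credits a prediction level by level, from coarse to fine. The three levels (country, region, city), the weights, and the names Location and hierarchical_reward are assumptions made for this example; the abstract does not specify the actual reward design used for GeoVista.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    country: str
    region: str
    city: str


# Hypothetical per-level weights (coarse to fine); not taken from the paper.
LEVEL_WEIGHTS = (("country", 0.2), ("region", 0.3), ("city", 0.5))


def hierarchical_reward(pred: Location, truth: Location) -> float:
    """Sum the weights of consecutive matching levels, coarse to fine."""
    reward = 0.0
    for level, weight in LEVEL_WEIGHTS:
        predicted = getattr(pred, level).strip().lower()
        expected = getattr(truth, level).strip().lower()
        if predicted != expected:
            break  # finer levels earn no credit once a coarser level is wrong
        reward += weight
    return reward


truth = Location("France", "Ile-de-France", "Paris")
partial = Location("France", "Occitanie", "Toulouse")
print(hierarchical_reward(truth, truth), hierarchical_reward(partial, truth))
```

Stopping at the first mismatch is one possible design choice: it prevents a model from collecting credit for a city name that happens to match in the wrong country.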

[165] RoMa v2: Harder Better Faster Denser Feature Matching

Johan Edstedt, David Nordström, Yushan Zhang, Georg Bökman, Jonathan Astermark, Viktor Larsson, Anders Heyden, Fredrik Kahl, Mårten Wadenbäck, Michael Felsberg

Main category: cs.CV

TL;DR: A new dense feature matching model that achieves state-of-the-art accuracy through architectural improvements, optimized training pipeline, and integration of DINOv3 foundation model.

Motivation: Existing dense matchers fail or perform poorly in hard real-world scenarios, and high-precision models are often too slow for practical applications.

Method: Novel matching architecture and loss, curated diverse training distribution, decoupled two-stage matching-then-refinement pipeline, custom CUDA kernel for memory optimization, and integration of DINOv3 foundation model.

Result: Significantly more accurate than predecessors, sets new state-of-the-art in dense feature matching across extensive experiments.

Conclusion: The proposed model successfully addresses key weaknesses in dense feature matching through systematic improvements, achieving superior performance and practical applicability.

Abstract: Dense feature matching aims to estimate all correspondences between two images of a 3D scene and has recently been established as the gold-standard due to its high accuracy and robustness. However, existing dense matchers still fail or perform poorly for many hard real-world scenarios, and high-precision models are often slow, limiting their applicability. In this paper, we attack these weaknesses on a wide front through a series of systematic improvements that together yield a significantly better model. In particular, we construct a novel matching architecture and loss, which, combined with a curated diverse training distribution, enables our model to solve many complex matching tasks. We further make training faster through a decoupled two-stage matching-then-refinement pipeline, and at the same time, significantly reduce refinement memory usage through a custom CUDA kernel. Finally, we leverage the recent DINOv3 foundation model along with multiple other insights to make the model more robust and unbiased. In our extensive set of experiments we show that the resulting novel matcher sets a new state-of-the-art, being significantly more accurate than its predecessors. Code is available at https://github.com/Parskatt/romav2

[166] Multi-source-free Domain Adaptation via Uncertainty-aware Adaptive Distillation

Yaxuan Song, Jianan Fan, Dongnan Liu, Weidong Cai

Main category: cs.CV

TL;DR: UAD is a method for multi-source-free unsupervised domain adaptation in medical imaging that uses uncertainty-aware knowledge distillation at model and instance levels to adapt to target domains without accessing source data.

Motivation: Existing source-free domain adaptation methods have limitations in medical contexts where data comes from multiple institutions with different equipment, requiring privacy-preserving adaptation without accessing source data.

Method: UAD uses uncertainty-aware adaptive distillation at two levels: model level for coordinated base model initialization and instance level for model adaptation guided by high-quality pseudo-labels.

Result: The method shows significant performance gains on two multi-center medical image diagnosis benchmarks compared to existing works.

Conclusion: UAD provides an effective solution for multi-source-free domain adaptation in medical imaging that handles data privacy concerns while achieving strong performance.

Abstract: Source-free domain adaptation (SFDA) alleviates the domain discrepancy among data obtained from different domains without accessing the source data, thereby preserving data privacy. However, conventional SFDA methods face inherent limitations in medical contexts, where medical data are typically collected from multiple institutions using various equipment. To address this problem, we propose a simple yet effective method, named Uncertainty-aware Adaptive Distillation (UAD), for the multi-source-free unsupervised domain adaptation (MSFDA) setting. UAD aims to perform well-calibrated knowledge distillation at (i) the model level, to deliver coordinated and reliable base model initialisation, and (ii) the instance level, via model adaptation guided by high-quality pseudo-labels, thereby obtaining a high-performance target domain model. To verify its general applicability, we evaluate UAD on two image-based diagnosis benchmarks across two multi-centre datasets, where our method shows a significant performance gain compared with existing works. The code is available at https://github.com/YXSong000/UAD.

[167] MK-SGN: A Spiking Graph Convolutional Network with Multimodal Fusion and Knowledge Distillation for Skeleton-based Action Recognition

Naichuan Zheng, Hailun Xia, Zeyu Liang, Yuchen Du

Main category: cs.CV

TL;DR: MK-SGN is a Spiking Graph Convolutional Network that combines SNN energy efficiency with GCN capabilities for skeleton-based action recognition, achieving >98% energy reduction while maintaining competitive accuracy.

Motivation: Address the high energy consumption of GCN-based methods for deployment on energy-constrained edge devices by leveraging the energy efficiency of Spiking Neural Networks.

Method: Proposes Spiking Multimodal Fusion (SMF) module, Self-Attention Spiking Graph Convolution (SA-SGC), Spiking Temporal Convolution (STC), and integrated knowledge distillation from GCN to SGN using intermediate-layer and soft-label distillation.

Result: Achieves >98% energy reduction compared to conventional GCN approaches while maintaining competitive recognition accuracy, surpassing both GCN frameworks in energy efficiency and SNN frameworks in accuracy.

Conclusion: Establishes a robust baseline for developing high-performance, energy-efficient SNN-based models for skeleton-based action recognition on edge devices.

Abstract: In recent years, multimodal Graph Convolutional Networks (GCNs) have achieved remarkable performance in skeleton-based action recognition. The reliance on high-energy-consuming continuous floating-point operations inherent in GCN-based methods poses significant challenges for deployment in energy-constrained, battery-powered edge devices. To address these limitations, MK-SGN, a Spiking Graph Convolutional Network with Multimodal Fusion and Knowledge Distillation, is proposed to leverage the energy efficiency of Spiking Neural Networks (SNNs) for skeleton-based action recognition for the first time. By integrating the energy-saving properties of SNNs with the graph representation capabilities of GCNs, MK-SGN achieves significant reductions in energy consumption while maintaining competitive recognition accuracy. Firstly, we formulate a Spiking Multimodal Fusion (SMF) module to effectively fuse multimodal skeleton data represented as spike-form features. Secondly, we propose the Self-Attention Spiking Graph Convolution (SA-SGC) module and the Spiking Temporal Convolution (STC) module to capture spatial relationships and temporal dynamics of spike-form features. Finally, we propose an integrated knowledge distillation strategy to transfer information from the multimodal GCN to the SGN, incorporating both intermediate-layer distillation and soft-label distillation to enhance the performance of the SGN. MK-SGN exhibits substantial advantages, surpassing state-of-the-art GCN frameworks in energy efficiency and outperforming state-of-the-art SNN frameworks in recognition accuracy. The proposed method achieves a remarkable reduction in energy consumption, exceeding 98% compared to conventional GCN-based approaches. This research establishes a robust baseline for developing high-performance, energy-efficient SNN-based models for skeleton-based action recognition.
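
For readers unfamiliar with soft-label distillation, the sketch below shows the commonly used formulation: a KL divergence between temperature-softened teacher and student class distributions, scaled by the squared temperature. This is the generic textbook loss, not necessarily the exact one used in MK-SGN; the function names, the temperature value, and the assumption that the spiking student exposes per-class logits (for example, from averaged firing rates) are all illustrative.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_label_kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions, times T^2."""
    p = softmax(teacher_logits, temperature)  # teacher (GCN) soft labels
    q = softmax(student_logits, temperature)  # student (SGN) predictions
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0.0)
    return temperature ** 2 * kl


teacher = [4.0, 1.0, 0.5]  # hypothetical logits for three action classes
student = [2.5, 1.5, 0.2]
print(soft_label_kd_loss(student, teacher))
```

The loss is zero when the two softened distributions coincide and grows as the student's class ranking drifts from the teacher's.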

[168] Unobtrusive Monitoring of Simulated Physical Weakness Using Fine-Grained Behavioral Features and Personalized Modeling

Chen Long-fei, Muhammad Ahmed Raza, Craig Innes, Subramanian Ramamoorthy, Robert B. Fisher

Main category: cs.CV

TL;DR: Non-intrusive camera system detects simulated weakness (induced by physical exercise in healthy subjects) by monitoring daily sitting activities with a Bayesian Network over body motion and inactivity features, achieving 0.97 accuracy at the daily level.

Motivation: Early detection of developing health issues in older adults is crucial, but subtle weakness-related changes in daily activities are challenging to detect due to their gradual nature.

Method: Use non-intrusive camera sensor to monitor daily sitting/relaxing activities, simulate weakness via physical exercise in healthy subjects, capture real-time body motion/inactivity features with privacy protection, and apply Bayesian Network modeling.

Result: 97% accuracy in distinguishing simulated weakness at daily level; identified optimal features include non-dominant upper body motion speed/scale and inactivity distribution with 300-second window; individual-specific models perform best.

Conclusion: Fine-grained behavioral monitoring can effectively detect weakness, but personalized models are needed as no universal optimal feature set works across all individuals.

Abstract: Aging and chronic conditions affect older adults’ daily lives, making early detection of developing health issues crucial. Weakness, common in many conditions, alters physical movements and daily activities subtly. However, detecting such changes can be challenging due to their subtle and gradual nature. To address this, we employ a non-intrusive camera sensor to monitor individuals’ daily sitting and relaxing activities for signs of weakness. We simulate weakness in healthy subjects by having them perform physical exercise and observing the behavioral changes in their daily activities before and after workouts. The proposed system captures fine-grained features related to body motion, inactivity, and environmental context in real-time while prioritizing privacy. A Bayesian Network is used to model the relationships between features, activities, and health conditions. We aim to identify specific features and activities that indicate such changes and determine the most suitable time scale for observing the change. Results show 0.97 accuracy in distinguishing simulated weakness at the daily level. Fine-grained behavioral features, including non-dominant upper body motion speed and scale, and inactivity distribution, along with a 300-second window, are found most effective. However, individual-specific models are recommended as no universal set of optimal features and activities was identified across all participants.

[169] Self Pre-training with Topology- and Spatiality-aware Masked Autoencoders for 3D Medical Image Segmentation

Pengfei Gu, Huimin Li, Yejia Zhang, Chaoli Wang, Danny Z. Chen

Main category: cs.CV

TL;DR: A novel MAE extension for 3D medical image segmentation that adds topological loss for geometric shape preservation and spatial position prediction to overcome limitations of standard MAEs in capturing spatial and geometric information.

Motivation: Existing MAE pre-training methods lack the ability to capture geometric shape and spatial information, which is critical for medical image segmentation tasks, especially in 3D medical imaging.

Method: Proposes four key innovations: (1) topological loss to preserve geometric shape information, (2) pre-text task predicting centers and corners of 3D crops for spatial information, (3) extending MAE to hybrid SOTA medical segmentation architecture, (4) fine-tuning with pre-trained ViT encoder and SOTA model.

Result: Extensive experiments on five public 3D segmentation datasets demonstrate the effectiveness of the proposed approach.

Conclusion: The novel MAE extension successfully addresses geometric shape and spatial information limitations in medical image segmentation, showing improved performance across multiple 3D medical datasets.

Abstract: Masked Autoencoders (MAEs) have been shown to be effective in pre-training Vision Transformers (ViTs) for natural and medical image analysis problems. By reconstructing missing pixel/voxel information in visible patches, a ViT encoder can aggregate contextual information for downstream tasks. But, existing MAE pre-training methods, which were specifically developed with the ViT architecture, lack the ability to capture geometric shape and spatial information, which is critical for medical image segmentation tasks. In this paper, we propose a novel extension of known MAEs for self pre-training (i.e., models pre-trained on the same target dataset) for 3D medical image segmentation. (1) We propose a new topological loss to preserve geometric shape information by computing topological signatures of both the input and reconstructed volumes, learning geometric shape information. (2) We introduce a pre-text task that predicts the positions of the centers and eight corners of 3D crops, enabling the MAE to aggregate spatial information. (3) We extend the MAE pre-training strategy to a hybrid state-of-the-art (SOTA) medical image segmentation architecture and co-pretrain it alongside the ViT. (4) We develop a fine-tuned model for downstream segmentation tasks by complementing the pre-trained ViT encoder with our pre-trained SOTA model. Extensive experiments on five public 3D segmentation datasets show the effectiveness of our new approach.
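
To make the position-prediction pre-text task in item (2) more concrete, the sketch below computes regression targets for one 3D crop: the normalized coordinates of its center and its eight corners within the full volume. The (z, y, x) ordering, the normalization to [0, 1], and the function name crop_position_targets are assumptions for illustration; the abstract does not state how the authors encode these positions.

```python
from itertools import product


def crop_position_targets(origin, crop_size, volume_size):
    """Return normalized (z, y, x) coordinates of a crop's center and 8 corners."""
    for o, c, v in zip(origin, crop_size, volume_size):
        if o < 0 or c <= 0 or o + c > v:
            raise ValueError("crop must lie inside the volume")
    center = tuple(
        (o + c / 2) / v for o, c, v in zip(origin, crop_size, volume_size)
    )
    # Each corner picks either the start (0) or the end (1) of the crop per axis.
    corners = [
        tuple(
            (o + bit * c) / v
            for o, c, v, bit in zip(origin, crop_size, volume_size, bits)
        )
        for bits in product((0, 1), repeat=3)
    ]
    return center, corners


center, corners = crop_position_targets((0, 0, 0), (64, 64, 64), (128, 128, 128))
print(center, len(corners))
```

A network head trained to regress these nine points from the masked crop has to infer where the crop sits in the anatomy, which is the kind of spatial cue the abstract says standard MAE reconstruction does not provide.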

[170] MaskRIS: Semantic Distortion-aware Data Augmentation for Referring Image Segmentation

Minhyun Lee, Seungho Lee, Song Park, Dongyoon Han, Byeongho Heo, Hyunjung Shim

Main category: cs.CV

TL;DR: MaskRIS is a novel training framework for Referring Image Segmentation that uses image and text masking with Distortion-aware Contextual Learning to improve model robustness and achieve state-of-the-art performance.

Motivation: Previous RIS studies focused on feature alignment but neglected training techniques like data augmentation. Conventional image augmentations degrade RIS performance, while simple masking shows promise.

Method: Proposes MaskRIS framework with image and text masking, followed by Distortion-aware Contextual Learning (DCL) to exploit masking benefits for handling occlusions and linguistic complexities.

Result: MaskRIS significantly improves RIS performance, works with various RIS models, and achieves state-of-the-art results on RefCOCO, RefCOCO+, and RefCOCOg datasets in both fully and weakly supervised settings.

Conclusion: MaskRIS demonstrates that effective data augmentation through masking strategies can substantially enhance RIS model robustness and performance, establishing new benchmarks for the task.

Abstract: Referring Image Segmentation (RIS) is an advanced vision-language task that involves identifying and segmenting objects within an image as described by free-form text descriptions. While previous studies focused on aligning visual and language features, training techniques such as data augmentation remain underexplored. In this work, we explore effective data augmentation for RIS and propose a novel training framework called Masked Referring Image Segmentation (MaskRIS). We observe that conventional image augmentations fall short for RIS, leading to performance degradation, while simple random masking significantly enhances RIS performance. MaskRIS uses both image and text masking, followed by Distortion-aware Contextual Learning (DCL) to fully exploit the benefits of the masking strategy. This approach can improve the model’s robustness to occlusions, incomplete information, and various linguistic complexities, resulting in a significant performance improvement. Experiments demonstrate that MaskRIS can easily be applied to various RIS models, outperforming existing methods in both fully supervised and weakly supervised settings. Finally, MaskRIS achieves new state-of-the-art performance on RefCOCO, RefCOCO+, and RefCOCOg datasets. Code is available at https://github.com/naver-ai/maskris.
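
The masking step itself is simple to picture. The sketch below applies the same random-masking helper to a list of image patches and to the tokens of a referring expression. The mask ratios, the "[MASK]" placeholder, the patch stand-ins, and the helper name random_mask are assumptions for illustration; the Distortion-aware Contextual Learning stage is not shown, and the official repository should be consulted for the actual implementation.

```python
import random


def random_mask(items, ratio, mask_value, rng):
    """Replace a random `ratio` fraction of items with `mask_value`."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be in [0, 1]")
    n_mask = round(len(items) * ratio)
    masked_idx = set(rng.sample(range(len(items)), n_mask))
    return [mask_value if i in masked_idx else item for i, item in enumerate(items)]


rng = random.Random(0)  # fixed seed so the example is reproducible
patches = [f"patch_{i}" for i in range(16)]  # stand-ins for a 4x4 grid of image patches
tokens = "the man in the red jacket on the left".split()

masked_patches = random_mask(patches, 0.5, None, rng)
masked_tokens = random_mask(tokens, 0.25, "[MASK]", rng)
print(masked_patches)
print(masked_tokens)
```

Unlike flips or crops, masking leaves the remaining pixels and words where they were, so a phrase such as "on the left" still agrees with the image.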

[171] Verb Mirage: Unveiling and Assessing Verb Concept Hallucinations in Multimodal Large Language Models

Zehao Wang, Xinpeng Liu, Xiaoqian Wu, Yudonglin Zhang, Zhou Fang, Yifan Fang, Junfu Pu, Cewu Lu, Yong-Lu Li

Main category: cs.CV

TL;DR: First investigation of verb hallucination in Multimodal Large Language Models (MLLMs), revealing severe issues that existing object-focused mitigation methods fail to address, with proposed verb knowledge-based tuning achieving significant reduction.

Motivation: While many methods address object/noun-related hallucinations in MLLMs, verb concepts crucial for action understanding have been overlooked, creating a significant gap in hallucination mitigation research.

Method: Proposed a novel rich verb knowledge-based tuning method specifically designed to mitigate verb hallucinations in MLLMs.

Result: Most state-of-the-art MLLMs suffer from severe verb hallucination, and existing object hallucination mitigation methods are ineffective for verb hallucinations. The proposed method significantly reduces verb-related hallucinations.

Conclusion: Verb hallucination is a critical but overlooked issue in MLLMs that requires specialized mitigation approaches, as object-focused methods are insufficient for addressing action-related hallucinations.

Abstract: Multimodal Large Language Models (MLLMs) have garnered significant attention recently and demonstrate outstanding capabilities in various tasks such as OCR, VQA, captioning, etc. However, hallucination remains a persistent issue. While numerous methods have been proposed to mitigate hallucinations, achieving notable improvements, these methods primarily focus on mitigating hallucinations about object/noun-related concepts. Verb concepts, crucial for understanding human actions, have been largely overlooked. In this paper, to the best of our knowledge, we are the first to investigate the verb hallucination phenomenon of MLLMs from various perspectives. Our findings reveal that most state-of-the-art MLLMs suffer from severe verb hallucination. To assess the effectiveness of existing mitigation methods for object concept hallucination on verb hallucination, we evaluated these methods and found that they do not effectively address verb hallucination. To address this issue, we propose a novel rich verb knowledge-based tuning method to mitigate verb hallucination. The experiment results demonstrate that our method significantly reduces hallucinations related to verbs.

[172] Efficient Document Image Dewarping via Hybrid Deep Learning and Cubic Polynomial Geometry Restoration

Valery Istomin, Oleg Pereziabov, Ilya Afanasyev

Main category: cs.CV

TL;DR: Hybrid document dewarping method combining YOLOv8 for detection with classical CV for geometry restoration, achieving superior OCR accuracy with lower computational cost than pure deep learning approaches.

Motivation: Camera-captured document images suffer from geometric distortions that reduce OCR accuracy, requiring efficient automated dewarping methods that balance accuracy with computational efficiency.

Method: Hybrid approach using YOLOv8 for document detection and segmentation, followed by classical computer vision techniques including cubic polynomial interpolation of boundaries and image remapping for geometry restoration.

Result: Achieves lowest median Character Error Rate (CER=0.0235), Levenshtein Distance (LD=27.8), and highest Jaro-Winkler similarity (JW=0.902), outperforming state-of-the-art methods and mobile applications while using fewer computational resources.

Conclusion: The hybrid methodology effectively restores document geometry with superior computational efficiency compared to deep learning approaches, making it suitable for resource-constrained applications while maintaining high-quality digitization.

Abstract: Camera-captured document images often suffer from geometric distortions caused by paper deformation, perspective distortion, and lens aberrations, significantly reducing OCR accuracy. This study develops an efficient automated method for document image dewarping that balances accuracy with computational efficiency. We propose a hybrid approach combining deep learning for document detection with classical computer vision for geometry restoration. YOLOv8 performs initial document segmentation and mask generation. Subsequently, classical CV techniques construct a topological 2D grid through cubic polynomial interpolation of document boundaries, followed by image remapping to correct nonlinear distortions. A new annotated dataset and open-source framework are provided to facilitate reproducibility and further research. Experimental evaluation against state-of-the-art methods (RectiNet, DocGeoNet, DocTr++) and mobile applications (DocScan, CamScanner, TapScanner) demonstrates superior performance. Our method achieves the lowest median Character Error Rate (CER=0.0235), Levenshtein Distance (LD=27.8), and highest Jaro–Winkler similarity (JW=0.902), approaching the quality of scanned originals. The approach requires significantly fewer computational resources and memory compared to pure deep learning solutions while delivering better OCR readability and geometry restoration quality. The proposed hybrid methodology effectively restores document geometry with computational efficiency superior to existing deep learning approaches, making it suitable for resource-constrained applications while maintaining high-quality document digitization. Project page: https://github.com/HorizonParadox/DRCCBI
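
The OCR metrics quoted above build on edit distance. As a minimal sketch, the code below computes the Levenshtein distance with the standard dynamic-programming recurrence and derives a Character Error Rate by dividing by the reference length. That normalization is the common definition and is assumed here; the abstract does not state the exact evaluation protocol, and the function names are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance between OCR output and ground truth, per reference character."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


print(character_error_rate("document", "docunent"))
```

Under this definition, a CER of 0.0235 corresponds to roughly two to three character errors per hundred reference characters.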

[173] Systematic Evaluation and Guidelines for Segment Anything Model in Surgical Video Analysis

Cheng Yuan, Jian Jiang, Kunyi Yang, Lv Wu, Rui Wang, Zi Meng, Haonan Ping, Ziyu Xu, Yifan Zhou, Wanli Song, Hesheng Wang, Qi Dou, Yutong Ban

Main category: cs.CV

TL;DR: First comprehensive evaluation of SAM2’s zero-shot capability for surgical video segmentation across 9 datasets and 17 surgery types, revealing notable adaptability in structured scenarios but performance gaps in dynamic surgical conditions.

Motivation: Surgical video segmentation is critical for AI but limited by annotated data. SAM2 offers potential for zero-shot surgical segmentation, but its applicability in complex surgical environments with tissue deformation and instrument variability remains unexplored.

Method: Comprehensive evaluation of SAM2’s zero-shot capability across 9 surgical datasets covering laparoscopic, endoscopic, and robotic procedures. Analyzed various prompting strategies (points, boxes, masks) and finetuning approaches (dense, sparse), plus robustness to surgical challenges and generalization across procedures.

Result: SAM2 demonstrates notable zero-shot adaptability in structured scenarios (instrument segmentation, multi-organ segmentation, scene segmentation), but performance varies under dynamic surgical conditions, highlighting gaps in handling temporal coherence and domain-specific artifacts.

Conclusion: Results highlight future pathways to adaptive data-efficient solutions for surgical data science, identifying areas where SAM2’s zero-shot capabilities succeed and where improvements are needed for complex surgical environments.

Abstract: Surgical video segmentation is critical for AI to interpret spatial-temporal dynamics in surgery, yet model performance is constrained by limited annotated data. The SAM2 model, pretrained on natural videos, offers potential for zero-shot surgical segmentation, but its applicability in complex surgical environments, with challenges like tissue deformation and instrument variability, remains unexplored. We present the first comprehensive evaluation of the zero-shot capability of SAM2 in 9 surgical datasets (17 surgery types), covering laparoscopic, endoscopic, and robotic procedures. We analyze various prompting (points, boxes, mask) and finetuning (dense, sparse) strategies, robustness to surgical challenges, and generalization across procedures and anatomies. Key findings reveal that while SAM2 demonstrates notable zero-shot adaptability in structured scenarios (e.g., instrument segmentation, multi-organ segmentation, and scene segmentation), its performance varies under dynamic surgical conditions, highlighting gaps in handling temporal coherence and domain-specific artifacts. These results highlight future pathways to adaptive data-efficient solutions for the surgical data science field.

[174] FireCastNet: Earth-as-a-Graph for Seasonal Fire Prediction

Dimitrios Michail, Charalampos Davalas, Konstantinos Chafis, Lefki-Ioanna Panagiotou, Ioannis Prapas, Spyros Kondylatos, Nikolaos Ioannis Bountos, Ioannis Papoutsis

Main category: cs.CV

TL;DR: FireCastNet is a deep learning model combining 3D CNNs and Graph Neural Networks for global wildfire prediction up to 6 months in advance, outperforming state-of-the-art methods.

Motivation: Accurate seasonal wildfire forecasting is critical for disaster preparedness and ecosystem management due to intensifying fire weather conditions from climate change.

Method: Uses 3D convolutional encoding with GraphCast-based GNNs to model spatio-temporal dependencies, treating Earth as an interconnected graph. Leverages SeasFire dataset with climate, vegetation, and human variables.

Result: Superior performance in global burned area forecasting, especially in fire-prone regions (Africa, South America, Southeast Asia). Longer input time-series improve robustness, and spatial context improves performance at extended forecasting horizons.

Conclusion: Modeling Earth system interactions is crucial for long-term wildfire prediction. Local area modeling provides enhanced resolution for region-specific forecasts.

Abstract: With climate change intensifying fire weather conditions globally, accurate seasonal wildfire forecasting has become critical for disaster preparedness and ecosystem management. We introduce FireCastNet, a novel deep learning architecture that combines 3D convolutional encoding with GraphCast-based Graph Neural Networks (GNNs) to model complex spatio-temporal dependencies for global wildfire prediction. Our approach leverages the SeasFire dataset, a comprehensive multivariate Earth system datacube containing climate, vegetation, and human-related variables, to forecast burned area patterns up to six months in advance. FireCastNet treats the Earth as an interconnected graph, enabling it to capture both local fire dynamics and long-range teleconnections that influence wildfire behavior across different spatial and temporal scales. Through comprehensive benchmarking against state-of-the-art models including GRU, Conv-GRU, Conv-LSTM, U-TAE, and TeleViT, we demonstrate that FireCastNet achieves superior performance in global burned area forecasting, with particularly strong results in fire-prone regions such as Africa, South America, and Southeast Asia. Our analysis reveals that longer input time-series significantly improve prediction robustness, while spatial context integration enhances model performance across extended forecasting horizons. Additionally, we implement local area modeling techniques that provide enhanced spatial resolution and accuracy for region-specific predictions. These findings highlight the importance of modeling Earth system interactions for long-term wildfire prediction.

[175] Scale-invariant brain morphometry: application to sulcal depth

Maxime Dieudonné, Guillaume Auzias, Julien Lefèvre

Main category: cs.CV

TL;DR: This paper provides the first quantitative analysis of brain size’s influence on sulcal depth, introduces a novel scale-invariant sulcal depth estimation method, and validates it using 1,987 subjects across development.

Motivation: To understand how global brain size influences cortical surface morphometry features, particularly sulcal depth, which has been understudied despite its importance in basic research and clinical applications.

Method: Developed a novel scale-invariant method for sulcal depth estimation based on an original formalization of the problem, with validation framework and benchmark data.

Result: Provided first quantitative analysis of brain size’s effect on sulcal depth measurements and demonstrated biological relevance across developmental stages.

Conclusion: The study offers important contributions to understanding cortical geometry by addressing brain size effects on sulcal depth and providing a validated, scale-invariant measurement approach.

Abstract: The geometry of the human cortex is complex and highly variable, with interactions between brain size, cortical folding, and age well-documented in the literature. However, few studies have explored how global brain size influences morphometry features of the cortical surface derived from anatomical MRI. In this work, we focus on sulcal depth, an imaging phenotype that has gained attention in both basic research and clinical applications. We make key contributions to the field by: 1) providing the first quantitative analysis of the influence of brain size on sulcal depth measurements; 2) introducing a novel, scale-invariant method for sulcal depth estimation based on an original formalization of the problem; 3) presenting a validation framework and sharing our code and benchmark data with the community; and 4) demonstrating the biological relevance of our new sulcal depth measure using a large sample of 1,987 subjects spanning the developmental period from 26 weeks post-conception to adulthood.

[176] FQ-PETR: Fully Quantized Position Embedding Transformation for Multi-View 3D Object Detection

Jiangyong Yu, Changyong Shu, Sifan Zhou, Zichen Yu, Xing Hu, Yan Chen, Dawei Yang

Main category: cs.CV

TL;DR: FQ-PETR is a fully quantized framework for PETR-based 3D detection models that addresses quantization challenges through three innovations: QFPE for embedding alignment, DULUT for non-linear function approximation, and QANS for attention stabilization, achieving near-floating-point accuracy with 75% latency reduction.

Motivation: PETR models excel in 3D detection but face deployment challenges due to high computational cost and memory footprint. Direct quantization causes severe accuracy degradation due to multi-modal feature disparity and inefficient non-linear operator quantization.

Method: Three key innovations: (1) QFPE replaces multi-point sampling with LiDAR-guided single-point sampling and anchor-based embedding, (2) DULUT approximates non-linear functions with cascaded linear lookup tables, (3) QANS performs quantization after softmax numerical stabilization.

Result: FQ-PETR achieves near-floating-point accuracy (only 1% degradation) under W8A8 quantization while reducing latency by up to 75%, significantly outperforming existing PTQ and QAT baselines across PETR variants.

Conclusion: FQ-PETR successfully enables efficient deployment of PETR-based 3D detection models through a comprehensive quantization framework that addresses key challenges in multi-modal feature alignment and non-linear operator quantization.

Abstract: Camera-based multi-view 3D detection is crucial for autonomous driving. PETR and its variants (PETRs) excel in benchmarks but face deployment challenges due to high computational cost and memory footprint. Quantization is an effective technique for compressing deep neural networks by reducing the bit width of weights and activations. However, directly applying existing quantization methods to PETRs leads to severe accuracy degradation. This issue primarily arises from two key challenges: (1) significant magnitude disparity between multi-modal features, specifically image features and camera-ray positional embeddings (PE), and (2) the inefficiency and approximation error of quantizing non-linear operators, which commonly rely on hardware-unfriendly computations. In this paper, we propose FQ-PETR, a fully quantized framework for PETRs, featuring three key innovations: (1) Quantization-Friendly LiDAR-ray Position Embedding (QFPE): Replacing multi-point sampling with LiDAR-prior-guided single-point sampling and anchor-based embedding eliminates problematic non-linearities (e.g., inverse-sigmoid) and aligns PE scale with image features, preserving accuracy. (2) Dual-Lookup Table (DULUT): This algorithm approximates complex non-linear functions using two cascaded linear LUTs, achieving high fidelity with minimal entries and no specialized hardware. (3) Quantization After Numerical Stabilization (QANS): Performing quantization after softmax numerical stabilization mitigates attention distortion from large inputs. On PETRs (e.g., PETR, StreamPETR, PETRv2, MV2d), FQ-PETR under W8A8 achieves near-floating-point accuracy (1% degradation) while reducing latency by up to 75%, significantly outperforming existing PTQ and QAT baselines.
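The QANS idea in the abstract above (quantize attention logits only after the max-subtraction used for numerically stable softmax) can be illustrated with a small Python sketch. This is not the paper's implementation: the symmetric 8-bit quantizer, the step size `scale`, the example logits, and all function names are assumptions chosen for illustration.

```python
import math

def quantize(values, scale, qmin=-128, qmax=127):
    """Uniform signed 8-bit quantization with clipping (hypothetical helper)."""
    return [max(qmin, min(qmax, round(v / scale))) for v in values]

def softmax(values):
    """Floating-point reference softmax with max-subtraction."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_quantize_before(logits, scale):
    """Quantize raw logits first: large logits are clipped to the same value."""
    dequantized = [q * scale for q in quantize(logits, scale)]
    return softmax(dequantized)

def softmax_quantize_after_stabilization(logits, scale):
    """Subtract the maximum first so inputs lie in (-inf, 0], then quantize."""
    m = max(logits)
    q = quantize([v - m for v in logits], scale)
    exps = [math.exp(x * scale) for x in q]
    total = sum(exps)
    return [e / total for e in exps]

logits = [40.0, 38.0, 10.0]   # assumed example with large magnitudes
scale = 0.1                   # assumed step; representable range is about [-12.8, 12.7]
reference = softmax(logits)
before = softmax_quantize_before(logits, scale)
after = softmax_quantize_after_stabilization(logits, scale)
```

In this toy setting the two largest logits both saturate when quantized directly, so their attention weights become equal, whereas after stabilization only the negligible tail is clipped and the result stays close to the floating-point reference.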

[177] Interpretable Retinal Disease Prediction Using Biology-Informed Heterogeneous Graph Representations

Laurin Lux, Alexander H. Berger, Maria Romeo Tricas, Richard Rosen, Alaa E. Fayed, Sobha Sivaprasada, Linus Kreitner, Jonas Weidner, Martin J. Menten, Daniel Rueckert, Johannes C. Paetzold

Main category: cs.CV

TL;DR: A biology-informed graph neural network method for diabetic retinopathy staging that outperforms existing models while providing detailed, interpretable explanations of retinal vessel segments and intercapillary areas.

Motivation: To bridge the gap between high-performing but uninterpretable neural networks and interpretable but lower-performing biomarker-based classifiers in medical diagnostics, particularly for diabetic retinopathy staging from OCTA images.

Method: Uses a novel biology-informed heterogeneous graph representation modeling retinal vessel segments, intercapillary areas, and FAZ, framed as a graph-level classification task solved with an efficient graph neural network.

Result: Outperforms all established baselines (biomarker-based classifiers, CNNs, vision transformers) on two datasets, with unprecedented detail in localizing critical vessels and intercapillary areas.

Conclusion: The approach successfully develops an interpretable clinical decision-support tool for ophthalmology that maintains high performance while providing human-understandable explanations.

Abstract: Interpretability is crucial to enhance trust in machine learning models for medical diagnostics. However, most state-of-the-art image classifiers based on neural networks are not interpretable. As a result, clinicians often resort to known biomarkers for diagnosis, although biomarker-based classification typically performs worse than large neural networks. This work proposes a method that surpasses the performance of established machine learning models while simultaneously improving prediction interpretability for diabetic retinopathy staging from optical coherence tomography angiography (OCTA) images. Our method is based on a novel biology-informed heterogeneous graph representation that models retinal vessel segments, intercapillary areas, and the foveal avascular zone (FAZ) in a human-interpretable way. This graph representation allows us to frame diabetic retinopathy staging as a graph-level classification task, which we solve using an efficient graph neural network. We benchmark our method against well-established baselines, including classical biomarker-based classifiers, convolutional neural networks (CNNs), and vision transformers. Our model outperforms all baselines on two datasets. Crucially, we use our biology-informed graph to provide explanations of unprecedented detail. Our approach surpasses existing methods in precisely localizing and identifying critical vessels or intercapillary areas. In addition, we give informative and human-interpretable attributions to critical characteristics. Our work contributes to the development of clinical decision-support tools in ophthalmology.

[178] Class-Aware PillarMix: Can Mixed Sample Data Augmentation Enhance 3D Object Detection with Radar Point Clouds?

Miao Zhang, Sherif Abdulatif, Benedikt Loesch, Marco Altmann, Bin Yang

Main category: cs.CV

TL;DR: CAPMix is a novel mixed sample data augmentation method for radar point clouds that applies MixUp at pillar level with class-aware mixing ratios, addressing challenges in radar data like irregular distribution and sparsity.

Motivation: Existing MSDA methods mainly target LiDAR data and face challenges when applied to radar point clouds due to irregular angular distribution, multi-radar layout deviations, and point sparsity.

Method: Proposes Class-Aware PillarMix (CAPMix) that applies MixUp at pillar level with independent mixing ratios per pillar. Uses class-specific distributions: favoring points from other samples for dense objects, and preserving more original points for sparse objects.

Result: Significantly boosts performance and outperforms existing MSDA approaches across Bosch Street and K-Radar datasets.

Conclusion: CAPMix is a straightforward yet effective approach that generates more diverse training data for radar point clouds and should spark further investigation into MSDA techniques for radar data.

Abstract: Due to the significant effort required for data collection and annotation in 3D perception tasks, mixed sample data augmentation (MSDA) has been widely studied to generate diverse training samples by mixing existing data. Recently, many MSDA techniques have been developed for point clouds, but they mainly target LiDAR data, leaving their application to radar point clouds largely unexplored. In this paper, we examine the feasibility of applying existing MSDA methods to radar point clouds and identify several challenges in adapting these techniques. These obstacles stem from the radar’s irregular angular distribution, deviations from a single-sensor polar layout in multi-radar setups, and point sparsity. To address these issues, we propose Class-Aware PillarMix (CAPMix), a novel MSDA approach that applies MixUp at the pillar level in 3D point clouds, guided by class labels. Unlike methods that apply a single mix ratio to the entire sample, CAPMix assigns an independent ratio to each pillar, boosting sample diversity. To account for the density of different classes, we use class-specific distributions: for dense objects (e.g., large vehicles), we skew ratios to favor points from another sample, while for sparse objects (e.g., pedestrians), we sample more points from the original. This class-aware mixing retains critical details and enriches each sample with new information, ultimately generating more diverse training data. Experimental results demonstrate that our method not only significantly boosts performance but also outperforms existing MSDA approaches across two datasets (Bosch Street and K-Radar). We believe that this straightforward yet effective approach will spark further investigation into MSDA techniques for radar data.
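One plausible reading of the pillar-level mixing described above is sketched below in Python. It is an illustration under stated assumptions, not the authors' code: the pillar size, the Beta-distribution parameters, the class names, and the per-point keep/drop rule are all hypothetical.

```python
import random
from collections import defaultdict

PILLAR_SIZE = 1.0  # assumed pillar edge length in metres

# Assumed Beta(a, b) parameters for the fraction of ORIGINAL points kept per pillar.
# Dense classes lean toward the other sample, sparse classes toward the original.
CLASS_BETA = {
    "truck": (2.0, 5.0),
    "pedestrian": (5.0, 2.0),
    "background": (2.0, 2.0),
}

def pillar_index(point):
    """Map an (x, y, z) point to the index of its vertical pillar."""
    return (int(point[0] // PILLAR_SIZE), int(point[1] // PILLAR_SIZE))

def group_by_pillar(points):
    pillars = defaultdict(list)
    for p in points:
        pillars[pillar_index(p)].append(p)
    return pillars

def pillar_mix(points_a, points_b, pillar_class, rng):
    """Mix two point clouds pillar by pillar, drawing one ratio per pillar.

    pillar_class maps a pillar index to the class label of the original
    sample (points_a) in that pillar; unlabeled pillars use "background".
    """
    pa, pb = group_by_pillar(points_a), group_by_pillar(points_b)
    mixed = []
    for key in sorted(set(pa) | set(pb)):
        a, b = CLASS_BETA[pillar_class.get(key, "background")]
        lam = rng.betavariate(a, b)  # independent mixing ratio for this pillar
        mixed += [p for p in pa.get(key, []) if rng.random() < lam]
        mixed += [p for p in pb.get(key, []) if rng.random() >= lam]
    return mixed

rng = random.Random(0)
sample_a = [(0.2, 0.3, 0.0), (0.4, 0.1, 0.5), (3.5, 2.2, 0.1)]
sample_b = [(0.6, 0.7, 0.0), (5.1, 5.2, 0.3)]
labels = {(0, 0): "truck", (3, 2): "pedestrian"}
mixed_cloud = pillar_mix(sample_a, sample_b, labels, rng)
```

Drawing the ratio inside the loop is what distinguishes this from sample-level MixUp, where a single ratio would be drawn once for the whole point cloud.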

[179] One Latent Space to Rule All Degradations: Unifying Restoration Knowledge for Image Fusion

Haolong Ma, Hui Li, Chunyang Cheng, Zeyang Zhang, Xiaoqing Luo, Xiaoning Song, Xiao-Jun Wu

Main category: cs.CV

TL;DR: LURE is a degradation-aware fusion model that learns unified latent representations for infrared and visible image fusion, outperforming SOTA methods by avoiding end-to-end learning dependencies and leveraging real-world restoration datasets.

Motivation: Current All-in-One Degradation-Aware Fusion Models (ADFMs) rely on end-to-end learning and synthetic datasets. This coarse learning strategy and the lack of real-world data limit their performance and lead to low-quality results.

Method: LURE learns a Unified Latent Feature Space (ULFS) to avoid complex data format dependencies, uses a novel loss function for stable representation learning, incorporates real-world image restoration datasets, and employs internal residual blocks to enhance representation capability.

Result: Experiments show LURE outperforms state-of-the-art methods across general fusion, degradation-aware fusion, and downstream tasks.

Conclusion: LURE provides an effective solution for degradation-aware image fusion by learning unified latent representations, achieving superior performance while seamlessly integrating with real-world datasets.

Abstract: All-in-One Degradation-Aware Fusion Models (ADFMs), a class of multi-modal image fusion models, aim to address complex scenes by mitigating degradations in source images and generating high-quality fused images. Mainstream ADFMs rely on end-to-end learning and heavily synthesized datasets to achieve degradation awareness and fusion. This coarse learning strategy and the dependence on non-real-world datasets often limit their upper-bound performance, leading to low-quality results. To address these limitations, we present LURE, a Learning-driven Unified REpresentation model for infrared and visible image fusion, which is degradation-aware. LURE learns a Unified Latent Feature Space (ULFS) to avoid the dependency on complex data formats inherent in previous end-to-end learning pipelines. It further improves image fusion quality by leveraging the intrinsic relationships between multi-modalities. A novel loss function is also proposed to make the learning of unified latent representations more stable. More importantly, LURE seamlessly incorporates existing high-quality real-world image restoration datasets. To further enhance the model’s representation capability, we design a simple yet effective structure, termed internal residual block, to facilitate the learning of latent features. Experiments show our method outperforms state-of-the-art (SOTA) methods across general fusion, degradation-aware fusion, and downstream tasks. The code is available in the supplementary materials.

[180] Other Vehicle Trajectories Are Also Needed: A Driving World Model Unifies Ego-Other Vehicle Trajectories in Video Latent Space

Jian Zhu, Zhengyu Jia, Tian Gao, Jiaxin Deng, Shidi Li, Lang Zhang, Fu Liu, Peng Jia, Xianpeng Lang

Main category: cs.CV

TL;DR: EOT-WM is a driving world model that unifies ego-other vehicle trajectories in videos for realistic driving simulation, addressing limitations of existing models that only control ego vehicle trajectories.

Motivation: Existing world models focus only on ego vehicle trajectories, making other vehicles uncontrollable and hindering realistic simulation of vehicle interactions in driving scenarios.

Method: Projects BEV trajectories to image coordinates for vehicle-trajectory matching, uses Spatial-Temporal VAE for alignment, and employs trajectory-injected diffusion Transformer for video generation guided by unified ego-other trajectories.

Result: Outperforms state-of-the-art by 30% in FID and 55% in FVD on nuScenes dataset, and can predict unseen driving scenes with self-produced trajectories.

Conclusion: EOT-WM successfully enables unified control of both ego and other vehicle trajectories for realistic driving simulation and interaction modeling.

Abstract: Advanced end-to-end autonomous driving systems predict other vehicles’ motions and plan the ego vehicle’s trajectory. World models that can foresee the outcome of a trajectory have been used to evaluate autonomous driving systems. However, existing world models predominantly emphasize the trajectory of the ego vehicle and leave other vehicles uncontrollable. This limitation hinders their ability to realistically simulate the interaction between the ego vehicle and the driving scenario. In this paper, we propose a driving World Model named EOT-WM, unifying Ego-Other vehicle Trajectories in videos for driving simulation. Specifically, it remains a challenge to match multiple trajectories in the BEV space with each vehicle in the video to control the video generation. We first project ego-other vehicle trajectories in the BEV space into image coordinates to match vehicles with trajectories via pixel positions. Then, trajectory videos are encoded by the Spatial-Temporal Variational Auto Encoder to align with driving video latents spatially and temporally in the unified visual space. A trajectory-injected diffusion Transformer is further designed to denoise the noisy video latents for video generation with the guidance of ego-other vehicle trajectories. In addition, we propose a metric based on control latent similarity to evaluate the controllability of trajectories. Extensive experiments are conducted on the nuScenes dataset, and the proposed model outperforms the state-of-the-art method by 30% in FID and 55% in FVD. The model can also predict unseen driving scenes with self-produced trajectories.
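The first step above, projecting BEV trajectories into image coordinates, can be illustrated with a standard pinhole-camera projection. The sketch below is a simplification under stated assumptions, not the paper's pipeline: it assumes a forward-facing camera with no rotation relative to a flat ground plane, and the intrinsics (`fx`, `fy`, `cx`, `cy`), the camera height, and the waypoints are made-up values.

```python
def project_bev_to_image(trajectory, fx, fy, cx, cy, cam_height):
    """Project ground-plane BEV waypoints (lateral, forward), in metres, to pixels.

    Assumed camera axes: x to the right, y downward, z forward. A ground point
    therefore sits cam_height metres below the optical centre (y = cam_height).
    """
    pixels = []
    for lateral, forward in trajectory:
        if forward <= 0:
            continue  # behind the camera, not visible
        u = fx * lateral / forward + cx
        v = fy * cam_height / forward + cy
        pixels.append((u, v))
    return pixels

# Hypothetical ego path: straight ahead, then drifting slightly to the right.
ego_path = [(0.0, 5.0), (0.0, 10.0), (0.5, 20.0)]
pixels = project_bev_to_image(
    ego_path, fx=1000.0, fy=1000.0, cx=800.0, cy=450.0, cam_height=1.5
)
```

Waypoints farther ahead land closer to the horizon row `cy`, which is why a trajectory drawn in the image converges toward the vanishing point. A real multi-camera setup would additionally apply each camera's extrinsic rotation and translation before this step.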

[181] Integration of nested cross-validation, automated hyperparameter optimization, high-performance computing to reduce and quantify the variance of test performance estimation of deep learning models

Paul Calle, Averi Bates, Justin C. Reynolds, Yunlong Liu, Haoyang Cui, Sinaro Ly, Chen Wang, Qinghao Zhang, Alberto J. de Armendi, Shashank S. Shettar, Kar Ming Fung, Qinggong Tang, Chongle Pan

Main category: cs.CV

TL;DR: NACHOS is a framework that integrates nested cross-validation and automated hyperparameter optimization using HPC to reduce variance in DL model performance estimation for medical imaging, with DACHOS extending it for deployment.

Motivation: Current approaches using single fixed test sets fail to quantify variance in performance metrics, compromising trustworthiness for real-world deployment of medical imaging DL models.

Method: Integrates Nested Cross-Validation and Automated Hyperparameter Optimization within a parallelized high-performance computing framework, demonstrated on chest X-ray and OCT datasets under multiple data partitioning schemes.

Result: NCV quantifies and reduces estimation variance, AHPO optimizes hyperparameters consistently across test folds, and HPC ensures computational feasibility.

Conclusion: NACHOS and DACHOS provide a scalable, reproducible, and trustworthy framework for DL model evaluation and deployment in medical imaging, with open-source code available.

Abstract: Background and Objectives: The variability and biases in the real-world performance benchmarking of deep learning models for medical imaging compromise their trustworthiness for real-world deployment. The common approach of holding out a single fixed test set fails to quantify the variance in the estimation of test performance metrics. This study introduces NACHOS (Nested and Automated Cross-validation and Hyperparameter Optimization using Supercomputing) to reduce and quantify the variance of test performance metrics of deep learning models. Methods: NACHOS integrates Nested Cross-Validation (NCV) and Automated Hyperparameter Optimization (AHPO) within a parallelized high-performance computing (HPC) framework. NACHOS was demonstrated on a chest X-ray repository and an Optical Coherence Tomography (OCT) dataset under multiple data partitioning schemes. Beyond performance estimation, DACHOS (Deployment with Automated Cross-validation and Hyperparameter Optimization using Supercomputing) is introduced to leverage AHPO and cross-validation to build the final model on the full dataset, improving expected deployment performance. Results: The findings underscore the importance of NCV in quantifying and reducing estimation variance, AHPO in optimizing hyperparameters consistently across test folds, and HPC in ensuring computational feasibility. Conclusions: By integrating these methodologies, NACHOS and DACHOS provide a scalable, reproducible, and trustworthy framework for DL model evaluation and deployment in medical imaging. To maximize public availability, the full open-source codebase is provided at https://github.com/thepanlab/NACHOS
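Nested cross-validation, the core idea behind the Methods above, can be sketched in a few lines of Python. This is a toy illustration, not the NACHOS codebase: the shrunken-mean "model", the hyperparameter grid `alphas`, the fold counts, and the synthetic data are all assumptions, and the loops run sequentially rather than in parallel on HPC.

```python
import random
import statistics

def k_folds(indices, k):
    """Split a list of indices into k roughly equal folds."""
    return [indices[i::k] for i in range(k)]

def fit_predict(train_y, alpha):
    """Toy model: predict the training mean shrunk toward zero by alpha."""
    return statistics.fmean(train_y) / (1.0 + alpha)

def mse(pred, ys):
    return statistics.fmean((y - pred) ** 2 for y in ys)

def nested_cv(y, alphas, outer_k=5, inner_k=4, seed=0):
    """Return mean and standard deviation of test error across outer folds."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    outer = k_folds(idx, outer_k)
    outer_scores = []
    for i, test_idx in enumerate(outer):
        dev_idx = [j for f, fold in enumerate(outer) if f != i for j in fold]
        inner = k_folds(dev_idx, inner_k)

        def inner_error(alpha):
            # Inner loop: validation error averaged over inner folds only.
            errs = []
            for v, val_idx in enumerate(inner):
                train_idx = [j for f, fold in enumerate(inner) if f != v for j in fold]
                pred = fit_predict([y[j] for j in train_idx], alpha)
                errs.append(mse(pred, [y[j] for j in val_idx]))
            return statistics.fmean(errs)

        best_alpha = min(alphas, key=inner_error)
        # Outer loop: refit on all development data, score on the untouched test fold.
        pred = fit_predict([y[j] for j in dev_idx], best_alpha)
        outer_scores.append(mse(pred, [y[j] for j in test_idx]))
    return statistics.fmean(outer_scores), statistics.stdev(outer_scores)

rng = random.Random(1)
y = [rng.gauss(5.0, 1.0) for _ in range(40)]  # synthetic targets
mean_err, sd_err = nested_cv(y, alphas=[0.0, 0.5, 1.0])
```

Because every sample serves as test data exactly once and the hyperparameter is never chosen using the test fold, the spread of the outer-fold scores gives a variance estimate that a single fixed test set cannot provide.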

[182] AdCare-VLM: Towards a Unified and Pre-aligned Latent Representation for Healthcare Video Understanding

Md Asaduzzaman Jabin, Hanqi Jiang, Yiwei Li, Patrick Kaggwa, Eugene Douglass, Juliet N. Sekandi, Tianming Liu

Main category: cs.CV

TL;DR: AdCare-VLM is a specialized multimodal vision-language model for medication adherence monitoring using patient videos, achieving 3.1-3.54% improvement over existing methods on tuberculosis medication monitoring tasks.

Motivation: Chronic diseases require strict medication adherence, but adherence is often compromised by patient behavior, costs, and healthcare infrastructure limitations. There's a need for automated monitoring systems to improve adherence detection.

Method: Developed AdCare-VLM based on LLaVA architecture with unified visual latent space pre-alignment. Used 806 custom-annotated TB medication monitoring videos to fine-tune the model for adherence pattern detection. Created LLM-TB-VQA dataset with positive, negative, and ambiguous adherence cases.

Result: Outperformed parameter-efficient fine-tuning enabled VLM models (LLaVA-V1.5 and Chat-UniVi) with absolute improvements of 3.1% to 3.54% across various configurations. Successfully identified correlations between visual features (face visibility, medication, water intake) and medical concepts.

Conclusion: The proposed method effectively integrates aligned visual-linguistic representations for medication adherence monitoring, providing improved performance and interpretability through attention map visualizations.

Abstract: Chronic diseases, including diabetes, hypertension, asthma, HIV-AIDS, epilepsy, and tuberculosis, necessitate rigorous adherence to medication to avert disease progression, manage symptoms, and decrease mortality rates. Adherence is frequently undermined by factors including patient behavior, caregiver support, elevated medical costs, and insufficient healthcare infrastructure. We propose AdCare-VLM, a specialized LLaVA-based multimodal large vision language model (LVLM) by introducing a unified visual latent space with pre-alignment to facilitate visual question answering (VQA) concerning medication adherence through patient videos. We employ a private dataset comprising 806 custom-annotated tuberculosis (TB) medication monitoring videos, which have been labeled by clinical experts, to fine-tune the model for adherence pattern detection. We present LLM-TB-VQA, a detailed medical adherence VQA dataset that encompasses positive, negative, and ambiguous adherence cases. Our method identifies correlations between visual features, such as the clear visibility of the patient’s face, medication, water intake, and the act of ingestion, and their associated medical concepts in captions. This facilitates the integration of aligned visual-linguistic representations and improves multimodal interactions. Experimental results indicate that our method surpasses parameter-efficient fine-tuning (PEFT) enabled VLM models, such as LLaVA-V1.5 and Chat-UniVi, with absolute improvements ranging from 3.1% to 3.54% across pre-trained, regular, and low-rank adaptation (LoRA) configurations. Comprehensive ablation studies and attention map visualizations substantiate our approach, enhancing interpretability.

[183] InvFusion: Bridging Supervised and Zero-shot Diffusion for Inverse Problems

Noam Elata, Hyungjin Chung, Jong Chul Ye, Tomer Michaeli, Michael Elad

Main category: cs.CV

TL;DR: InvFusion is a training-based degradation-aware posterior sampler that combines supervised performance with zero-shot flexibility by integrating degradation operators directly into diffusion denoisers.

Motivation: Existing methods face a trade-off: zero-shot approaches handle any linear degradation but sacrifice accuracy, while training-based methods are accurate but inflexible to test-time degradation changes.

Method: Novel architectural design that integrates degradation operators directly into diffusion denoisers, enabling degradation-aware posterior sampling.

Result: State-of-the-art performance on FFHQ and ImageNet datasets, outperforming both degradation-aware zero-shot and blind training-based methods.

Conclusion: InvFusion successfully bridges the gap between supervised and zero-shot approaches, offering both strong performance and flexibility for inverse problems.

Abstract: Diffusion Models have demonstrated remarkable capabilities in handling inverse problems, offering high-quality posterior-sampling-based solutions. Despite significant advances, a fundamental trade-off persists regarding the way the conditioned synthesis is employed: Zero-shot approaches can accommodate any linear degradation but rely on approximations that reduce accuracy. In contrast, training-based methods model the posterior correctly, but cannot adapt to the degradation at test-time. Here we introduce InvFusion, the first training-based degradation-aware posterior sampler. InvFusion combines the best of both worlds – the strong performance of supervised approaches and the flexibility of zero-shot methods. This is achieved through a novel architectural design that seamlessly integrates the degradation operator directly into the diffusion denoiser. We compare InvFusion against existing general-purpose posterior samplers, both degradation-aware zero-shot techniques and blind training-based methods. Experiments on the FFHQ and ImageNet datasets demonstrate state-of-the-art performance. Beyond posterior sampling, we further demonstrate the applicability of our architecture, operating as a general Minimum Mean Square Error predictor, and as a Neural Posterior Principal Component estimator.

[184] Event Stream Filtering via Probability Flux Estimation

Jinze Chen, Wei Zhai, Yang Cao, Bin Li, Zheng-Jun Zha

Main category: cs.CV

TL;DR: EDFilter is a real-time event denoising framework that models event generation as probability fluxes from irradiance diffusion, enabling high-fidelity motion reconstruction.

Motivation: Event cameras capture brightness changes with microsecond latency but suffer from severe noise and signal inconsistencies. Existing filters ignore inter-event time information, producing sparse outputs that limit continuous irradiance dynamics reconstruction.

Method: Models event generation as threshold-crossing probability fluxes from stochastic diffusion of irradiance trajectories. Uses nonparametric kernel-based estimation of probability flux and reconstructs continuous event density flow with O(1) recursive solver for real-time processing.

Result: Achieves high-fidelity, physically interpretable event denoising and motion reconstruction. Introduces Rotary Event Dataset (RED) with microsecond-resolution ground-truth irradiance flow for evaluation.

Conclusion: EDFilter provides an effective framework for event denoising that preserves temporal information and enables continuous irradiance dynamics reconstruction, outperforming existing methods that ignore inter-event time intervals.

Abstract: Event cameras asynchronously capture brightness changes with microsecond latency, offering exceptional temporal precision but suffering from severe noise and signal inconsistencies. Unlike conventional signals, events carry state information through polarities and process information through inter-event time intervals. However, existing event filters often ignore the latter, producing outputs that are sparser than the raw input and limiting the reconstruction of continuous irradiance dynamics. We propose the Event Density Flow Filter (EDFilter), a framework that models event generation as threshold-crossing probability fluxes arising from the stochastic diffusion of irradiance trajectories. EDFilter performs nonparametric, kernel-based estimation of probability flux and reconstructs the continuous event density flow using an O(1) recursive solver, enabling real-time processing. The Rotary Event Dataset (RED), featuring microsecond-resolution ground-truth irradiance flow under controlled illumination, is also presented for event quality evaluation. Experiments demonstrate that EDFilter achieves high-fidelity, physically interpretable event denoising and motion reconstruction.
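To give a feel for how a kernel-based density estimate can be updated in O(1) per event, the Python sketch below uses a causal exponential kernel, for which the sum over all past events collapses into a single recursive update. This is a generic textbook technique swapped in for illustration, not EDFilter's actual solver; the kernel choice, the time constant `tau`, and the timestamps are assumptions.

```python
import math

def recursive_event_density(timestamps, tau):
    """Causal exponential-kernel event-rate estimate, O(1) work per event.

    Equivalent to summing (1/tau) * exp(-(t_k - t_j) / tau) over all past
    events j <= k, but only the previous value is stored and decayed.
    """
    density, last_t, out = 0.0, None, []
    for t in timestamps:
        if last_t is not None:
            density *= math.exp(-(t - last_t) / tau)  # decay since previous event
        density += 1.0 / tau                          # contribution of this event
        last_t = t
        out.append(density)
    return out

def direct_event_density(timestamps, tau):
    """Brute-force O(n) per event version, used only to check the recursion."""
    return [
        sum(math.exp(-(t - s) / tau) / tau for s in timestamps[: k + 1])
        for k, t in enumerate(timestamps)
    ]

# Hypothetical event times in seconds for a single pixel: a burst, then a gap.
events = [0.000, 0.001, 0.002, 0.003, 0.050, 0.051]
tau = 0.005  # assumed kernel time constant in seconds
fast = recursive_event_density(events, tau)
slow = direct_event_density(events, tau)
```

The estimate rises during the burst and has almost fully decayed by the time the next event arrives after the gap, so the inter-event interval directly shapes the output rather than being discarded.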

[185] Causal Representation Learning with Observational Grouping for CXR Classification

Rajat Rasal, Avinash Kori, Ben Glocker

Main category: cs.CV

TL;DR: Learning identifiable causal representations for chest X-ray disease classification by grouping observations to enforce invariance across race, sex, and imaging views.

Motivation: To improve generalisability and robustness of task-specific latent features in medical imaging by uncovering true causal relationships underlying data generation processes.

Method: End-to-end framework that groups observations to learn identifiable representations, enforcing invariance with respect to race, sex, and imaging views.

Result: Causal representations improve generalisability and robustness across multiple classification tasks when grouping is used.

Conclusion: Grouping observations to learn identifiable causal representations enhances model performance and reliability in medical imaging applications.

Abstract: Identifiable causal representation learning seeks to uncover the true causal relationships underlying a data generation process. In medical imaging, this presents opportunities to improve the generalisability and robustness of task-specific latent features. This work introduces the concept of grouping observations to learn identifiable representations for disease classification in chest X-rays via an end-to-end framework. Our experiments demonstrate that these causal representations improve generalisability and robustness across multiple classification tasks when grouping is used to enforce invariance w.r.t. race, sex, and imaging views.

[186] TongUI: Building Generalized GUI Agents by Learning from Multimodal Web Tutorials

Bofei Zhang, Zirui Shang, Zhi Gao, Wang Zhang, Rui Xie, Xiaojian Ma, Tao Yuan, Xinxiao Wu, Song-Chun Zhu, Qing Li

Main category: cs.CV

TL;DR: The TongUI framework builds generalized GUI agents by learning from multimodal web tutorials, creating the GUI-Net dataset with 143K trajectories across multiple operating systems and applications and outperforming baseline agents by about 10% on multiple benchmarks.

Motivation: Lack of sufficient trajectory data across various operating systems and applications for developing generalized GUI agents due to high manual annotation costs.

Method: Crawl and process online GUI tutorials (videos and articles) into GUI agent trajectory data, create GUI-Net dataset, and fine-tune Qwen2.5-VL-3B/7B models on this dataset.

Result: Produced GUI-Net dataset with 143K trajectory data across 5 operating systems and 200+ applications. TongUI agent outperforms baseline agents by about 10% on multiple grounding and navigation benchmarks.

Conclusion: The GUI-Net dataset and TongUI framework effectively address the data scarcity problem for GUI agents, showing significant performance improvements and demonstrating the value of learning from web tutorials.

Abstract: Building Graphical User Interface (GUI) agents is a promising research direction, which simulates human interaction with computers or mobile phones to perform diverse GUI tasks. However, a major challenge in developing generalized GUI agents is the lack of sufficient trajectory data across various operating systems and applications, mainly due to the high cost of manual annotations. In this paper, we propose the TongUI framework that builds generalized GUI agents by learning from rich multimodal web tutorials. Concretely, we crawl and process online GUI tutorials (such as videos and articles) into GUI agent trajectory data, through which we produce the GUI-Net dataset containing 143K trajectory data across five operating systems and more than 200 applications. We develop the TongUI agent by fine-tuning Qwen2.5-VL-3B/7B models on GUI-Net, which show remarkable performance improvements on commonly used grounding and navigation benchmarks, outperforming baseline agents by about 10% on multiple benchmarks, showing the effectiveness of the GUI-Net dataset and underscoring the significance of our TongUI framework. We will fully open-source the code, the GUI-Net dataset, and the trained models soon.

[187] UINO-FSS: Unifying Representation Learning and Few-shot Segmentation via Hierarchical Distillation and Mamba-HyperCorrelation

Wei Zhuo, Zhiyue Tang, Wufeng Xue, Hao Ding, Junkai Ji, Linlin Shen

Main category: cs.CV

TL;DR: UINO-FSS is a unified few-shot semantic segmentation framework that integrates DINOv2 and SAM foundation models through coarse-to-fine multimodal distillation, achieving state-of-the-art performance.

Motivation: To address limitations of dual-branch architectures in few-shot segmentation by creating a unified model that integrates knowledge from different foundation architectures, overcoming misalignment between class-agnostic segmentation and fine-grained representations.

Method: Uses early-stage DINOv2 features that exhibit distribution consistency with SAM embeddings, enabling single-encoder architecture with bottleneck adapter, meta-visual prompt generator, mask decoder, and hierarchical cross-model distillation enhanced by Mamba-based 4D correlation mining.

Result: Achieves new state-of-the-art results: mIoU of 80.6 (+3.8%) on PASCAL-5^i and 64.5 (+4.1%) on COCO-20^i under 1-shot setting.

Conclusion: The unified approach effectively integrates knowledge from different foundation models through multimodal distillation, demonstrating superior performance in few-shot semantic segmentation.

Abstract: Few-shot semantic segmentation has attracted growing interest for its ability to generalize to novel object categories using only a few annotated samples. To address data scarcity, recent methods incorporate multiple foundation models to improve feature transferability and segmentation performance. However, they often rely on dual-branch architectures that combine pre-trained encoders to leverage complementary strengths, a design that limits flexibility and efficiency. This raises a fundamental question: can we build a unified model that integrates knowledge from different foundation architectures? Achieving this is, however, challenging due to the misalignment between class-agnostic segmentation capabilities and fine-grained discriminative representations. To this end, we present UINO-FSS, a novel framework built on the key observation that early-stage DINOv2 features exhibit distribution consistency with SAM’s output embeddings. This consistency enables the integration of both models’ knowledge into a single-encoder architecture via coarse-to-fine multimodal distillation. In particular, our segmenter consists of three core components: a bottleneck adapter for embedding alignment, a meta-visual prompt generator that leverages dense similarity volumes and semantic embeddings, and a mask decoder. Using hierarchical cross-model distillation, we effectively transfer SAM’s knowledge into the segmenter, further enhanced by Mamba-based 4D correlation mining on support-query pairs. Extensive experiments on PASCAL-5$^i$ and COCO-20$^i$ show that UINO-FSS achieves new state-of-the-art results under the 1-shot setting, with mIoU of 80.6 (+3.8%) on PASCAL-5$^i$ and 64.5 (+4.1%) on COCO-20$^i$, demonstrating the effectiveness of our unified approach.

[188] MOON: Generative MLLM-based Multimodal Representation Learning for E-commerce Product Understanding

Daoze Zhang, Chenghan Fu, Zhanheng Nie, Jianyu Liu, Wanxian Guan, Yuan Gao, Jun Song, Pengjie Wang, Jian Xu, Bo Zheng

Main category: cs.CV

TL;DR: MOON is the first generative MLLM-based model for product representation learning that addresses multimodal alignment challenges through guided MoE modules, core semantic region detection, and specialized negative sampling.

Motivation: Existing discriminative dual-flow architectures struggle with many-to-one alignment between multiple product images and texts, while generative MLLMs show potential but face challenges in multimodal modeling, background noise, and lack of standard benchmarks.

Method: Proposes MOON with: (1) guided Mixture-of-Experts for multimodal and aspect-specific modeling, (2) core semantic region detection to reduce background noise interference, (3) specialized negative sampling strategy for increased difficulty and diversity.

Result: Demonstrates competitive zero-shot performance on both the proposed MBE benchmark and public datasets, with strong generalization across cross-modal retrieval, product classification, and attribute prediction tasks.

Conclusion: MOON effectively addresses key challenges in product representation learning and shows promising results through case studies and visualizations, establishing a foundation for generative MLLM applications in e-commerce product understanding.

Abstract: With the rapid advancement of e-commerce, exploring general representations rather than task-specific ones has attracted increasing research attention. For product understanding, although existing discriminative dual-flow architectures drive progress in this field, they inherently struggle to model the many-to-one alignment between multiple images and texts of products. Therefore, we argue that generative Multimodal Large Language Models (MLLMs) hold significant potential for improving product representation learning. Nevertheless, achieving this goal remains non-trivial due to several key challenges: the lack of multimodal and aspect-aware modeling modules in typical LLMs; the common presence of background noise in product images; and the absence of a standard benchmark for evaluation. To address these issues, we propose the first generative MLLM-based model named MOON for product representation learning. Our method (1) employs a guided Mixture-of-Experts (MoE) module for targeted modeling of multimodal and aspect-specific product content; (2) effectively detects core semantic regions in product images to mitigate the distraction and interference caused by background noise; and (3) introduces a specialized negative sampling strategy to increase the difficulty and diversity of negative samples. In addition, we release a large-scale multimodal benchmark MBE for various product understanding tasks. Experimentally, our model demonstrates competitive zero-shot performance on both our benchmark and the public dataset, showcasing strong generalization across various downstream tasks, including cross-modal retrieval, product classification, and attribute prediction. Furthermore, the case study and visualization illustrate the effectiveness of MOON for product understanding.

[189] ToDRE: Effective Visual Token Pruning via Token Diversity and Task Relevance

Duo Li, Zuhao Yang, Xiaoqin Zhang, Ling Shao, Shijian Lu

Main category: cs.CV

TL;DR: ToDRE is a training-free visual token pruning framework that uses token diversity and task relevance to compress 90% of visual tokens, achieving 2.6x speed-up while maintaining 95% performance.

Motivation: Existing visual token pruning methods use single metrics like cross-modal attention or token similarity, but visual token diversity and task-specific relevance are two orthogonal factors that should be treated separately for more effective pruning.

Method: Two-stage framework: 1) Uses greedy max-sum diversification to select diverse representative tokens after vision encoder, 2) Implements ‘information migration’ to eliminate task-irrelevant tokens in certain LLM decoder layers.

Result: Prunes 90% of visual tokens after vision encoder and all visual tokens in certain LLM decoder layers, achieving 2.6x speed-up in total inference time while maintaining 95.0% model performance with excellent compatibility.

Conclusion: Treating token diversity and task relevance as separate orthogonal factors enables more effective visual token compression, significantly improving LVLM inference efficiency without sacrificing performance.

Abstract: Visual token pruning, which compresses and removes redundant visual tokens, plays a critical role in efficient inference with large vision-language models (LVLMs). However, most existing work estimates visual redundancy using a single metric, such as cross-modal attention or visual token similarity. We show that visual token diversity and task-specific token relevance are two crucial yet orthogonal factors that complement each other in conveying useful information and should therefore be treated separately for more effective visual token pruning. Building upon this insight, we design ToDRE, a two-stage and training-free framework that incorporates Token Diversity and task RElevance for effective token compression and efficient LVLM inference. Instead of pruning redundant tokens, we introduce a greedy max-sum diversification algorithm that selects and retains a subset of diverse and representative visual tokens after the vision encoder. On top of that, ToDRE leverages an “information migration” mechanism to eliminate task-irrelevant visual tokens within certain decoder layers of the large language model (LLM) to further improve token pruning and LVLM inference. Extensive experiments show that ToDRE prunes 90% of visual tokens after the vision encoder as well as all visual tokens in certain LLM decoder layers, leading to a 2.6x speed-up in total inference time while maintaining 95.0% model performance plus excellent model compatibility.
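
To make the idea of greedy max-sum diversification concrete, here is a minimal sketch of the generic technique: tokens are added one at a time, each time picking the token whose summed distance to the already selected set is largest. The distance measure (Euclidean), the seeding rule (start from the first token), and the function name are assumptions for illustration; the paper's exact criterion may differ.

```python
import math


def greedy_max_sum_select(tokens, k):
    """Return indices of k tokens chosen by greedy max-sum diversification.

    tokens: list of equal-length feature vectors (tuples or lists).
    Assumption: the selection is seeded with token 0.
    """
    if k <= 0 or not tokens:
        return []
    k = min(k, len(tokens))
    selected = [0]
    # gain[i] = summed distance from token i to every selected token.
    gain = [math.dist(tokens[0], t) for t in tokens]
    while len(selected) < k:
        chosen = set(selected)
        best = max(
            (i for i in range(len(tokens)) if i not in chosen),
            key=lambda i: gain[i],
        )
        selected.append(best)
        # Incremental update keeps each round linear in the token count.
        for i, t in enumerate(tokens):
            gain[i] += math.dist(tokens[best], t)
    return selected
```

With four 2D tokens where two are near-duplicates, the near-duplicate is the one left out, which is the behaviour a diversity-based selector is meant to have.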

[190] Deep Spectral Prior

Yanqi Cheng, Xuxiang Zhao, Tieyong Zeng, Pietro Lio, Carola-Bibiane Schönlieb, Angelica I Aviles-Rivero

Main category: cs.CV

TL;DR: Deep Spectral Prior (DSP) is an unsupervised image reconstruction framework that operates in the frequency domain, overcoming Deep Image Prior’s overfitting issues through joint amplitude-phase learning and automatic spectral mode separation.

Motivation: To address Deep Image Prior's major limitations of pixel-level optimization sensitivity to overfitting and reliance on manual early stopping, by developing a frequency-domain approach that captures full spectral structure.

Method: Joint learning of amplitude and phase in complex frequency domain, with frequency-dependent optimization dynamics that separate low-frequency informative modes from high-frequency noise, inducing implicit projection onto frequency-consistent manifold.

Result: DSP consistently surpasses DIP and other unsupervised baselines in denoising, inpainting, and deblurring tasks, achieving superior fidelity, robustness, and theoretical interpretability without explicit priors or supervision.

Conclusion: DSP provides a unified, unsupervised data-free framework that eliminates DIP’s overfitting issues through automatic spectral regularization, enabling stable, physically plausible reconstructions with formal theoretical guarantees.

Abstract: We introduce the Deep Spectral Prior (DSP), a new framework for unsupervised image reconstruction that operates entirely in the complex frequency domain. Unlike the Deep Image Prior (DIP), which optimises pixel-level errors and is highly sensitive to overfitting, DSP performs joint learning of amplitude and phase to capture the full spectral structure of images. We derive a rigorous theoretical characterisation of DSP’s optimisation dynamics, proving that it follows frequency-dependent descent trajectories that separate informative low-frequency modes from stochastic high-frequency noise. This spectral mode separation explains DSP’s self-regularising behaviour and, for the first time, formally establishes the elimination of DIP’s major limitation: its reliance on manual early stopping. Moreover, DSP induces an implicit projection onto a frequency-consistent manifold, ensuring convergence to stable, physically plausible reconstructions without explicit priors or supervision. Extensive experiments on denoising, inpainting, and deblurring demonstrate that DSP consistently surpasses DIP and other unsupervised baselines, achieving superior fidelity, robustness, and theoretical interpretability within a unified, unsupervised data-free framework.

[191] ReassembleNet: Learnable Keypoints and Diffusion for 2D Fresco Reconstruction

Adeela Islam, Stefano Fiorini, Stuart James, Pietro Morerio, Alessio Del Bue

Main category: cs.CV

TL;DR: ReassembleNet addresses limitations in DL reassembly methods by using contour keypoints and GNN pooling to reduce complexity, integrate multimodal features, and apply diffusion-based pose estimation, achieving significant improvements in rotation and translation accuracy.

Motivation: Current Deep Learning methods for reassembly lack scalability, multimodality, and real-world applicability for complex shapes and realistic erosion problems.

Method: Represent pieces as contour keypoints, use GNN pooling to select informative keypoints, integrate geometric and texture features, pretrain on semi-synthetic data, and apply diffusion-based pose estimation.

Result: 57% improvement in RMSE Rotation and 87% improvement in RMSE Translation compared to prior methods.

Conclusion: ReassembleNet effectively addresses key limitations in reassembly tasks through reduced complexity, multimodal integration, and improved pose estimation accuracy.

Abstract: The task of reassembly is a significant challenge across multiple domains, including archaeology, genomics, and molecular docking, requiring the precise placement and orientation of elements to reconstruct an original structure. In this work, we address key limitations in state-of-the-art Deep Learning methods for reassembly, namely i) scalability; ii) multimodality; and iii) real-world applicability: beyond square or simple geometric shapes, realistic and complex erosion, or other real-world problems. We propose ReassembleNet, a method that reduces complexity by representing each input piece as a set of contour keypoints and learning to select the most informative ones using techniques inspired by Graph Neural Network pooling. ReassembleNet effectively lowers computational complexity while enabling the integration of features from multiple modalities, including both geometric and texture data. The model is further enhanced through pretraining on a semi-synthetic dataset. We then apply diffusion-based pose estimation to recover the original structure. We improve on prior methods by 57% and 87% for RMSE Rotation and Translation, respectively.

[192] Gaussian Mapping for Evolving Scenes

Vladimir Yugay, Thies Kersten, Luca Carlone, Theo Gevers, Martin R. Oswald, Lukas Schmid

Main category: cs.CV

TL;DR: A dynamic scene-adaptation mechanism for 3D Gaussian Splatting that continuously updates scene representations to handle long-term scene evolution, with a novel keyframe management system to maintain consistency.

Motivation: Current 3DGS mapping systems are limited to static scenes, while long-term dynamics (scene evolution out of view) remain underexplored, creating a need for systems that can adapt to changing environments.

Method: Proposes a dynamic scene-adaptation mechanism that continuously updates 3DGS, coupled with a novel keyframe management system that discards outdated observations while preserving relevant information to maintain reconstruction consistency.

Result: Achieves 29.7% improvement in PSNR and 3x improvement in L1 depth error over the most competitive baseline on both synthetic and real-world datasets.

Conclusion: The proposed approach successfully addresses long-term scene dynamics in 3DGS mapping, enabling continuous adaptation to evolving environments while maintaining reconstruction quality.

Abstract: Mapping systems with novel view synthesis (NVS) capabilities, most notably 3D Gaussian Splatting (3DGS), are widely used in computer vision and across various applications, including augmented reality, robotics, and autonomous driving. However, many current approaches are limited to static scenes. While recent works have begun addressing short-term dynamics (motion within the camera’s view), long-term dynamics (the scene evolving through changes out of view) remain less explored. To overcome this limitation, we introduce a dynamic scene-adaptation mechanism that continuously updates 3DGS to reflect the latest changes. Since maintaining consistency remains challenging due to stale observations that disrupt the reconstruction process, we propose a novel keyframe management mechanism that discards outdated observations while preserving as much information as possible. We thoroughly evaluate Gaussian Mapping for Evolving Scenes on both synthetic and real-world datasets, achieving a 29.7% improvement in PSNR and a 3 times improvement in L1 depth error over the most competitive baseline.

[193] Underage Detection through a Multi-Task and MultiAge Approach for Screening Minors in Unconstrained Imagery

Christopher Gaul, Eduardo Fidalgo, Enrique Alegre, Rocío Alaiz Rodríguez, Eri Pérez Corral

Main category: cs.CV

TL;DR: Proposes a multi-task architecture for automatic minor screening in images using frozen FaRL backbone with age regression and binary underage classification heads, achieving improved accuracy and robustness to domain shifts.

Motivation: Address challenges in automatic minor screening including distribution shift, under-representation of children in datasets, and legal requirements for accurate age discrimination.

Method: Multi-task architecture with frozen FaRL vision-language backbone, compact MLP sharing features across age-regression head and four binary underage heads (12, 15, 18, 21 years), using α-reweighted focal loss and age-balanced sampling.

Result: Reduces mean absolute error from 4.175 to 4.068 years on ASORES-39k, improves under-18 detection F2 score from 0.801 to 0.857 at 1% false-adult rate. Maintains 0.99 recall on ASWIFT-20k with F2 rising from 0.742 to 0.833.

Conclusion: The proposed multi-age model demonstrates effective minor screening with improved accuracy and robustness to real-world domain shifts while focusing on legally critical age ranges.

Abstract: Accurate automatic screening of minors in unconstrained images requires models robust to distribution shift and resilient to the under-representation of children in public datasets. To address these issues, we propose a multi-task architecture with dedicated under/over-age discrimination tasks based on a frozen FaRL vision-language backbone joined with a compact two-layer MLP that shares features across one age-regression head and four binary underage heads (12, 15, 18, and 21 years). This design focuses on the legally critical age range while keeping the backbone frozen. Class imbalance is mitigated through an α-reweighted focal loss and age-balanced mini-batch sampling, while an age gap removes ambiguous samples near thresholds. Evaluation is conducted on our new Overall Underage Benchmark (303k cleaned training images, 110k test images), defining both the “ASORES-39k” restricted overall test, which removes the noisiest domains, and the age estimation wild-shifts test “ASWIFT-20k” of 20k images, stressing extreme poses (>45°), expressions, and low image quality to emulate real-world shifts. Trained on the cleaned overall set with resampling and age gap, our multiage model “F” reduces the mean absolute error on ASORES-39k from 4.175 y (age-only baseline) to 4.068 y and improves under-18 detection from an F2 score of 0.801 to 0.857 at a 1% false-adult rate. On ASWIFT-20k, the same configuration nearly sustains 0.99 recall while F2 rises from 0.742 to 0.833, demonstrating robustness to domain shift.
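
Two quantities in this entry can be written down compactly: the α-weighted focal loss used against class imbalance and the F2 score used for reporting. The sketch below uses the standard binary focal loss and the standard F-beta definition; the default `alpha` and `gamma` values are placeholders, and the paper's exact reweighting scheme is not given in the abstract, so treat this as an assumption-laden illustration.

```python
import math


def focal_loss(prob, label, alpha=0.25, gamma=2.0):
    """Standard alpha-weighted binary focal loss for one sample.

    prob: predicted probability of the positive class.
    label: 1 for positive (e.g. underage), 0 for negative.
    alpha, gamma: placeholder defaults, not values from the paper.
    """
    p_t = prob if label == 1 else 1.0 - prob
    alpha_t = alpha if label == 1 else 1.0 - alpha
    p_t = min(max(p_t, 1e-12), 1.0)  # guard against log(0)
    # (1 - p_t) ** gamma down-weights samples that are already easy.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def f_beta(true_pos, false_pos, false_neg, beta=2.0):
    """F-beta score; beta=2 weights recall more heavily than precision."""
    beta_sq = beta * beta
    denominator = (1 + beta_sq) * true_pos + beta_sq * false_neg + false_pos
    if denominator == 0:
        return 0.0
    return (1 + beta_sq) * true_pos / denominator
```

Choosing beta = 2 reflects a screening setting in which missing a minor (a false negative) is treated as costlier than a false alarm.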

[194] Accelerating Local AI on Consumer GPUs: A Hardware-Aware Dynamic Strategy for YOLOv10s

Mahmudul Islam Masum, Miad Islam

Main category: cs.CV

TL;DR: Two-Pass Adaptive Inference algorithm bridges performance gap on consumer hardware by using fast low-resolution pass and escalating to high-resolution only when needed, achieving 1.85x speedup with minimal accuracy loss.

Motivation: Address the critical gap between benchmark performance and practical viability of object detectors on resource-constrained consumer hardware like laptops with RTX 4060 GPUs, where system-level bottlenecks dominate performance.

Method: Introduces a model-independent Two-Pass Adaptive Inference algorithm that uses a fast low-resolution pass and escalates to high-resolution model pass only when detection confidence is low, without requiring architectural changes.

Result: On a 5000-image COCO dataset, achieves 1.85x speedup over a PyTorch Early-Exit baseline with a modest mAP loss of 5.51%.

Conclusion: Provides practical blueprint for deploying high-performance real-time AI on consumer devices by shifting focus from pure model optimization to hardware-aware inference strategies that maximize throughput.

Abstract: As local AI grows in popularity, there is a critical gap between the benchmark performance of object detectors and their practical viability on consumer-grade hardware. While models like YOLOv10s promise real-time speeds, these metrics are typically achieved on high-power, desktop-class GPUs. This paper reveals that on resource-constrained systems, such as laptops with RTX 4060 GPUs, performance is not compute-bound but is instead dominated by system-level bottlenecks, as illustrated by a simple bottleneck test. To overcome this hardware-level constraint, we introduce a Two-Pass Adaptive Inference algorithm, a model-independent approach that requires no architectural changes. This study mainly focuses on adaptive inference strategies and undertakes a comparative analysis of architectural early-exit and resolution-adaptive routing, highlighting their respective trade-offs within a unified evaluation framework. The system uses a fast, low-resolution pass and only escalates to a high-resolution model pass when detection confidence is low. On a 5000-image COCO dataset, our method achieves a 1.85x speedup over a PyTorch Early-Exit baseline, with a modest mAP loss of 5.51%. This work provides a practical and reproducible blueprint for deploying high-performance, real-time AI on consumer-grade devices by shifting the focus from pure model optimization to hardware-aware inference strategies that maximize throughput.
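
The control flow of a two-pass scheme like the one described can be sketched in a few lines. The two detector callables, the `(label, score)` detection format, and the escalation rule (escalate when there are no detections or any score falls below the threshold) are all assumptions for illustration; the abstract only states that escalation happens when detection confidence is low.

```python
def two_pass_detect(image, fast_detector, accurate_detector,
                    confidence_threshold=0.5):
    """Run a cheap pass first and escalate only when it looks unreliable.

    fast_detector / accurate_detector: hypothetical callables that take
    an image and return a list of (label, score) tuples, e.g. the same
    model run at low and high input resolution.
    Returns the detections and the name of the pass that produced them.
    """
    detections = fast_detector(image)
    confident = bool(detections) and all(
        score >= confidence_threshold for _, score in detections
    )
    if confident:
        return detections, "fast"
    # Low confidence or nothing found: pay for the expensive pass.
    return accurate_detector(image), "accurate"
```

The throughput gain depends on how often the fast pass is accepted, so the threshold trades speed against accuracy.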

[195] UGG-ReID: Uncertainty-Guided Graph Model for Multi-Modal Object Re-Identification

Xixi Wan, Aihua Zheng, Bo Jiang, Beibei Wang, Chenglong Li, Jin Tang

Main category: cs.CV

TL;DR: UGG-ReID is a robust multi-modal object re-identification method that uses uncertainty-guided graphs to handle local noise and effectively fuse heterogeneous modalities, achieving superior performance across five datasets.

Motivation: Address two core challenges in multi-modal object ReID: (1) learning robust features under fine-grained local noise from occlusion and disruptions, and (2) effectively integrating heterogeneous modalities for enhanced representation.

Method: Proposes uncertainty-guided graph model with Gaussian patch-graph representation to quantify local uncertainty and capture structural relationships, plus uncertainty-guided mixture of experts strategy that dynamically routes samples to low-uncertainty experts.

Result: Achieves excellent performance on five multi-modal object ReID datasets with diverse spectral modalities, significantly outperforming current methods in noise immunity.

Conclusion: UGG-ReID effectively mitigates noise interference and facilitates multi-modal fusion through uncertainty estimation and modeling, demonstrating superior robustness and performance in multi-modal object re-identification tasks.

Abstract: Multi-modal object Re-IDentification (ReID) has gained considerable attention with the goal of retrieving specific targets across cameras using heterogeneous visual data sources. At present, multi-modal object ReID faces two core challenges: (1) learning robust features under fine-grained local noise caused by occlusion, frame loss, and other disruptions; and (2) effectively integrating heterogeneous modalities to enhance multi-modal representation. To address the above challenges, we propose a robust approach named Uncertainty-Guided Graph model for multi-modal object ReID (UGG-ReID). UGG-ReID is designed to mitigate noise interference and facilitate effective multi-modal fusion by estimating both local and sample-level aleatoric uncertainty and explicitly modeling their dependencies. Specifically, we first propose the Gaussian patch-graph representation model that leverages uncertainty to quantify fine-grained local cues and capture their structural relationships. This process boosts the expressiveness of modal-specific information, ensuring that the generated embeddings are both more informative and robust. Subsequently, we design an uncertainty-guided mixture of experts strategy that dynamically routes samples to experts exhibiting low uncertainty. This strategy effectively suppresses noise-induced instability, leading to enhanced robustness. Meanwhile, we design an uncertainty-guided routing to strengthen the multi-modal interaction, improving the performance. UGG-ReID is comprehensively evaluated on five representative multi-modal object ReID datasets, encompassing diverse spectral modalities. Experimental results show that the proposed method achieves excellent performance on all datasets and is significantly better than current methods in terms of noise immunity. Our code is available at https://github.com/wanxixi11/UGG-ReID.
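
One simple way to picture routing samples toward low-uncertainty experts is a softmax over negated uncertainty scores, so that a lower uncertainty yields a larger routing weight. This is only an illustrative assumption about the general idea; the function name, the `temperature` parameter, and the softmax form are hypothetical and not taken from the UGG-ReID paper or its code.

```python
import math


def route_by_uncertainty(uncertainties, temperature=1.0):
    """Turn per-expert uncertainty scores into routing weights.

    uncertainties: one non-negative score per expert for a given sample.
    Returns weights that sum to 1, largest for the least uncertain expert.
    """
    scores = [-u / temperature for u in uncertainties]
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

A smaller temperature makes the routing closer to a hard choice of the single least uncertain expert.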

[196] Beyond Diagnosis: Evaluating Multimodal LLMs for Pathology Localization in Chest Radiographs

Advait Gosai, Arun Kavishwar, Stephanie L. McNamara, Soujanya Samineni, Renato Umeton, Alexander Chowdhury, William Lotter

Main category: cs.CV

TL;DR: Evaluation of GPT-4, GPT-5, and MedGemma for pathology localization on chest X-rays shows GPT-5 leads with 49.7% accuracy but all MLLMs underperform task-specific CNNs (59.9%) and radiologists (80.1%).

Motivation: To assess MLLMs' spatial understanding of anatomy and disease through pathology localization, which is clinically relevant and provides insight beyond diagnostic capabilities.

Method: Systematic evaluation using a prompting pipeline that overlays spatial grids and elicits coordinate-based predictions on CheXlocalize dataset with nine pathologies.

Result: GPT-5 achieved 49.7% localization accuracy, GPT-4 39.1%, MedGemma 17.7%. GPT-5 made anatomically plausible predictions, GPT-4 struggled with variable findings, MedGemma improved with few-shot prompting.

Conclusion: Current MLLMs show promise but limitations in medical imaging localization, requiring integration with task-specific tools for reliable clinical use.

Abstract: Recent work has shown promising performance of frontier large language models (LLMs) and their multimodal counterparts in medical quizzes and diagnostic tasks, highlighting their potential for broad clinical utility given their accessible, general-purpose nature. However, beyond diagnosis, a fundamental aspect of medical image interpretation is the ability to localize pathological findings. Evaluating localization not only has clinical and educational relevance but also provides insight into a model’s spatial understanding of anatomy and disease. Here, we systematically assess two general-purpose MLLMs (GPT-4 and GPT-5) and a domain-specific model (MedGemma) in their ability to localize pathologies on chest radiographs, using a prompting pipeline that overlays a spatial grid and elicits coordinate-based predictions. Averaged across nine pathologies in the CheXlocalize dataset, GPT-5 exhibited a localization accuracy of 49.7%, followed by GPT-4 (39.1%) and MedGemma (17.7%), all lower than a task-specific CNN baseline (59.9%) and a radiologist benchmark (80.1%). Despite modest performance, error analysis revealed that GPT-5’s predictions were largely in anatomically plausible regions, just not always precisely localized. GPT-4 performed well on pathologies with fixed anatomical locations, but struggled with spatially variable findings and exhibited anatomically implausible predictions more frequently. MedGemma demonstrated the lowest performance on all pathologies, but showed improvements when provided examples through few-shot prompting. Our findings highlight both the promise and limitations of current MLLMs in medical imaging and underscore the importance of integrating them with task-specific tools for reliable use.
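
A grid-overlay pipeline needs two small steps after the model answers: mapping the predicted grid cell back to pixel coordinates, and checking whether that point falls inside the annotated region. The sketch below assumes a square grid, zero-based `(col, row)` cell indices, and a bounding-box ground truth; these are illustrative assumptions, and the study's actual grid size and scoring rule are not stated in the abstract.

```python
def grid_cell_to_pixel(col, row, grid_size, width, height):
    """Return the pixel coordinates of the centre of a grid cell.

    col, row: zero-based cell indices predicted by the model.
    grid_size: number of cells per side (assumed square grid).
    """
    cell_w = width / grid_size
    cell_h = height / grid_size
    return ((col + 0.5) * cell_w, (row + 0.5) * cell_h)


def is_hit(point, box):
    """True if the point lies inside the (x_min, y_min, x_max, y_max) box."""
    x, y = point
    x_min, y_min, x_max, y_max = box
    return x_min <= x <= x_max and y_min <= y <= y_max


# Usage: a 10x10 grid over a 1000x1000 radiograph, hypothetical annotation.
point = grid_cell_to_pixel(2, 3, grid_size=10, width=1000, height=1000)
hit = is_hit(point, (200, 300, 400, 400))
```

Averaging such hits over all images and pathologies gives one possible notion of localization accuracy.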

[197] PointVDP: Learning View-Dependent Projection by Fireworks Rays for 3D Point Cloud Segmentation

Yang Chen, Yueqi Duan, Haowen Sun, Ziwei Wang, Jiwen Lu, Yap-Peng Tan

Main category: cs.CV

TL;DR: PointVDP introduces view-dependent projection (VDP) for efficient 3D-to-2D mapping in point cloud segmentation, using data-driven projections inspired by fireworks to generate informative single-image inputs with minimal computational overhead.

Motivation: Existing projection-based methods use view-independent projections with pre-defined parameters, limiting point awareness and projection diversity while causing computational inefficiency through redundant projections.

Method: VDP framework generates data-driven projections from 3D point distributions, predicting adaptive rays inspired by fireworks behavior, with color regularization to emphasize semantic features and suppress non-semantic ones.

Result: Experiments on S3DIS and ScanNet benchmarks show competitive performance while offering lightweight projections with marginal computation costs.

Conclusion: PointVDP provides a resource-efficient solution for semantic understanding in point cloud segmentation through adaptive, data-driven projections that maximize 2D space utilization.

Abstract: In this paper, we propose view-dependent projection (VDP) to facilitate point cloud segmentation, designing efficient 3D-to-2D mapping that dynamically adapts to the spatial geometry from view variations. Existing projection-based methods leverage view-independent projection in complex scenes, relying on straight lines to generate direct rays or upward curves to reduce occlusions. However, their view independence provides projection rays that are limited to pre-defined parameters by human settings, restricting point awareness and failing to capture sufficient projection diversity across different view planes. Although multiple projections per view plane are commonly used to enhance spatial variety, the projected redundancy leads to excessive computational overhead and inefficiency in image processing. To address these limitations, we design a framework of VDP to generate data-driven projections from 3D point distributions, producing highly informative single-image inputs by predicting rays inspired by the adaptive behavior of fireworks. In addition, we construct color regularization to optimize the framework, which emphasizes essential features within semantic pixels and suppresses the non-semantic features within black pixels, thereby maximizing 2D space utilization in a projected image. As a result, our approach, PointVDP, develops lightweight projections in marginal computation costs. Experiments on S3DIS and ScanNet benchmarks show that our approach achieves competitive results, offering a resource-efficient solution for semantic understanding.

[198] Gene-DML: Dual-Pathway Multi-Level Discrimination for Gene Expression Prediction from Histopathology Images

Yaxuan Song, Jianan Fan, Hang Chang, Weidong Cai

Main category: cs.CV

TL;DR: Gene-DML is a unified framework that enhances gene expression prediction from histopathology images through dual-pathway multi-level discrimination, achieving state-of-the-art performance by better aligning morphological and transcriptional representations.

Motivation: Existing methods underutilize cross-modal representation alignment between histopathology images and gene expression profiles across multiple representational levels, limiting prediction performance for scalable molecular profiling in precision medicine.

Method: Proposes Gene-DML with dual-pathway multi-level discrimination: (1) multi-scale instance-level discrimination aligning hierarchical histopathology representations at local, neighbor, and global levels with gene expression, and (2) cross-level instance-group discrimination enforcing structural consistency between individual instances and modality-crossed groups.

Result: Extensive experiments on public spatial transcriptomics datasets demonstrate that Gene-DML achieves state-of-the-art performance in gene expression prediction, with enhanced predictive accuracy and generalization across diverse biological contexts.

Conclusion: Gene-DML learns robust cross-modal representations by jointly modeling fine-grained and structural-level discrimination, significantly improving gene expression prediction from histopathology images for computational pathology applications.

Abstract: Accurately predicting gene expression from histopathology images offers a scalable and non-invasive approach to molecular profiling, with significant implications for precision medicine and computational pathology. However, existing methods often underutilize the cross-modal representation alignment between histopathology images and gene expression profiles across multiple representational levels, thereby limiting their prediction performance. To address this, we propose Gene-DML, a unified framework that structures latent space through Dual-pathway Multi-Level discrimination to enhance correspondence between morphological and transcriptional modalities. The multi-scale instance-level discrimination pathway aligns hierarchical histopathology representations extracted at local, neighbor, and global levels with gene expression profiles, capturing scale-aware morphological-transcriptional relationships. In parallel, the cross-level instance-group discrimination pathway enforces structural consistency between individual (image/gene) instances and modality-crossed (gene/image, respectively) groups, strengthening the alignment across modalities. By jointly modeling fine-grained and structural-level discrimination, Gene-DML is able to learn robust cross-modal representations, enhancing both predictive accuracy and generalization across diverse biological contexts. Extensive experiments on public spatial transcriptomics datasets demonstrate that Gene-DML achieves state-of-the-art performance in gene expression prediction. The code and processed datasets are available at https://github.com/YXSong000/Gene-DML.

[199] CST Anti-UAV: A Thermal Infrared Benchmark for Tiny UAV Tracking in Complex Scenes

Bin Xie, Congxuan Zhang, Fagan Wang, Peng Liu, Feng Lu, Zhen Chen, Weiming Hu

Main category: cs.CV

TL;DR: CST Anti-UAV is a new thermal infrared dataset for tracking tiny UAVs in complex scenes, containing 220 sequences with 240k+ annotations. Among 20 evaluated SOT methods, the best reaches only 35.92% state accuracy, highlighting limitations of existing benchmarks.

Motivation: Existing UAV tracking datasets lack diversity in scene complexity and attribute representation, limiting applicability to real-world anti-UAV scenarios. There's a need for datasets that feature tiny UAV targets in complex environments.

Method: Created CST Anti-UAV dataset with 220 video sequences and 240k+ high-quality bounding box annotations, featuring tiny UAV targets in diverse complex scenes. Includes complete manual frame-level attribute annotations for precise evaluation. Evaluated 20 existing SOT methods on this dataset.

Result: The state-of-the-art method achieves only 35.92% state accuracy on CST Anti-UAV, much lower than the 67.69% observed on the Anti-UAV410 dataset. This demonstrates that tracking tiny UAVs in complex environments remains a major challenge.

Conclusion: The CST Anti-UAV benchmark reveals limitations of existing UAV tracking methods and datasets. Public release of this dataset will foster development of more robust SOT methods and drive innovation in anti-UAV systems.

Abstract: The widespread application of Unmanned Aerial Vehicles (UAVs) has raised serious public safety and privacy concerns, making UAV perception crucial for anti-UAV tasks. However, existing UAV tracking datasets predominantly feature conspicuous objects and lack diversity in scene complexity and attribute representation, limiting their applicability to real-world scenarios. To overcome these limitations, we present the CST Anti-UAV, a new thermal infrared dataset specifically designed for Single Object Tracking (SOT) in Complex Scenes with Tiny UAVs (CST). It contains 220 video sequences with over 240k high-quality bounding box annotations, highlighting two key properties: a significant number of tiny-sized UAV targets and the diverse and complex scenes. To the best of our knowledge, CST Anti-UAV is the first dataset to incorporate complete manual frame-level attribute annotations, enabling precise evaluations under varied challenges. To conduct an in-depth performance analysis for CST Anti-UAV, we evaluate 20 existing SOT methods on the proposed dataset. Experimental results demonstrate that tracking tiny UAVs in complex environments remains a challenge, as the state-of-the-art method achieves only 35.92% state accuracy, much lower than the 67.69% observed on the Anti-UAV410 dataset. These findings underscore the limitations of existing benchmarks and the need for further advancements in UAV tracking research. The CST Anti-UAV benchmark is about to be publicly released, which not only fosters the development of more robust SOT methods but also drives innovation in anti-UAV systems.
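
Illustrative sketch (not from the paper): the abstract reports "state accuracy" without defining it. In the Anti-UAV benchmark family the metric is commonly described as a per-frame average that scores IoU when the target is visible and scores 1 for correctly reporting an absent target. The code below follows that description as an assumption; the box format `(x, y, w, h)` and the use of `None` for "target absent" are choices made here for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    ax0, ay0, ax1, ay1 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    bx0, by0, bx1, by1 = b[0], b[1], b[0] + b[2], b[1] + b[3]
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def state_accuracy(predictions, ground_truth):
    """Average per-frame score over a sequence.

    Each entry is a box (x, y, w, h), or None when the target is absent
    (ground truth) or the tracker reports no target (prediction).
    """
    total = 0.0
    for pred, gt in zip(predictions, ground_truth):
        if gt is not None:
            # Target visible: score the overlap, zero if the tracker gave up.
            total += iou(pred, gt) if pred is not None else 0.0
        else:
            # Target absent: reward only an explicit "no target" prediction.
            total += 1.0 if pred is None else 0.0
    return total / len(ground_truth)
```

Under this definition a tracker is penalized both for poor overlap on tiny targets and for hallucinating a box when the UAV has left the frame.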

[200] A Denoising Framework for Real-World Ultra-Low-Dose Lung CT Images Based on an Image Purification Strategy

Guoliang Gong, Man Yu

Main category: cs.CV

TL;DR: This paper introduces a real-world paired lung dataset (Patient-uLDCT) with ultra-low-dose CT images (2% of normal dose) and a Frequency-domain Flow Matching model to address domain-shift issues in low-dose CT image enhancement.

Motivation: To overcome limitations of synthetic LDCT training data that cause domain-shift problems and reduce practical effectiveness of AI-based CT enhancement algorithms in real clinical scenarios.

Method: Constructed real-world paired lung dataset with multiple scans on volunteers (2% radiation dose), proposed purification strategy for anatomical alignment, and developed Frequency-domain Flow Matching model for image reconstruction.

Result: Created a dataset with substantially lower radiation dose (2% vs conventional 25% low-dose and 10% ultra-low-dose) and achieved excellent image reconstruction performance with the proposed FFM model.

Conclusion: The real-world paired dataset and Frequency-domain Flow Matching model effectively address domain-shift issues in low-dose CT image enhancement, enabling more practical clinical applications.

Abstract: Computed Tomography (CT) is a vital diagnostic tool in clinical practice, yet the health risks associated with ionizing radiation cannot be overlooked. Low-dose CT (LDCT) helps mitigate radiation exposure but simultaneously leads to reduced image quality. Consequently, researchers have sought to reconstruct clear images from LDCT scans using artificial intelligence-based image enhancement techniques. However, these studies typically rely on synthetic LDCT images for algorithm training, which introduces significant domain-shift issues and limits the practical effectiveness of these algorithms in real-world scenarios. To address this challenge, we constructed a real-world paired lung dataset, referred to as Patient-uLDCT (ultra-low-dose CT), by performing multiple scans on volunteers. The radiation dose for the low-dose images in this dataset is only 2% of the normal dose, substantially lower than the conventional 25% low-dose and 10% ultra-low-dose levels. Furthermore, to resolve the anatomical misalignment between normal-dose and uLDCT images caused by respiratory motion during acquisition, we propose a novel purification strategy to construct corresponding aligned image pairs. Finally, we introduce a Frequency-domain Flow Matching model (FFM) that achieves excellent image reconstruction performance. Code is available at https://github.com/MonkeyDadLufy/flow-matching.
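
Illustrative sketch (not from the paper): the abstract names a Frequency-domain Flow Matching model but gives no formulation. The snippet below shows only the generic flow matching objective with a straight-line path between a source sample `x0` and a target sample `x1`; the frequency-domain part of FFM and the choice of what plays the role of `x0` and `x1` are not covered and would need the paper itself. `predict_velocity` is a hypothetical stand-in for a trained network.

```python
def interpolate(x0, x1, t):
    """Point on the straight line from x0 (t=0) to x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def flow_matching_loss(predict_velocity, x0, x1, t):
    """Mean squared error between predicted and target velocity at time t.

    For a straight-line path the target velocity is constant: x1 - x0.
    predict_velocity(x_t, t) is any callable returning a list of floats.
    """
    x_t = interpolate(x0, x1, t)
    target = [b - a for a, b in zip(x0, x1)]
    predicted = predict_velocity(x_t, t)
    return sum((p - v) ** 2 for p, v in zip(predicted, target)) / len(target)
```

At inference, a model trained this way is integrated from `t = 0` to `t = 1`, moving a source sample along the learned velocity field toward the target distribution.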

[201] RetinexDual: Retinex-based Dual Nature Approach for Generalized Ultra-High-Definition Image Restoration

Mohab Kishawy, Ali Abdellatif Hussein, Jun Chen

Main category: cs.CV

TL;DR: RetinexDual is a novel Retinex theory-based framework for Ultra-High-Definition Image Restoration that uses two complementary sub-networks (SAMBA and FIA) to overcome limitations of traditional methods like downsampling and frequency-domain approaches.

Motivation: Traditional UHD IR methods face significant drawbacks: extreme downsampling causes irreversible information loss, while pure frequency-domain approaches are ineffective for spatially confined artifacts due to loss of degradation locality.

Method: RetinexDual leverages two sub-networks: SAMBA (Scale-Attentive maMBA) for reflectance correction using coarse-to-fine mechanism, and FIA (Frequency Illumination Adaptor) for precise color and illumination correction in frequency domain.

Result: RetinexDual outperforms recent methods qualitatively and quantitatively on four UHD IR tasks: deraining, deblurring, dehazing, and Low-Light Image Enhancement.

Conclusion: The framework demonstrates the importance of distinct designs for each branch and the effectiveness of its components, providing superior performance for generalized UHD image restoration tasks.

Abstract: Advancements in image sensing have elevated the importance of Ultra-High-Definition Image Restoration (UHD IR). Traditional methods, such as extreme downsampling or transformation from the spatial to the frequency domain, encounter significant drawbacks: downsampling induces irreversible information loss in UHD images, while our frequency analysis reveals that pure frequency-domain approaches are ineffective for spatially confined image artifacts, primarily due to the loss of degradation locality. To overcome these limitations, we present RetinexDual, a novel Retinex theory-based framework designed for generalized UHD IR tasks. RetinexDual leverages two complementary sub-networks: the Scale-Attentive maMBA (SAMBA) and the Frequency Illumination Adaptor (FIA). SAMBA, responsible for correcting the reflectance component, utilizes a coarse-to-fine mechanism to overcome the causal modeling of mamba, which effectively reduces artifacts and restores intricate details. On the other hand, FIA ensures precise correction of color and illumination distortions by operating in the frequency domain and leveraging the global context provided by it. Evaluating RetinexDual on four UHD IR tasks, namely deraining, deblurring, dehazing, and Low-Light Image Enhancement (LLIE), shows that it outperforms recent methods qualitatively and quantitatively. Ablation studies demonstrate the importance of employing distinct designs for each branch in RetinexDual, as well as the effectiveness of its various components.
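
Illustrative sketch (not from the paper): Retinex theory models an image as the element-wise product of a reflectance component and an illumination component, which is the split that the two branches described above operate on. The snippet below uses a classic hand-crafted estimate (illumination as the per-pixel maximum over RGB) purely to make the decomposition concrete; RetinexDual learns its components with SAMBA and FIA, and the function name `retinex_decompose` is an assumption.

```python
def retinex_decompose(pixels, eps=1e-6):
    """Split RGB pixels into (reflectance, illumination) with I = R * L.

    pixels: list of (r, g, b) tuples with values in [0, 1].
    Illumination is estimated as the max channel value per pixel, a simple
    heuristic and not the learned estimate used by RetinexDual.
    """
    illumination = [max(p) for p in pixels]
    reflectance = [
        tuple(c / max(l, eps) for c in p)
        for p, l in zip(pixels, illumination)
    ]
    return reflectance, illumination


def recompose(reflectance, illumination):
    """Multiply the two components back into an RGB image."""
    return [tuple(c * l for c in r) for r, l in zip(reflectance, illumination)]
```

Correcting the two components separately lets color and brightness distortions be handled globally while fine detail is restored in the reflectance.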

[202] ViewBridge: Revisiting Cross-View Localization from Image Matching

Panwang Xia, Qiong Wu, Lei Yu, Yi Liu, Mingtao Xiong, Xudong Lu, Yi Liu, Haoyu Guo, Yongxiang Yao, Junjian Zhang, Xiangyuan Cai, Hongwei Hu, Zhi Zheng, Yongjun Zhang, Yi Wan

Main category: cs.CV

TL;DR: A unified framework for cross-view localization that enhances matching and localization through geometric constraints and similarity refinement, achieving geometry-consistent correspondences across extreme viewpoints.

Motivation: Existing cross-view localization methods fail to establish fine-grained and geometrically reliable correspondences under large viewpoint variations, limiting accuracy and interpretability.

Method: Introduces a Surface Model for geometrically consistent BEV feature projection and SimRefiner for adaptive similarity distribution refinement, plus CVFM benchmark with pixel-level correspondences.

Result: Achieves geometry-consistent and fine-grained correspondences across extreme viewpoints, improving accuracy and stability of cross-view localization.

Conclusion: The proposed framework successfully addresses geometric consistency and match reliability in cross-view localization, demonstrating improved performance through unified matching and localization approach.

Abstract: Cross-view localization aims to estimate the 3-DoF pose of a ground-view image by aligning it with aerial or satellite imagery. Existing methods typically address this task through direct regression or feature alignment in a shared bird’s-eye view (BEV) space. Although effective for coarse alignment, these methods fail to establish fine-grained and geometrically reliable correspondences under large viewpoint variations, thereby limiting both the accuracy and interpretability of localization results. Consequently, we revisit cross-view localization from the perspective of image matching and propose a unified framework that enhances both matching and localization. Specifically, we introduce a Surface Model that constrains BEV feature projection to physically valid regions for geometric consistency, and a SimRefiner that adaptively refines similarity distributions to enhance match reliability. To further support research in this area, we present CVFM, the first benchmark with 32,509 cross-view image pairs annotated with pixel-level correspondences. Extensive experiments demonstrate that our approach achieves geometry-consistent and fine-grained correspondences across extreme viewpoints and further improves the accuracy and stability of cross-view localization.

[203] UniME-V2: MLLM-as-a-Judge for Universal Multimodal Embedding Learning

Tiancheng Gu, Kaicheng Yang, Kaichen Zhang, Xiang An, Ziyong Feng, Yueyi Zhang, Weidong Cai, Jiankang Deng, Lidong Bing

Main category: cs.CV

TL;DR: UniME-V2 is a universal multimodal embedding model that uses MLLMs to improve hard negative mining and semantic discrimination, achieving SOTA performance on retrieval tasks.

Motivation: Existing multimodal embedding models struggle with capturing subtle semantic differences, lack negative sample diversity, and have limited ability to distinguish false/hard negatives.

Method: Uses MLLM-as-a-Judge mechanism to assess semantic alignment and generate soft matching scores for hard negative mining, plus a reranker trained on mined negatives with joint pairwise and listwise optimization.

Result: Achieves state-of-the-art average performance across all tasks on the MMEB benchmark and multiple retrieval tasks.

Conclusion: The proposed approach effectively enhances discriminative capacity and semantic understanding in multimodal embeddings through advanced MLLM-based hard negative mining.

Abstract: Universal multimodal embedding models are foundational to various tasks. Existing approaches typically employ in-batch negative mining by measuring the similarity of query-candidate pairs. However, these methods often struggle to capture subtle semantic differences among candidates and lack diversity in negative samples. Moreover, the embeddings exhibit limited discriminative ability in distinguishing false and hard negatives. In this paper, we leverage the advanced understanding capabilities of MLLMs to enhance representation learning and present a novel Universal Multimodal Embedding (UniME-V2) model. Our approach first constructs a potential hard negative set through global retrieval. We then introduce the MLLM-as-a-Judge mechanism, which utilizes MLLMs to assess the semantic alignment of query-candidate pairs and generate soft semantic matching scores. These scores serve as a foundation for hard negative mining, mitigating the impact of false negatives and enabling the identification of diverse, high-quality hard negatives. Furthermore, the semantic matching scores are used as soft labels to mitigate the rigid one-to-one mapping constraint. By aligning the similarity matrix with the soft semantic matching score matrix, the model learns semantic distinctions among candidates, significantly enhancing its discriminative capacity. To further improve performance, we propose UniME-V2-Reranker, a reranking model trained on our mined hard negatives through a joint pairwise and listwise optimization approach. We conduct comprehensive experiments on the MMEB benchmark and multiple retrieval tasks, demonstrating that our method achieves state-of-the-art performance on average across all tasks.
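
Illustrative sketch (not from the paper): the abstract says the similarity matrix is aligned with the soft semantic matching score matrix but does not name the divergence used. The snippet below shows one common way to do this for a single query, using a KL divergence between two softmax distributions. The choice of KL, the temperature, and the function names are assumptions.

```python
import math


def softmax(values, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [v / temperature for v in values]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_label_alignment_loss(similarities, judge_scores, temperature=1.0):
    """KL(target || model) for one query over its candidate list.

    similarities: embedding similarities between the query and each candidate.
    judge_scores: soft matching scores from the MLLM judge for the same
    candidates, used here as soft labels instead of a one-hot target.
    """
    target = softmax(judge_scores, temperature)
    model = softmax(similarities, temperature)
    return sum(p * math.log(p / q) for p, q in zip(target, model) if p > 0)
```

Compared with a one-hot contrastive target, soft labels let a near-duplicate candidate keep some probability mass instead of being pushed away as a negative.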

[204] Drifting Away from Truth: GenAI-Driven News Diversity Challenges LVLM-Based Misinformation Detection

Fanxiao Li, Jiaying Wu, Tingchao Fu, Yunyun Dong, Bingbing Song, Wei Zhou

Main category: cs.CV

TL;DR: GenAI-driven news diversity causes multi-level drift that degrades multimodal misinformation detection systems, with current LVLM-based detectors showing significant performance drops and unstable reasoning.

Motivation: The proliferation of multimodal misinformation and rise of GenAI tools create highly varied content that challenges existing detection systems, requiring systematic study of their vulnerabilities.

Method: Introduce DriftBench benchmark with 16,000 news instances across six diversification categories, evaluating three tasks: truth verification robustness, adversarial evidence contamination susceptibility, and reasoning consistency analysis.

Result: Experiments with six state-of-the-art LVLM detectors show substantial performance drops (average F1 -14.8%), increasingly unstable reasoning traces, and severe failures under adversarial evidence injection.

Conclusion: Current MMD systems have fundamental vulnerabilities to GenAI-driven diversity, highlighting urgent need for more resilient approaches in the GenAI era.

Abstract: The proliferation of multimodal misinformation poses growing threats to public discourse and societal trust. While Large Vision-Language Models (LVLMs) have enabled recent progress in multimodal misinformation detection (MMD), the rise of generative AI (GenAI) tools introduces a new challenge: GenAI-driven news diversity, characterized by highly varied and complex content. We show that this diversity induces multi-level drift, comprising (1) model-level misperception drift, where stylistic variations disrupt a model’s internal reasoning, and (2) evidence-level drift, where expression diversity degrades the quality or relevance of retrieved external evidence. These drifts significantly degrade the robustness of current LVLM-based MMD systems. To systematically study this problem, we introduce DriftBench, a large-scale benchmark comprising 16,000 news instances across six categories of diversification. We design three evaluation tasks: (1) robustness of truth verification under multi-level drift; (2) susceptibility to adversarial evidence contamination generated by GenAI; and (3) analysis of reasoning consistency across diverse inputs. Experiments with six state-of-the-art LVLM-based detectors show substantial performance drops (average F1 -14.8%) and increasingly unstable reasoning traces, with even more severe failures under adversarial evidence injection. Our findings uncover fundamental vulnerabilities in existing MMD systems and suggest an urgent need for more resilient approaches in the GenAI era.

[205] Arbitrary-Scale 3D Gaussian Super-Resolution

Huimin Zeng, Yue Bai, Yun Fu

Main category: cs.CV

TL;DR: A novel 3D Gaussian Splatting framework that enables arbitrary-scale super-resolution rendering with a single model, addressing aliasing artifacts and maintaining real-time performance.

Motivation: Existing 3DGS super-resolution methods are limited to fixed scale factors and suffer from aliasing when rendering arbitrary scales, while post-processing approaches reduce efficiency.

Method: Integrated framework combining scale-aware rendering, generative prior-guided optimization, and progressive super-resolving to enable arbitrary-scale HR rendering with a single 3D model.

Result: Achieves 6.59 dB PSNR gain over vanilla 3DGS, preserves structural consistency across scales, and maintains real-time rendering at 85 FPS for 1080p resolution.

Conclusion: The proposed method successfully enables high-quality arbitrary-scale super-resolution rendering with a single 3D model while maintaining efficiency and structural consistency.

Abstract: Existing 3D Gaussian Splatting (3DGS) super-resolution methods typically perform high-resolution (HR) rendering of fixed scale factors, making them impractical for resource-limited scenarios. Directly rendering arbitrary-scale HR views with vanilla 3DGS introduces aliasing artifacts due to the lack of scale-aware rendering ability, while adding a post-processing upsampler for 3DGS complicates the framework and reduces rendering efficiency. To tackle these issues, we build an integrated framework that incorporates scale-aware rendering, generative prior-guided optimization, and progressive super-resolving to enable 3D Gaussian super-resolution of arbitrary scale factors with a single 3D model. Notably, our approach supports both integer and non-integer scale rendering to provide more flexibility. Extensive experiments demonstrate the effectiveness of our model in rendering high-quality arbitrary-scale HR views (6.59 dB PSNR gain over 3DGS) with a single model. It preserves structural consistency with LR views and across different scales, while maintaining real-time rendering speed (85 FPS at 1080p).
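
Illustrative sketch (not from the paper): PSNR is a logarithmic function of mean squared error, so a gain in dB corresponds to a multiplicative drop in MSE. The snippet below computes PSNR for one image pair and converts a dB gain into that factor. For a single pair, a 6.59 dB gain corresponds to roughly 4.6 times lower MSE; averages over a test set do not convert this cleanly. The function names are chosen here for illustration.

```python
import math


def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB for two equally sized flat pixel lists."""
    if len(reference) != len(rendered) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks for a given PSNR gain in dB."""
    return 10.0 ** (gain_db / 10.0)
```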

[206] ANTS: Adaptive Negative Textual Space Shaping for OOD Detection via Test-Time MLLM Understanding and Reasoning

Wenjie Zhu, Yabin Zhang, Xin Jin, Wenjun Zeng, Lei Zhang

Main category: cs.CV

TL;DR: ANTS uses multimodal LLMs to create adaptive negative textual spaces for OOD detection by generating expressive negative sentences from historical OOD samples and visually similar negative labels for near-OOD cases, achieving state-of-the-art performance on ImageNet.

Motivation: Existing OOD detection methods lack understanding of OOD images and struggle with near-OOD detection due to absence of semantically similar negative labels to ID labels.

Method: Leverage MLLMs to: 1) describe cached OOD images for expressive negative sentences (far-OOD), 2) generate visually similar negative labels for ID classes resembling historical test images (near-OOD), and 3) use adaptive weighted score to balance both spaces.

Result: On ImageNet benchmark, reduces FPR95 by 3.1%, establishing new state-of-the-art. Method is training-free and zero-shot, enabling high scalability.

Conclusion: ANTS effectively addresses both far-OOD and near-OOD detection by creating adaptive negative textual spaces through MLLM reasoning, achieving superior performance while maintaining scalability.

Abstract: The introduction of negative labels (NLs) has proven effective in enhancing Out-of-Distribution (OOD) detection. However, existing methods often lack an understanding of OOD images, making it difficult to construct an accurate negative space. Furthermore, the absence of negative labels semantically similar to ID labels constrains their capability in near-OOD detection. To address these issues, we propose shaping an Adaptive Negative Textual Space (ANTS) by leveraging the understanding and reasoning capabilities of multimodal large language models (MLLMs). Specifically, we cache images likely to be OOD samples from the historical test images and prompt the MLLM to describe these images, generating expressive negative sentences that precisely characterize the OOD distribution and enhance far-OOD detection. For the near-OOD setting, where OOD samples resemble the in-distribution (ID) subset, we cache the subset of ID classes that are visually similar to historical test images and then leverage MLLM reasoning to generate visually similar negative labels tailored to this subset, effectively reducing false negatives and improving near-OOD detection. To balance these two types of negative textual spaces, we design an adaptive weighted score that enables the method to handle different OOD task settings (near-OOD and far-OOD), making it highly adaptable in open environments. On the ImageNet benchmark, our ANTS significantly reduces the FPR95 by 3.1%, establishing a new state-of-the-art. Furthermore, our method is training-free and zero-shot, enabling high scalability.
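
Illustrative sketch (not from the paper): FPR95 is the false positive rate on OOD samples at the score threshold that still accepts 95% of in-distribution samples, so lower is better. The snippet below computes it under the assumption that a higher score means "more in-distribution"; the function name and the tie handling are choices made here.

```python
def fpr_at_95_tpr(id_scores, ood_scores):
    """Fraction of OOD samples accepted when 95% of ID samples are accepted.

    Assumes higher scores indicate in-distribution inputs. The threshold is
    the lowest score among the top 95% of ID samples (rounded up), and any
    sample scoring at or above it counts as accepted.
    """
    if not id_scores or not ood_scores:
        raise ValueError("both score lists must be non-empty")
    ranked = sorted(id_scores, reverse=True)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding.
    keep = (95 * len(ranked) + 99) // 100
    threshold = ranked[keep - 1]
    accepted_ood = sum(1 for s in ood_scores if s >= threshold)
    return accepted_ood / len(ood_scores)
```

Near-OOD inputs tend to score close to the ID threshold, which is why they dominate this metric and why negative labels that resemble ID classes can help.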

[207] UNIV: Unified Foundation Model for Infrared and Visible Modalities

Fangyuan Mao, Shuo Wang, Jilin Mei, Shun Lu, Chen Min, Fuyang Liu, Xiaokun Feng, Meiqi Wu, Yu Hu

Main category: cs.CV

TL;DR: UNIV is a unified foundation model for RGB and infrared modalities that addresses cross-modal degradation through patch cross-modal contrastive learning, achieving superior performance on infrared tasks while maintaining RGB accuracy.

Motivation: Foundation models suffer from substantial cross-modal degradation between RGB and infrared modalities due to pattern shortcuts that prioritize superficial sensor patterns over underlying semantics, limiting robustness in diverse weather and illumination conditions.

Method: UNIV uses Patch Cross-modal Contrastive Learning (PCCL) - a self-supervised strategy that constructs a unified cross-modal feature space by sampling pseudo patch pairs based on semantic similarity and aligning representations through attraction of related pairs and repulsion of unrelated ones.

Result: UNIV achieves superior performance on infrared tasks (+1.7 mIoU for semantic segmentation and +0.7 mAP for detection) while maintaining competitive accuracy on RGB tasks, using the comprehensive MVIP benchmark with 98,992 aligned image pairs.

Conclusion: The proposed UNIV model successfully addresses cross-modal degradation by focusing on semantic structure rather than pattern shortcuts, enabling robust joint RGB-infrared perception across diverse conditions.

Abstract: Joint RGB-infrared perception is essential for achieving robustness under diverse weather and illumination conditions. Although foundation models excel within single modalities, they suffer from substantial cross-modal degradation, an issue we attribute to a pattern shortcut, i.e., a modal bias that prioritizes superficial sensor patterns over underlying semantics. To address this problem, we introduce UNIV, a Unified foundation model for Infrared and Visible modalities. At the core of UNIV lies Patch Cross-modal Contrastive Learning (PCCL), a self-supervised contrastive learning strategy that constructs a unified cross-modal feature space. PCCL employs a frozen pre-trained model to sample pseudo patch pairs based on semantic similarity, and aligns infrared-visible representations by attracting semantically related pairs while repelling unrelated ones. This process simultaneously enhances cross-modal alignment and inter-class semantic separability, guiding the model to focus on semantic structure rather than falling into pattern shortcuts. To further enable cross-modal learning, we introduce MVIP, the most comprehensive visible-infrared benchmark to date, containing 98,992 precisely aligned image pairs across diverse scenes. Extensive experiments demonstrate UNIV’s superior performance on infrared tasks (+1.7 mIoU for semantic segmentation and +0.7 mAP for detection), while maintaining competitive accuracy on RGB tasks.
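
Illustrative sketch (not from the paper): the abstract describes PCCL as attracting semantically related infrared-visible patch pairs and repelling unrelated ones, without giving the loss. The snippet below shows a generic InfoNCE-style contrastive loss over patch embeddings as one way such an objective is often written. The loss form, the temperature value, and the `positive_index` pairing (which in PCCL would come from a frozen pre-trained model) are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("zero vector has no direction")
    return dot / (norm_u * norm_v)


def info_nce(anchors, candidates, positive_index, temperature=0.07):
    """Mean InfoNCE loss over anchor patches.

    anchors: embeddings from one modality (e.g. infrared patches).
    candidates: embeddings from the other modality (e.g. visible patches).
    positive_index[i]: index of the candidate treated as the positive for
    anchors[i]; all other candidates act as negatives.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, c) / temperature for c in candidates]
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[positive_index[i]]
    return total / len(anchors)
```

Minimizing this loss pulls each anchor toward its paired patch in the other modality and away from the rest, which is the attraction and repulsion behavior the abstract describes.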

[208] Wonder3D++: Cross-domain Diffusion for High-fidelity 3D Generation from a Single Image

Yuxiao Yang, Xiao-Xiao Long, Zhiyang Dou, Cheng Lin, Yuan Liu, Qingsong Yan, Yuexin Ma, Haoqian Wang, Zhiqiang Wu, Wei Yin

Main category: cs.CV

TL;DR: Wonder3D++ is a novel method for efficient high-fidelity textured mesh generation from single-view images using cross-domain diffusion and multi-view attention.

Motivation: To address limitations of existing methods: SDS-based approaches are slow with inconsistent geometry, while fast inference methods produce low-quality results lacking geometric details.

Method: Proposes cross-domain diffusion model generating multi-view normal maps and color images, multi-view cross-domain attention for consistency, and cascaded 3D mesh extraction algorithm for coarse-to-fine surface reconstruction.

Result: Achieves high-quality reconstruction results with robust generalization and good efficiency (~3 minutes per shape), outperforming prior works.

Conclusion: Wonder3D++ holistically improves quality, consistency, and efficiency of single-view 3D reconstruction through its novel cross-domain approach and efficient mesh extraction.

Abstract: In this work, we introduce Wonder3D++, a novel method for efficiently generating high-fidelity textured meshes from single-view images. Recent methods based on Score Distillation Sampling (SDS) have shown the potential to recover 3D geometry from 2D diffusion priors, but they typically suffer from time-consuming per-shape optimization and inconsistent geometry. In contrast, certain works directly produce 3D information via fast network inferences, but their results are often of low quality and lack geometric details. To holistically improve the quality, consistency, and efficiency of single-view reconstruction tasks, we propose a cross-domain diffusion model that generates multi-view normal maps and the corresponding color images. To ensure the consistency of generation, we employ a multi-view cross-domain attention mechanism that facilitates information exchange across views and modalities. Lastly, we introduce a cascaded 3D mesh extraction algorithm that derives high-quality surfaces from the multi-view 2D representations in only about 3 minutes in a coarse-to-fine manner. Our extensive evaluations demonstrate that our method achieves high-quality reconstruction results, robust generalization, and good efficiency compared to prior works. Code available at https://github.com/xxlong0/Wonder3D/tree/Wonder3D_Plus.

[209] Clothing agnostic Pre-inpainting Virtual Try-ON

Sehyun Kim, Hye Jun Lee, Jiwoo Lee, Taemin Lee

Main category: cs.CV

TL;DR: CaP-VTON improves virtual try-on by addressing clothing silhouette persistence and skin restoration issues through multi-category masking and skin inpainting preprocessing, achieving 92.5% accuracy in short-sleeved synthesis.

DetailsMotivation: To solve limitations in existing diffusion-based virtual try-on methods, particularly bottom detection inaccuracy and persistent clothing silhouettes in synthesis results.

Method: Integrates DressCode-based multi-category masking and Stable Diffusion-based skin inpainting preprocessing with a generated skin module to handle sleeve length conversions and improve full-body clothing synthesis naturalness.

Result: Achieved 92.5% short-sleeved synthesis accuracy (15.4% better than Leffa) and consistently reproduced reference clothing style and shape in visual evaluations.

Conclusion: The proposed structures maintain model-agnostic properties and can be applied to various diffusion-based virtual try-on systems, contributing to high-precision applications in e-commerce, custom styling, and avatar creation.

Abstract: With the development of deep learning technology, virtual try-on technology has gained important application value in the fields of e-commerce, fashion, and entertainment. The recently proposed Leffa technology has addressed the texture distortion problem of diffusion-based models, but it has limitations: bottom detection is inaccurate and the existing clothing silhouette persists in the synthesis results. To solve this problem, this study proposes CaP-VTON (Clothing Agnostic Pre-Inpainting Virtual Try-On). CaP-VTON integrates DressCode-based multi-category masking and Stable Diffusion-based skin inpainting preprocessing. In particular, a generated skin module was introduced to solve skin restoration problems that occur when long-sleeved images are converted to short-sleeved or sleeveless ones. This preprocessing structure improves the naturalness and consistency of full-body clothing synthesis and allows high-quality restoration that takes human posture and color into account. As a result, CaP-VTON achieved 92.5% short-sleeved synthesis accuracy, which is 15.4% better than Leffa, and consistently reproduced the style and shape of the reference clothing in visual evaluation. These structures maintain model-agnostic properties and are applicable to various diffusion-based virtual try-on systems; they can also contribute to applications that require high-precision virtual wearing, such as e-commerce, custom styling, and avatar creation.

[210] IWR-Bench: Can LVLMs reconstruct interactive webpage from a user interaction video?

Yang Chen, Minghao Liu, Yufan Shen, Yunwen Li, Tianyuan Huang, Xinyu Fang, Tianyu Zheng, Wenxuan Huang, Cheng Yang, Daocheng Fu, Jianbiao Mei, Rong Wu, Yunfei Zhao, Licheng Wen, Xuemeng Yang, Song Mao, Qunshu Lin, Zhi Yu, Yongliang Shen, Yu Qiao, Botian Shi

Main category: cs.CV

TL;DR: IWR-Bench is a new benchmark for evaluating Large Vision-Language Models on interactive webpage reconstruction from video, addressing limitations of static screenshot-to-code tasks by incorporating dynamic interactions from 113 real-world website tasks.

DetailsMotivation: Existing benchmarks focus on static screenshot-to-code tasks, overlooking dynamic interactions essential for real-world web applications. This gap limits evaluation of models' ability to handle interactive elements.

Method: Created IWR-Bench with 113 tasks from 100 real websites, featuring 1,001 actions and diverse interaction complexities. Includes user interaction videos and crawled static assets. Uses agent-as-a-judge framework with comprehensive metrics to assess functional correctness and visual fidelity.

Result: Extensive experiments on 28 LVLMs show poor performance: best model achieves only 36.35% overall score, with functional correctness (24.39% IFS) significantly lagging behind visual fidelity (64.25% VFS).

Conclusion: Current models struggle with temporal dynamics reasoning and event-driven logic synthesis. IWR-Bench establishes a challenging frontier for vision-language research, highlighting critical limitations in interactive webpage reconstruction capabilities.

Abstract: The webpage-to-code task requires models to understand visual representations of webpages and generate corresponding code. However, existing benchmarks primarily focus on static screenshot-to-code tasks, thereby overlooking the dynamic interactions fundamental to real-world web applications. To address this limitation, this paper introduces IWR-Bench, a novel benchmark for evaluating the capabilities of Large Vision-Language Models (LVLMs) in interactive webpage reconstruction from video. IWR-Bench comprises 113 meticulously curated tasks from 100 real-world websites, with 1,001 actions and featuring diverse interaction complexities (e.g., web games), visual styles, and domains. Aligning with standard web development practices, each task includes not only user interaction videos but also all crawled static assets (e.g., images, videos). This benchmark evaluates models on two fundamental challenges: comprehensive multi-modal reasoning to infer interaction logic from video and assets, and advanced code generation to translate this logic into functional code. An agent-as-a-judge framework with a comprehensive metric system automatically assesses the functional correctness and visual fidelity of generated webpages. Extensive experiments on 28 LVLMs reveal a significant challenge: the best model achieves an overall score of only 36.35%, as functional correctness (24.39% IFS) lags significantly behind visual fidelity (64.25% VFS). These results highlight critical limitations in current models’ ability to reason about temporal dynamics and synthesize event-driven logic, establishing IWR-Bench as a challenging frontier for vision-language research. The benchmark and evaluation code will be made publicly available at https://github.com/SIGMME/IWR-Bench.

[211] A Style is Worth One Code: Unlocking Code-to-Style Image Generation with Discrete Style Space

Huijie Liu, Shuhao Cui, Haoxiang Cao, Shuai Ma, Kai Wu, Guoliang Kang

Main category: cs.CV

TL;DR: CoTyle enables generating novel, consistent visual styles using just a numerical code, eliminating the need for complex prompts or reference images.

DetailsMotivation: Existing generative approaches struggle with style consistency, limited creativity, and complex style representations, requiring lengthy prompts or reference images.

Method: Train a discrete style codebook from images to extract style embeddings, then train an autoregressive style generator on these embeddings to synthesize novel style codes that guide a text-to-image diffusion model.

Result: CoTyle successfully generates diverse and consistent visual styles from numerical codes, demonstrating that a style can be effectively represented by a single code.

Conclusion: A style is worth one numerical code - CoTyle provides a simple yet powerful approach for code-to-style image generation, unlocking vast reproducible style spaces with minimal input.

Abstract: Innovative visual stylization is a cornerstone of artistic creation, yet generating novel and consistent visual styles remains a significant challenge. Existing generative approaches typically rely on lengthy textual prompts, reference images, or parameter-efficient fine-tuning to guide style-aware image generation, but often struggle with style consistency, limited creativity, and complex style representations. In this paper, we affirm that a style is worth one numerical code by introducing the novel task, code-to-style image generation, which produces images with novel, consistent visual styles conditioned solely on a numerical style code. To date, this field has only been primarily explored by the industry (e.g., Midjourney), with no open-source research from the academic community. To fill this gap, we propose CoTyle, the first open-source method for this task. Specifically, we first train a discrete style codebook from a collection of images to extract style embeddings. These embeddings serve as conditions for a text-to-image diffusion model (T2I-DM) to generate stylistic images. Subsequently, we train an autoregressive style generator on the discrete style embeddings to model their distribution, allowing the synthesis of novel style embeddings. During inference, a numerical style code is mapped to a unique style embedding by the style generator, and this embedding guides the T2I-DM to generate images in the corresponding style. Unlike existing methods, our method offers unparalleled simplicity and diversity, unlocking a vast space of reproducible styles from minimal input. Extensive experiments validate that CoTyle effectively turns a numerical code into a style controller, demonstrating a style is worth one code.

[212] Visual Odometry with Transformers

Vladimir Yugay, Duy-Kien Nguyen, Theo Gevers, Cees G. M. Snoek, Martin R. Oswald

Main category: cs.CV

TL;DR: VoT is a transformer-based visual odometry model that directly regresses relative pose, eliminating traditional components like bundle adjustment and feature matching, achieving 4x speedup over classical methods and 10x speedup over 3D foundation models.

DetailsMotivation: Classical VO approaches rely heavily on camera parameters and handcrafted components with complex bundle adjustment and feature matching, which are slow and don't scale well with large training data.

Method: Proposes Visual Odometry Transformer (VoT) that formulates monocular VO as direct relative pose regression, using an end-to-end transformer architecture without bundle adjustment, feature matching, or camera calibration.

Result: VoT is up to 4x faster than traditional approaches with competitive or better performance, runs 10x faster than recent 3D foundation models, and shows strong scaling behavior with model size and training data.

Conclusion: VoT effectively bridges the gap between optimization-based and end-to-end approaches, generalizing well in low-data regimes and unseen scenarios while eliminating traditional VO pipeline complexities.

Abstract: Despite the rapid development of large 3D models, classical optimization-based approaches dominate the field of visual odometry (VO). Thus, current approaches to VO heavily rely on camera parameters and many handcrafted components, most of which involve complex bundle adjustment and feature-matching processes. Although this reliance is largely disregarded in the literature, we find it problematic in terms of both (1) speed, as performing bundle adjustment requires a significant amount of time, and (2) scalability, as hand-crafted components struggle to learn from large-scale training data. In this work, we introduce a simple yet efficient architecture, Visual Odometry Transformer (VoT), that formulates monocular visual odometry as a direct relative pose regression problem. Our approach streamlines the monocular visual odometry pipeline in an end-to-end manner, effectively eliminating the need for handcrafted components such as bundle adjustment, feature matching, or camera calibration. We show that VoT is up to 4 times faster than traditional approaches, yet with competitive or better performance. Compared to recent 3D foundation models, VoT runs 10 times faster with strong scaling behavior in terms of both model sizes and training data. Moreover, VoT generalizes well in both low-data regimes and previously unseen scenarios, reducing the gap between optimization-based and end-to-end approaches.
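To illustrate what "direct relative pose regression" implies downstream, the sketch below chains per-frame relative poses into a camera trajectory. It does not reproduce VoT itself: the regressor is replaced by hand-written poses, and the helper names (`se3`, `rot_z`, `accumulate_trajectory`) and the convention that each relative pose expresses frame k in the coordinates of frame k-1 are assumptions made for this example.

```python
import numpy as np

def se3(rotation, translation):
    """Pack a 3x3 rotation and a 3-vector translation into a 4x4 transform."""
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = translation
    return T

def rot_z(theta):
    """Rotation by theta radians about the z axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def accumulate_trajectory(relative_poses):
    """Chain relative poses (frame k expressed in frame k-1) into absolute poses.

    The first camera is taken as the world origin (an assumption of this sketch).
    """
    poses = [np.eye(4)]
    for T_rel in relative_poses:
        poses.append(poses[-1] @ T_rel)
    return np.stack(poses)

# Hypothetical stand-in for regressor outputs: move one unit along the
# camera's x axis and turn 90 degrees, repeated four times (a square loop).
step = se3(rot_z(np.pi / 2), np.array([1.0, 0.0, 0.0]))
trajectory = accumulate_trajectory([step] * 4)
positions = trajectory[:, :3, 3]  # camera centers along the path
```

Because every absolute pose is a product of predicted relative poses, errors accumulate along the chain; this is the drift that optimization-based pipelines counter with bundle adjustment.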

[213] Label-Efficient Cross-Modality Generalization for Liver Segmentation in Multi-Phase MRI

Quang-Khai Bui-Tran, Minh-Toan Dinh, Thanh-Huy Nguyen, Ba-Thinh Lam, Mai-Anh Vu, Ulas Bagci

Main category: cs.CV

TL;DR: A label-efficient liver segmentation method for multi-phase MRI that uses foundation model adaptation and co-training to handle limited labeled data, unlabeled sequences, and cross-modality generalization without spatial registration.

DetailsMotivation: Liver segmentation in multi-phase MRI is crucial for fibrosis assessment, but faces challenges with scarce labeled data, uneven distribution across modalities/vendors, spatial misalignment, and missing phases in real-world clinical settings.

Method: Integrates a foundation-scale 3D segmentation backbone adapted via fine-tuning, co-training with cross pseudo supervision to leverage unlabeled volumes, and a standardized preprocessing pipeline without requiring spatial registration.

Result: The model learns to generalize across MRI phases and vendors, demonstrating robust segmentation performance in both labeled and unlabeled domains.

Conclusion: The approach shows effectiveness for label-efficient liver segmentation in multi-phase, multi-vendor MRI and highlights the potential of combining foundation model adaptation with co-training for real-world clinical imaging tasks.

Abstract: Accurate liver segmentation in multi-phase MRI is vital for liver fibrosis assessment, yet labeled data is often scarce and unevenly distributed across imaging modalities and vendor systems. We propose a label-efficient segmentation approach that promotes cross-modality generalization under real-world conditions, where GED4 hepatobiliary-phase annotations are limited, non-contrast sequences (T1WI, T2WI, DWI) are unlabeled, and spatial misalignment and missing phases are common. Our method integrates a foundation-scale 3D segmentation backbone adapted via fine-tuning, co-training with cross pseudo supervision to leverage unlabeled volumes, and a standardized preprocessing pipeline. Without requiring spatial registration, the model learns to generalize across MRI phases and vendors, demonstrating robust segmentation performance in both labeled and unlabeled domains. Our results exhibit the effectiveness of our proposed label-efficient baseline for liver segmentation in multi-phase, multi-vendor MRI and highlight the potential of combining foundation model adaptation with co-training for real-world clinical imaging tasks.
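The co-training term mentioned above can be sketched in a few lines. In cross pseudo supervision, two differently initialized networks each produce hard pseudo-labels on unlabeled data, and each network is trained against the other's labels. The NumPy code below only computes the loss value for flattened per-voxel logits; the function names and toy inputs are assumptions, and a real training loop would use an autodiff framework and stop gradients through the pseudo-labels.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_entropy(logits, labels):
    """Mean cross-entropy of logits (N, C) against integer labels (N,)."""
    probs = softmax(logits)
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.log(picked + 1e-12).mean())

def cps_loss(logits_a, logits_b):
    """Cross pseudo supervision on unlabeled voxels.

    Each network is supervised by the other's argmax prediction. In a real
    framework the pseudo-labels are treated as constants (no gradient).
    """
    pseudo_a = logits_a.argmax(axis=-1)
    pseudo_b = logits_b.argmax(axis=-1)
    return cross_entropy(logits_a, pseudo_b) + cross_entropy(logits_b, pseudo_a)

# Toy logits for two voxels and two classes (background, liver).
confident = np.array([[5.0, -5.0], [-5.0, 5.0]])
loss_when_agreeing = cps_loss(confident, confident)
loss_when_disagreeing = cps_loss(confident, -confident)
```

The loss is near zero when the two networks agree confidently and large when they disagree, which is what pushes them toward consistent predictions on unlabeled sequences.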

[214] DINOv3 as a Frozen Encoder for CRPS-Oriented Probabilistic Rainfall Nowcasting

Luciano Araujo Dourado Filho, Almir Moreira da Silva Neto, Anthony Miyaguchi, Rodrigo Pereira David, Rodrigo Tripodi Calumby, Lukáš Picek

Main category: cs.CV

TL;DR: A competitive probabilistic rainfall nowcasting method using a video projector with a lightweight probabilistic head attached to a pre-trained satellite vision encoder, achieving 26% effectiveness gain over 3D-UNET baselines.

DetailsMotivation: To develop a computationally efficient approach for probabilistic rainfall nowcasting that outperforms existing methods.

Method: Use a video projector (V-JEPA Vision Transformer) with a lightweight probabilistic head attached to a pre-trained satellite vision encoder (DINOv3-SAT493M) to map encoder tokens into a discrete empirical CDF over 4-hour accumulated rainfall, optimized end-to-end over the Ranked Probability Score (RPS). Compare with 3D-UNET baselines trained with aggregate RPS and per-pixel Gamma-Hurdle objective.

Result: Achieved CRPS of 3.5102 on Weather4Cast 2025 benchmark, representing approximately 26% effectiveness gain against the best 3D-UNET baseline.

Conclusion: The proposed method demonstrates promising performance and significant improvement over traditional 3D-UNET approaches for probabilistic rainfall nowcasting.

Abstract: This paper proposes a competitive and computationally efficient approach to probabilistic rainfall nowcasting. A video projector (V-JEPA Vision Transformer) paired with a lightweight probabilistic head is attached to a pre-trained satellite vision encoder (DINOv3-SAT493M) to map encoder tokens into a discrete empirical CDF (eCDF) over 4-hour accumulated rainfall. The projector-head is optimized end-to-end over the Ranked Probability Score (RPS). As an alternative, 3D-UNET baselines trained with an aggregate Ranked Probability Score and a per-pixel Gamma-Hurdle objective are used. On the Weather4Cast 2025 benchmark, the proposed method achieved promising performance, with a CRPS of 3.5102, which represents an effectiveness gain of approximately 26% over the best 3D-UNET.
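For readers unfamiliar with the metric, the Ranked Probability Score compares a predicted cumulative distribution with the step-function CDF of the observed value, bin by bin. The sketch below uses the plain unweighted sum of squared CDF differences; the thresholds, the forecast values, and the absence of bin-width weighting are assumptions for illustration and are not taken from the paper or the benchmark.

```python
import numpy as np

def ranked_probability_score(pred_cdf, observed, thresholds):
    """Sum of squared differences between predicted and observed CDFs.

    pred_cdf[k] is the predicted probability that rainfall <= thresholds[k].
    The observed CDF is 1 at every threshold at or above the observation.
    """
    pred_cdf = np.asarray(pred_cdf, dtype=float)
    obs_cdf = (observed <= np.asarray(thresholds, dtype=float)).astype(float)
    return float(np.sum((pred_cdf - obs_cdf) ** 2))

# Hypothetical rainfall thresholds (mm) and one predicted eCDF.
thresholds = np.array([0.0, 1.0, 2.0, 4.0, 8.0])
pred_cdf = np.array([0.2, 0.5, 0.8, 0.95, 1.0])
score = ranked_probability_score(pred_cdf, observed=1.5, thresholds=thresholds)

# A forecast that puts all mass exactly where the observation falls scores 0.
perfect = ranked_probability_score([0.0, 0.0, 1.0, 1.0, 1.0], 1.5, thresholds)
```

The RPS is the discrete counterpart of the CRPS, which integrates the same squared CDF difference over a continuous range of thresholds.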

[215] Differentiable, Bit-shifting, and Scalable Quantization without training neural network from scratch

Zia Badar

Main category: cs.CV

TL;DR: This paper presents a differentiable quantization method for neural networks that provides convergence guarantees and supports multi-bit quantization, achieving near-full-precision accuracy with weight-only quantization and state-of-the-art results with weight+activation quantization.

DetailsMotivation: Previous quantization methods lacked differentiability (using non-differentiable approaches with manually set derivatives) and convergence guarantees. Also, existing shift/logarithmic quantization approaches either avoided activation quantization or achieved poor accuracy, and couldn't scale beyond 1-bit quantization.

Method: The authors propose a differentiable quantization approach that supports n-bit quantization (including shift bit quantization) and provides proof of convergence to optimal neural networks. The method enables both weight and activation quantization using logarithmic values of form 2^n.

Result: On ImageNet with ResNet18: weight-only quantization achieves <1% accuracy drop compared to full precision; weight+activation quantization achieves accuracy comparable to SOTA approaches. Both results are achieved in only 15 training epochs. Inference requires slightly more CPU instructions than 1-bit quantization but no high-precision multiplication.

Conclusion: The proposed differentiable quantization method provides convergence guarantees, supports multi-bit quantization, and achieves excellent accuracy-efficiency trade-offs, making it a practical solution for neural network quantization.

Abstract: Quantization of neural networks reduces the compute and memory requirements of inference. Previous work in quantization lacks two important aspects that this work provides. First, almost all previous work used a non-differentiable approach in which the derivative is set manually during backpropagation, which makes the learning ability of the algorithm questionable; our approach is not only differentiable, we also provide a proof of its convergence to the optimal neural network. Second, previous work in shift/logarithmic quantization has either avoided quantizing activations along with weights or achieved lower accuracy. Learning logarithmic quantized values of the form $2^n$ requires a quantization function that scales beyond 1-bit quantization, which is another benefit of our method: it provides $n$-bit quantization as well. Tested on image classification with the ImageNet dataset and ResNet18, our approach with weight-only shift-bit quantization loses less than 1 percent accuracy relative to full precision while training for only 15 epochs. With both weight and activation shift-bit quantization, it achieves accuracy comparable to SOTA approaches in 15 training epochs, at a slightly higher inference cost (only more CPU instructions) than 1-bit quantization (without logarithmic quantization) and without requiring any higher-precision multiplication.
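The appeal of shift (power-of-two) quantization is that multiplying by a weight of the form ±2^n reduces to a bit shift. The sketch below shows that inference-side idea only. It uses plain nearest-power-of-two rounding in the log domain, which is not the differentiable quantizer the paper proposes, and the exponent range and function names are assumptions.

```python
import numpy as np

def quantize_to_power_of_two(weights, min_exp=-4, max_exp=0):
    """Round each weight to sign * 2**n with integer n in [min_exp, max_exp].

    Returns the signs and exponents separately. Zero weights get sign 0.
    """
    weights = np.asarray(weights, dtype=float)
    sign = np.sign(weights)
    # Floor tiny magnitudes so log2 never sees zero.
    magnitude = np.maximum(np.abs(weights), 2.0 ** (min_exp - 1))
    exponent = np.clip(np.round(np.log2(magnitude)), min_exp, max_exp)
    return sign.astype(int), exponent.astype(int)

def shift_multiply(activation, sign, exponent):
    """Multiply an integer activation by sign * 2**exponent using shifts only.

    A negative exponent becomes a right shift, which discards low bits.
    """
    activation, exponent = int(activation), int(exponent)
    shifted = activation << exponent if exponent >= 0 else activation >> -exponent
    return int(sign) * shifted

signs, exponents = quantize_to_power_of_two([0.3, -0.5, 0.9, 0.0])
products = [shift_multiply(64, s, e) for s, e in zip(signs, exponents)]
```

Here 0.3 rounds to 2^-2 and -0.5 is exactly -2^-1, so an integer activation of 64 yields 16 and -32 without any multiplication by a real-valued weight.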

[216] Automated Detection of Visual Attribute Reliance with a Self-Reflective Agent

Christy Li, Josep Lopez Camuñas, Jake Thomas Touchet, Jacob Andreas, Agata Lapedriza, Antonio Torralba, Tamar Rott Shaham

Main category: cs.CV

TL;DR: An automated framework using self-reflective agents to detect visual attribute dependencies in vision models through iterative hypothesis generation and testing.

DetailsMotivation: To detect unintended reliance on specific visual features in vision models, ensuring robustness, preventing overfitting, and avoiding spurious correlations.

Method: Self-reflective agent that systematically generates and tests hypotheses about visual attributes, iteratively refining hypotheses based on experimental outcomes and using self-evaluation protocols.

Result: Agent performance consistently improves with self-reflection, showing significant performance increase over non-reflective baselines on 130 models across 18 categories. Successfully identifies real-world dependencies in CLIP’s vision encoder and YOLOv8.

Conclusion: The self-reflective framework effectively detects visual attribute dependencies in vision models, demonstrating improved performance through iterative refinement and validation.

Abstract: When a vision model performs image recognition, which visual attributes drive its predictions? Detecting unintended reliance on specific visual features is critical for ensuring model robustness, preventing overfitting, and avoiding spurious correlations. We introduce an automated framework for detecting such dependencies in trained vision models. At the core of our method is a self-reflective agent that systematically generates and tests hypotheses about visual attributes that a model may rely on. This process is iterative: the agent refines its hypotheses based on experimental outcomes and uses a self-evaluation protocol to assess whether its findings accurately explain model behavior. When inconsistencies arise, the agent self-reflects over its findings and triggers a new cycle of experimentation. We evaluate our approach on a novel benchmark of 130 models designed to exhibit diverse visual attribute dependencies across 18 categories. Our results show that the agent’s performance consistently improves with self-reflection, with a significant performance increase over non-reflective baselines. We further demonstrate that the agent identifies real-world visual attribute dependencies in state-of-the-art models, including CLIP’s vision encoder and the YOLOv8 object detector.

[217] FairJudge: MLLM Judging for Social Attributes and Prompt Image Alignment

Zahraa Al Sahili, Maryam Fetanat, Maimuna Nowaz, Ioannis Patras, Matthew Purver

Main category: cs.CV

TL;DR: FairJudge is a lightweight protocol using multimodal LLMs as fair judges to evaluate text-to-image model alignment with prompts and social attributes, providing accountable, evidence-based scoring that outperforms contrastive and face-centric baselines.

DetailsMotivation: Current T2I evaluation methods lack reliable ways to assess image-prompt alignment and social attribute treatment, relying on surface cues and lacking calibrated abstention for weakly visible attributes like religion, culture, and disability.

Method: Uses instruction-following multimodal LLMs as fair judges with explanation-oriented rubrics mapped to [-1,1], constrained judgments to closed label sets, evidence grounding in visible content, and mandatory abstention when cues are insufficient.

Result: Outperforms contrastive and face-centric baselines on demographic prediction across gender, race, age, religion, culture, and disability datasets, while maintaining high profession accuracy on IdenProf, FairCoT-Professions, and new DIVERSIFY-Professions.

Conclusion: FairJudge enables more reliable, reproducible fairness audits for T2I systems by providing accountable, evidence-aware decisions without requiring generator modifications, addressing evaluation fairness directly.

Abstract: Text-to-image (T2I) systems lack simple, reproducible ways to evaluate how well images match prompts and how models treat social attributes. Common proxies – face classifiers and contrastive similarity – reward surface cues, lack calibrated abstention, and miss attributes only weakly visible (for example, religion, culture, disability). We present FairJudge, a lightweight protocol that treats instruction-following multimodal LLMs as fair judges. It scores alignment with an explanation-oriented rubric mapped to [-1, 1]; constrains judgments to a closed label set; requires evidence grounded in the visible content; and mandates abstention when cues are insufficient. Unlike CLIP-only pipelines, FairJudge yields accountable, evidence-aware decisions; unlike mitigation that alters generators, it targets evaluation fairness. We evaluate gender, race, and age on FairFace, PaTA, and FairCoT; extend to religion, culture, and disability; and assess profession correctness and alignment on IdenProf, FairCoT-Professions, and our new DIVERSIFY-Professions. We also release DIVERSIFY, a 469-image corpus of diverse, non-iconic scenes. Across datasets, judge models outperform contrastive and face-centric baselines on demographic prediction and improve mean alignment while maintaining high profession accuracy, enabling more reliable, reproducible fairness audits.
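The protocol's constraints (a closed label set, evidence grounding, mandatory abstention, and a rubric mapped to [-1, 1]) can be pictured as a post-processing step on the judge's raw answer. The sketch below is hypothetical throughout: the five-level rubric, the linear mapping, the `Judgment` fields, and the rule that missing evidence forces abstention are illustrative assumptions, not the released FairJudge implementation.

```python
from dataclasses import dataclass

ABSTAIN = "abstain"

@dataclass
class Judgment:
    label: str      # one of the allowed labels, or ABSTAIN
    score: float    # alignment score in [-1, 1]
    evidence: str   # visible content cited by the judge

def rubric_to_score(level, num_levels=5):
    """Map an ordinal rubric level 1..num_levels linearly onto [-1, 1]."""
    if not 1 <= level <= num_levels:
        raise ValueError("rubric level out of range")
    return 2.0 * (level - 1) / (num_levels - 1) - 1.0

def constrain(raw_label, rubric_level, evidence, allowed_labels):
    """Enforce the closed label set and abstain when evidence is missing."""
    label = raw_label.strip().lower()
    evidence = evidence.strip()
    if label not in allowed_labels or not evidence:
        label = ABSTAIN
    return Judgment(label, rubric_to_score(rubric_level), evidence)

allowed = {"doctor", "chef", "engineer"}
grounded = constrain(" Chef ", 5, "white toque and kitchen setting", allowed)
ungrounded = constrain("doctor", 3, "", allowed)
off_list = constrain("pilot", 4, "uniform with epaulettes", allowed)
```

Keeping the constraint logic outside the model means any instruction-following judge can be swapped in while the audit rules stay fixed.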

[218] CompAgent: An Agentic Framework for Visual Compliance Verification

Rahul Ghosh, Baishali Chaudhury, Hari Prasanna Das, Meghana Ashok, Ryan Razkenari, Sungmin Hong, Chun-Hao Liu

Main category: cs.CV

TL;DR: CompAgent is an agentic framework that augments MLLMs with visual tools for visual compliance verification, outperforming specialized classifiers and achieving 76% F1 score on UnsafeBench.

DetailsMotivation: Visual compliance verification is critical in media/entertainment but underexplored. Existing methods are costly and limited, while MLLMs struggle with fine-grained visual details and structured rule application.

Method: Augments MLLMs with visual tools (object detectors, face analyzers, NSFW detectors, captioning models) and uses a planning agent to dynamically select tools based on compliance policy, plus a verification agent for multimodal reasoning.

Result: Outperforms specialized classifiers, direct MLLM prompting, and routing baselines, achieving up to 76% F1 score and 10% improvement over state-of-the-art on UnsafeBench dataset.

Conclusion: Agentic planning and tool-augmented reasoning enable scalable, accurate, and adaptable visual compliance verification.

Abstract: Visual compliance verification is a critical yet underexplored problem in computer vision, especially in domains such as media, entertainment, and advertising where content must adhere to complex and evolving policy rules. Existing methods often rely on task-specific deep learning models trained on manually labeled datasets, which are costly to build and limited in generalizability. While recent Multimodal Large Language Models (MLLMs) offer broad real-world knowledge and policy understanding, they struggle to reason over fine-grained visual details and apply structured compliance rules effectively on their own. In this paper, we propose CompAgent, the first agentic framework for visual compliance verification. CompAgent augments MLLMs with a suite of visual tools (such as object detectors, face analyzers, NSFW detectors, and captioning models) and introduces a planning agent that dynamically selects appropriate tools based on the compliance policy. A compliance verification agent then integrates image, tool outputs, and policy context to perform multimodal reasoning. Experiments on public benchmarks show that CompAgent outperforms specialized classifiers, direct MLLM prompting, and curated routing baselines, achieving up to 76% F1 score and a 10% improvement over the state-of-the-art on the UnsafeBench dataset. Our results demonstrate the effectiveness of agentic planning and robust tool-augmented reasoning for scalable, accurate, and adaptable visual compliance verification.

[219] Alpha Divergence Losses for Biometric Verification

Dimitrios Koutsianos, Ladislav Mosner, Yannis Panagakis, Themos Stafylakis

Main category: cs.CV

TL;DR: The paper introduces two novel margin-based α-divergence losses (Q-Margin and A3M) for face and speaker verification, addressing training instability in A3M and achieving significant performance gains on challenging benchmarks, particularly at low false acceptance rates.

DetailsMotivation: Existing margin-based softmax losses like CosFace and ArcFace drive performance in verification tasks, but α-divergence losses offer compelling alternatives for sparse solutions. However, integrating angular margin - crucial for verification - is not straightforward with α-divergence.

Method: Proposed two distinct ways to integrate angular margin into α-divergence: Q-Margin (margin in reference measure) and A3M (margin in logits). Addressed A3M training instability with prototype re-initialization strategy.

Result: Achieved significant performance gains on IJB-B and IJB-C face verification benchmarks and strong performance on VoxCeleb speaker verification. Models significantly outperform baselines at low false acceptance rates.

Conclusion: The proposed margin-based α-divergence losses provide effective alternatives to traditional softmax losses, particularly valuable for high-security applications where minimizing false authentications is crucial.

Abstract: Performance in face and speaker verification is largely driven by margin-based softmax losses like CosFace and ArcFace. Recently introduced α-divergence loss functions offer a compelling alternative, particularly for their ability to induce sparse solutions (when α > 1). However, integrating an angular margin, which is crucial for verification tasks, is not straightforward. We find this integration can be achieved in at least two distinct ways: via the reference measure (prior probabilities) or via the logits (unnormalized log-likelihoods). In this paper, we explore both pathways, deriving two novel margin-based α-divergence losses: Q-Margin (margin in the reference measure) and A3M (margin in the logits). We identify and address a critical training instability in A3M, caused by the interplay of penalized logits and sparsity, with a simple yet effective prototype re-initialization strategy. Our methods achieve significant performance gains on the challenging IJB-B and IJB-C face verification benchmarks. We demonstrate similarly strong performance in speaker verification on VoxCeleb. Crucially, our models significantly outperform strong baselines at low false acceptance rates (FAR). This capability is crucial for practical high-security applications, such as banking authentication, where minimizing false authentications is paramount.

[220] Spot The Ball: A Benchmark for Visual Social Inference

Neha Balamurugan, Sarah Wu, Adam Chun, Gabe Gaw, Cristobal Eyzaguirre, Tobias Gerstenberg

Main category: cs.CV

TL;DR: Spot The Ball benchmark evaluates visual social inference in VLMs using sports images where the ball is removed, revealing humans outperform models 2-3x by leveraging social cues like gaze and pose.

DetailsMotivation: To assess visual social inference - the human ability to infer hidden scene elements from behavioral cues - which is critical for developing human-like AI agents.

Method: Created a benchmark using soccer, basketball, and volleyball images with removed balls, evaluated four state-of-the-art VLMs with three prompting strategies against human baselines.

Result: Humans were consistently 2-3 times more accurate (20-34%) than models (≤17%) across all sports. Models relied on superficial spatial heuristics while humans used social cues like gaze and body pose.

Conclusion: There’s a persistent gap in visual social reasoning between humans and models, highlighting the need for architectures that explicitly encode structured behavioral cues for robust inference.

Abstract: Humans excel at visual social inference, the ability to infer hidden elements of a scene from subtle behavioral cues such as other people’s gaze, pose, and orientation. This ability drives everyday social reasoning in humans and is critical for developing more human-like AI agents. We introduce Spot The Ball, a challenging benchmark for evaluating visual social inference in vision-language models (VLMs) using sports as a test domain. The task is to localize a removed sports ball from soccer, basketball, and volleyball images. We present a curated evaluation set with human baselines and a scalable pipeline for generating additional test items. We evaluate four state-of-the-art VLMs (Gemini, GPT, LLaMA, Qwen) using three prompting strategies, finding that humans are consistently two to three times more accurate (20-34%) than models (≤ 17%) across all sports. Our analyses show that models rely on superficial spatial heuristics, such as guessing near the image center or nearby players, while humans leverage social cues like gaze direction and body pose. These findings reveal a persistent human-model gap in visual social reasoning and underscore the need for architectures that explicitly encode structured behavioral cues to achieve robust, human-like inference.

[221] H-CNN-ViT: A Hierarchical Gated Attention Multi-Branch Model for Bladder Cancer Recurrence Prediction

Xueyang Li, Zongren Wang, Yuliang Zhang, Zixuan Pan, Yu-Jen Chen, Nishchal Sapkota, Gelei Xu, Danny Z. Chen, Yiyu Shi

Main category: cs.CV

TL;DR: Proposes H-CNN-ViT, a hierarchical multi-branch model for bladder cancer recurrence prediction using multi-sequence MRI, achieving 78.6% AUC and introducing a dedicated dataset.

DetailsMotivation: Bladder cancer has high recurrence rates (up to 78%) requiring accurate monitoring, but MRI interpretation is challenging due to post-surgical changes, and lack of dedicated datasets hinders AI development.

Method: H-CNN-ViT model with hierarchical gated attention that selectively weights features from global (ViT) and local (CNN) paths, processing each MRI modality independently for optimal feature integration.

Result: Achieved 78.6% AUC on their curated dataset, surpassing state-of-the-art models for bladder cancer recurrence prediction.

Conclusion: The proposed H-CNN-ViT model and dedicated dataset provide an effective solution for bladder cancer recurrence prediction, establishing a benchmark for future research in this domain.

Abstract: Bladder cancer is one of the most prevalent malignancies worldwide, with a recurrence rate of up to 78%, necessitating accurate post-operative monitoring for effective patient management. Multi-sequence contrast-enhanced MRI is commonly used for recurrence detection; however, interpreting these scans remains challenging, even for experienced radiologists, due to post-surgical alterations such as scarring, swelling, and tissue remodeling. AI-assisted diagnostic tools have shown promise in improving bladder cancer recurrence prediction, yet progress in this field is hindered by the lack of dedicated multi-sequence MRI datasets for recurrence assessment. In this work, we first introduce a curated multi-sequence, multi-modal MRI dataset specifically designed for bladder cancer recurrence prediction, establishing a valuable benchmark for future research. We then propose H-CNN-ViT, a new Hierarchical Gated Attention Multi-Branch model that enables selective weighting of features from the global (ViT) and local (CNN) paths based on contextual demands, achieving a balanced and targeted feature fusion. Our multi-branch architecture processes each modality independently, ensuring that the unique properties of each imaging channel are optimally captured and integrated. Evaluated on our dataset, H-CNN-ViT achieves an AUC of 78.6%, surpassing state-of-the-art models. Our model is publicly available at https://github.com/XLIAaron/H-CNN-ViT.
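
The selective weighting of global (ViT) and local (CNN) features described above can be illustrated with a very small gated fusion. This is only a sketch under stated assumptions, not the authors' architecture: the function `gated_fusion`, the single scalar gate, and the parameters `gate_weights` and `gate_bias` are hypothetical names chosen for illustration.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping a real score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(global_feat, local_feat, gate_weights, gate_bias=0.0):
    """Blend two equal-length feature vectors with one input-dependent gate.

    Assumption for illustration: the gate is a sigmoid of a linear score
    computed from the concatenated features. The paper's hierarchical gated
    attention is more elaborate than this.
    """
    if len(global_feat) != len(local_feat):
        raise ValueError("feature vectors must have the same length")
    concat = list(global_feat) + list(local_feat)
    if len(gate_weights) != len(concat):
        raise ValueError("gate_weights must match the concatenated length")
    score = sum(w * x for w, x in zip(gate_weights, concat)) + gate_bias
    gate = sigmoid(score)  # weight given to the global path
    fused = [gate * g + (1.0 - gate) * l
             for g, l in zip(global_feat, local_feat)]
    return fused, gate
```

With all-zero gate weights the gate is 0.5 and the result is the plain average of the two paths; a learned gate would shift that balance per input.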

[222] Causal Tracing of Object Representations in Large Vision Language Models: Mechanistic Interpretability and Hallucination Mitigation

Qiming Li, Zekai Ye, Xiaocheng Feng, Weihong Zhong, Weitao Ma, Xiachong Feng

Main category: cs.CV

TL;DR: FCCT framework systematically analyzes LVLM mechanisms, revealing MHSA’s role in cross-modal aggregation and FFN’s hierarchical processing. IRI technique enhances perception and reduces hallucination through targeted interventions.

DetailsMotivation: Existing LVLM interpretability analyses are insufficient, lacking comprehensive coverage of visual/textual tokens, model components, and layers, limiting insights for improving output faithfulness and hallucination mitigation.

Method: Introduced FCCT framework for fine-grained cross-modal causal tracing, analyzing all visual/textual tokens, MHSA, FFNs, and hidden states across all decoder layers. Proposed IRI for training-free inference-time interventions.

Result: FCCT revealed MHSA in middle layers aggregates cross-modal information, while FFNs show three-stage hierarchical progression for visual object storage/transfer. IRI achieved SOTA performance across 5 benchmarks with preserved inference speed.

Conclusion: FCCT provides comprehensive mechanistic insights into LVLMs, enabling effective interventions like IRI that enhance perception and mitigate hallucination while maintaining model efficiency.

Abstract: Despite the remarkable advancements of Large Vision-Language Models (LVLMs), their mechanistic interpretability remains underexplored. Existing analyses are insufficiently comprehensive and lack examination covering visual and textual tokens, model components, and the full range of layers. This limitation restricts actionable insights to improve the faithfulness of model output and the development of downstream tasks, such as hallucination mitigation. To address this limitation, we introduce the Fine-grained Cross-modal Causal Tracing (FCCT) framework, which systematically quantifies the causal effects on visual object perception. FCCT conducts fine-grained analysis covering the full range of visual and textual tokens, three core model components including multi-head self-attention (MHSA), feed-forward networks (FFNs), and hidden states, across all decoder layers. Our analysis is the first to demonstrate that MHSAs of the last token in middle layers play a critical role in aggregating cross-modal information, while FFNs exhibit a three-stage hierarchical progression for the storage and transfer of visual object representations. Building on these insights, we propose Intermediate Representation Injection (IRI), a training-free inference-time technique that reinforces visual object information flow by precisely intervening on cross-modal representations at specific components and layers, thereby enhancing perception and mitigating hallucination. Consistent improvements across five widely used benchmarks and LVLMs demonstrate that IRI achieves state-of-the-art performance, while preserving inference speed and other foundational performance.

[223] Cheating Stereo Matching in Full-scale: Physical Adversarial Attack against Binocular Depth Estimation in Autonomous Driving

Kangqiao Zhao, Shuo Huai, Xurui Song, Jun Luo

Main category: cs.CV

TL;DR: First 3D texture-enabled physical adversarial attack against stereo matching models for autonomous driving, using global camouflage textures instead of 2D patches to fool depth estimation.

DetailsMotivation: Existing adversarial attacks mostly target monocular perception with 2D patches, leaving stereo-based binocular depth estimation largely unexplored for physical adversarial examples.

Method: Uses 3D PAE with global camouflage texture, a 3D stereo matching rendering module to handle camera disparity, and a novel merging attack that blends targets into environment through fine-grained optimization.

Result: Extensive evaluations show the PAEs successfully fool stereo models into producing erroneous depth information with enhanced stealth and lethality compared to existing hiding attacks.

Conclusion: The proposed 3D texture-enabled physical adversarial attack effectively compromises stereo matching models in autonomous driving scenarios, demonstrating both visual consistency and attack effectiveness across different viewpoints.

Abstract: Though deep neural models adopted to realize the perception of autonomous driving have proven vulnerable to adversarial examples, known attacks often leverage 2D patches and target mostly monocular perception. Therefore, the effectiveness of Physical Adversarial Examples (PAEs) on stereo-based binocular depth estimation remains largely unexplored. To this end, we propose the first texture-enabled physical adversarial attack against stereo matching models in the context of autonomous driving. Our method employs a 3D PAE with global camouflage texture rather than a local 2D patch-based one, ensuring both visual consistency and attack effectiveness across different viewpoints of stereo cameras. To cope with the disparity effect of these cameras, we also propose a new 3D stereo matching rendering module that allows the PAE to be aligned with real-world positions and headings in binocular vision. We further propose a novel merging attack that seamlessly blends the target into the environment through fine-grained PAE optimization. It significantly improves stealth and lethality over existing hiding attacks, which fail to merge seamlessly into the background. Extensive evaluations show that our PAEs can successfully fool the stereo models into producing erroneous depth information.

[224] Learning from the Right Patches: A Two-Stage Wavelet-Driven Masked Autoencoder for Histopathology Representation Learning

Raneen Younis, Louay Hamdi, Lukas Chavez, Zahra Ahmadi

Main category: cs.CV

TL;DR: WISE-MAE introduces a wavelet-informed patch selection strategy for MAE pretraining in digital pathology, using coarse-to-fine processing to focus on structurally rich tissue regions and improve representation learning.

DetailsMotivation: Conventional MAE pretraining with random patch sampling often includes irrelevant or noisy regions in whole-slide images, limiting the model's ability to capture meaningful tissue patterns in digital pathology.

Method: A two-step coarse-to-fine process: wavelet-based screening at low magnification to locate structurally rich regions, followed by high-resolution extraction for detailed modeling using Masked Autoencoders with Vision Transformer backbones.

Result: WISE-MAE achieves competitive representation quality and downstream classification performance across multiple cancer datasets (lung, renal, colorectal) while maintaining efficiency under weak supervision.

Conclusion: The wavelet-informed patch selection strategy effectively brings structure and biological relevance into MAE-based learning, mirroring pathologists’ diagnostic workflow and improving learned representations in digital pathology.

Abstract: Whole-slide images are central to digital pathology, yet their extreme size and scarce annotations make self-supervised learning essential. Masked Autoencoders (MAEs) with Vision Transformer backbones have recently shown strong potential for histopathology representation learning. However, conventional random patch sampling during MAE pretraining often includes irrelevant or noisy regions, limiting the model’s ability to capture meaningful tissue patterns. In this paper, we present a lightweight and domain-adapted framework that brings structure and biological relevance into MAE-based learning through a wavelet-informed patch selection strategy. WISE-MAE applies a two-step coarse-to-fine process: wavelet-based screening at low magnification to locate structurally rich regions, followed by high-resolution extraction for detailed modeling. This approach mirrors the diagnostic workflow of pathologists and improves the quality of learned representations. Evaluations across multiple cancer datasets, including lung, renal, and colorectal tissues, show that WISE-MAE achieves competitive representation quality and downstream classification performance while maintaining efficiency under weak supervision.
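
The coarse screening step described above can be sketched with a one-level 2D Haar transform: tiles with more detail energy are treated as structurally richer and kept for high-resolution extraction. This is a minimal illustration, not the WISE-MAE implementation; the choice of the Haar wavelet, the energy score, and the names `haar_detail_energy` and `select_tiles` are assumptions.

```python
def haar_detail_energy(tile):
    """Mean squared one-level Haar detail coefficient per 2x2 block.

    `tile` is a 2D list of grey values with even height and width.
    Each 2x2 block [[a, b], [c, d]] yields three detail coefficients.
    """
    height, width = len(tile), len(tile[0])
    energy, blocks = 0.0, 0
    for i in range(0, height - 1, 2):
        for j in range(0, width - 1, 2):
            a, b = tile[i][j], tile[i][j + 1]
            c, d = tile[i + 1][j], tile[i + 1][j + 1]
            top_vs_bottom = (a + b - c - d) / 2.0
            left_vs_right = (a - b + c - d) / 2.0
            diagonal = (a - b - c + d) / 2.0
            energy += top_vs_bottom ** 2 + left_vs_right ** 2 + diagonal ** 2
            blocks += 1
    return energy / blocks if blocks else 0.0


def select_tiles(tiles, k):
    """Return indices of the k tiles with the highest detail energy."""
    ranked = sorted(range(len(tiles)),
                    key=lambda idx: haar_detail_energy(tiles[idx]),
                    reverse=True)
    return ranked[:k]
```

A flat background tile scores zero, so a ranking like this would push blank glass and uniform regions to the bottom before any high-magnification patch is read.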

[225] TrackStudio: An Integrated Toolkit for Markerless Tracking

Hristo Dimitrov, Giulia Dominijanni, Viktorija Pavalkyte, Tamar R. Makin

Main category: cs.CV

TL;DR: TrackStudio is a GUI-based markerless motion tracking toolkit that combines existing open-source tools into an accessible pipeline requiring no programming skills, validated across diverse environments with high accuracy.

DetailsMotivation: There's a gap in accessible, integrated solutions for markerless motion tracking that non-experts can use across diverse settings without requiring substantial technical expertise.

Method: Combined established open-source tools into a single modular GUI-based pipeline providing automatic 2D/3D tracking, calibration, preprocessing, feature extraction, and visualization. Tested across three environments with low-cost webcams and high-resolution cameras.

Result: Across 76 participants, average inter-frame correlations exceeded 0.98 and average triangulation errors remained low (<13.6mm for hand tracking), demonstrating stable and consistent tracking. The pipeline can be extended to other body and face regions.

Conclusion: TrackStudio provides a practical, accessible route into markerless tracking for researchers or laypeople who need reliable performance without specialist expertise.

Abstract: Markerless motion tracking has advanced rapidly in the past 10 years and currently offers powerful opportunities for behavioural, clinical, and biomechanical research. While several specialised toolkits provide high performance for specific tasks, using existing tools still requires substantial technical expertise. There remains a gap in accessible, integrated solutions that deliver sufficient tracking for non-experts across diverse settings. TrackStudio was developed to address this gap by combining established open-source tools into a single, modular, GUI-based pipeline that works out of the box. It provides automatic 2D and 3D tracking, calibration, preprocessing, feature extraction, and visualisation without requiring any programming skills. We supply a user guide with practical advice for video acquisition, synchronisation, and setup, alongside documentation of common pitfalls and how to avoid them. To validate the toolkit, we tested its performance across three environments using either low-cost webcams or high-resolution cameras, including challenging conditions for body position, lighting, and space and obstructions. Across 76 participants, average inter-frame correlations exceeded 0.98 and average triangulation errors remained low (<13.6mm for hand tracking), demonstrating stable and consistent tracking. We further show that the same pipeline can be extended beyond hand tracking to other body and face regions. TrackStudio provides a practical, accessible route into markerless tracking for researchers or laypeople who need reliable performance without specialist expertise.
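
The abstract does not define how inter-frame correlation is computed. One plausible reading, shown here purely as an assumption, is the Pearson correlation between the keypoint coordinates of consecutive frames, averaged over a recording. The helper names `pearson` and `mean_interframe_correlation` are hypothetical and are not part of TrackStudio.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length coordinate vectors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return sxy / math.sqrt(sxx * syy)


def mean_interframe_correlation(frames):
    """Average correlation between each frame and the next one.

    `frames` is a list of flattened keypoint coordinate vectors,
    one vector per video frame (an assumed data layout).
    """
    pairs = list(zip(frames, frames[1:]))
    if not pairs:
        raise ValueError("need at least two frames")
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)
```

Under this reading, values close to 1 mean the tracked keypoints move smoothly from frame to frame, while sudden jumps or identity swaps would pull the average down.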

[226] Fairness-Aware Deepfake Detection: Leveraging Dual-Mechanism Optimization

Feng Ding, Wenhui Yi, Yunpeng Zhou, Xinan He, Hong Rao, Shu Hu

Main category: cs.CV

TL;DR: A dual-mechanism framework for fair deepfake detection that maintains accuracy while improving fairness through structural decoupling and distribution alignment.

DetailsMotivation: Current fairness-enhanced deepfake detectors often sacrifice detection accuracy to improve fairness, creating a trade-off that limits practical deployment in digital identity security applications.

Method: Proposes a dual-mechanism collaborative optimization framework that integrates structural fairness decoupling (separating demographic-sensitive channels at architectural level) and global distribution alignment (reducing distance between overall sample distribution and demographic group distributions at feature level).

Result: Experimental results show the framework improves both inter-group and intra-group fairness while maintaining overall detection accuracy across domains, outperforming other methods.

Conclusion: The proposed framework successfully addresses the fairness-accuracy trade-off in deepfake detection, enabling more equitable deployment in digital identity security without compromising detection performance.

Abstract: Fairness is a core element in the trustworthy deployment of deepfake detection models, especially in the field of digital identity security. Biases in detection models toward different demographic groups, such as gender and race, may lead to systemic misjudgments, exacerbating the digital divide and social inequities. However, current fairness-enhanced detectors often improve fairness at the cost of detection accuracy. To address this challenge, we propose a dual-mechanism collaborative optimization framework. Our proposed method innovatively integrates structural fairness decoupling and global distribution alignment: decoupling channels sensitive to demographic groups at the model architectural level, and subsequently reducing the distance between the overall sample distribution and the distributions corresponding to each demographic group at the feature level. Experimental results demonstrate that, compared with other methods, our framework improves both inter-group and intra-group fairness while maintaining overall detection accuracy across domains.
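
The global distribution alignment idea, reducing the distance between the overall feature distribution and each demographic group's distribution, can be sketched with a simple penalty. The abstract does not name the distance, so the squared Euclidean distance between mean feature vectors used here is an assumption, as are the names `mean_vector` and `alignment_penalty`.

```python
def mean_vector(rows):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    if not rows:
        raise ValueError("cannot average an empty list")
    dim = len(rows[0])
    return [sum(row[k] for row in rows) / len(rows) for k in range(dim)]


def alignment_penalty(features, groups):
    """Average squared distance between each group mean and the overall mean.

    `features` holds one feature vector per sample and `groups` holds the
    matching demographic label. A value of 0 means every group mean
    coincides with the overall mean.
    """
    overall = mean_vector(features)
    by_group = {}
    for feat, label in zip(features, groups):
        by_group.setdefault(label, []).append(feat)
    total = 0.0
    for rows in by_group.values():
        group_mean = mean_vector(rows)
        total += sum((a - b) ** 2 for a, b in zip(group_mean, overall))
    return total / len(by_group)
```

In training, a term like this would be added to the detection loss so that lowering it does not require changing the classifier's decision rule; matching only the means is a deliberately crude stand-in for a full distribution distance.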

[227] Point Cloud Quantization through Multimodal Prompting for 3D Understanding

Hongxuan Li, Wencheng Zhu, Huiying Xu, Xinzhong Zhu, Pengfei Zhu

Main category: cs.CV

TL;DR: A multimodal prompting-driven quantization framework for point cloud analysis that uses text embeddings as robust prototype priors and refines them through multimodal prompts to create hybrid geometric-semantic representations.

DetailsMotivation: Current vector quantization methods using trainable vectors or clustered centroids lack representativeness and interpretability, despite multimodal alignment showing promise in vision-language models.

Method: Uses text embeddings from pre-trained models as prototype priors, refines them with multimodal prompts, creates dual-constrained quantization space with compactness and separation regularization, and employs Gumbel-Softmax for differentiable discretization.

Result: Extensive experiments on ModelNet40 and ScanObjectNN datasets demonstrate superior effectiveness compared to existing methods.

Conclusion: The proposed framework successfully addresses limitations of current quantization methods by leveraging multimodal alignment and creating hybrid representations that encode both geometric and semantic information.

Abstract: Vector quantization has emerged as a powerful tool in large-scale multimodal models, unifying heterogeneous representations through discrete token encoding. However, its effectiveness hinges on robust codebook design. Current prototype-based approaches relying on trainable vectors or clustered centroids fall short in representativeness and interpretability, even as multimodal alignment demonstrates its promise in vision-language models. To address these limitations, we propose a simple multimodal prompting-driven quantization framework for point cloud analysis. Our methodology is built upon two core insights: 1) Text embeddings from pre-trained models inherently encode visual semantics through many-to-one contrastive alignment, naturally serving as robust prototype priors; and 2) Multimodal prompts enable adaptive refinement of these prototypes, effectively mitigating vision-language semantic gaps. The framework introduces a dual-constrained quantization space, enforced by compactness and separation regularization, which seamlessly integrates visual and prototype features, resulting in hybrid representations that jointly encode geometric and semantic information. Furthermore, we employ Gumbel-Softmax relaxation to achieve differentiable discretization while maintaining quantization sparsity. Extensive experiments on the ModelNet40 and ScanObjectNN datasets clearly demonstrate the superior effectiveness of the proposed method.
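
The Gumbel-Softmax relaxation mentioned above replaces a hard choice of prototype with a temperature-controlled soft assignment that remains differentiable. The sketch below shows only the sampling formula in plain Python; the function names and the `rng` parameter are illustrative assumptions, and a real model would use an autodiff framework.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=None):
    """Sample a soft one-hot vector: softmax((logits + Gumbel noise) / tau)."""
    if tau <= 0:
        raise ValueError("temperature tau must be positive")
    if rng is None:
        rng = random.Random()
    noisy = []
    for logit in logits:
        # Clamp the uniform sample away from 0 and 1 so both logs are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / tau)
    peak = max(noisy)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def hard_index(soft_assignment):
    """Index of the prototype with the largest soft assignment."""
    return max(range(len(soft_assignment)), key=lambda i: soft_assignment[i])
```

Lower temperatures push the output closer to a one-hot vector, which is how the relaxation can keep the assignment sparse while gradients still flow through the soft weights.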

[228] OmniSparse: Training-Aware Fine-Grained Sparse Attention for Long-Video MLLMs

Feng Chen, Yefei He, Shaoxuan He, Yuanyu He, Jing Liu, Lequan Lin, Akide Liu, Zhaoyang Li, Jiyuan Zhang, Zhenbang Sun, Bohan Zhuang, Qi Wu

Main category: cs.CV

TL;DR: OmniSparse is a training-aware fine-grained sparse attention framework for long-video MLLMs that achieves 2.7x speedup and 2.4x memory reduction while maintaining full attention performance.

DetailsMotivation: Existing sparse attention methods fail to bridge the training-inference gap and lack fine-grained token selection across queries, KV, and heads, leading to suboptimal performance and limited acceleration gains.

Method: Three adaptive mechanisms: (1) query selection via lazy-active classification, (2) KV selection with head-level dynamic budget allocation, and (3) KV cache slimming based on head-level decoding query patterns.

Result: Matches full attention performance while achieving up to 2.7x speedup during prefill and 2.4x memory reduction during decoding.

Conclusion: OmniSparse effectively bridges the training-inference gap and enables fine-grained sparse attention across multiple dimensions for long-video MLLMs.

Abstract: Existing sparse attention methods primarily target inference-time acceleration by selecting critical tokens under predefined sparsity patterns. However, they often fail to bridge the training-inference gap and lack the capacity for fine-grained token selection across multiple dimensions such as queries, key-values (KV), and heads, leading to suboptimal performance and limited acceleration gains. In this paper, we introduce OmniSparse, a training-aware fine-grained sparse attention framework for long-video MLLMs, which operates in both training and inference with dynamic token budget allocation. Specifically, OmniSparse contains three adaptive and complementary mechanisms: (1) query selection via lazy-active classification, retaining active queries that capture broad semantic similarity while discarding most lazy ones that focus on limited local context and exhibit high functional redundancy; (2) KV selection with head-level dynamic budget allocation, where a shared budget is determined based on the flattest head and applied uniformly across all heads to ensure attention recall; and (3) KV cache slimming to reduce head-level redundancy by selectively fetching visual KV cache according to the head-level decoding query pattern. Experimental results show that OmniSparse matches the performance of full attention while achieving up to 2.7x speedup during prefill and 2.4x memory reduction during decoding.
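
Mechanism (2) above sets one key-value budget from the flattest head and applies it to every head. A minimal sketch of that idea follows, assuming the budget is the number of top-weighted keys needed to cover a target share of attention mass. The recall threshold, the cumulative-mass rule, and the names `keys_needed`, `shared_budget`, and `select_keys` are assumptions, not the paper's exact procedure.

```python
def keys_needed(weights, recall):
    """Smallest number of top-weighted keys whose mass reaches `recall`."""
    total = sum(weights)
    covered = 0.0
    for count, weight in enumerate(sorted(weights, reverse=True), start=1):
        covered += weight
        if covered >= recall * total - 1e-12:
            return count
    return len(weights)


def shared_budget(per_head_weights, recall=0.9):
    """Budget required by the flattest head, to be used by all heads.

    A flat attention distribution needs the most keys to reach the same
    recall, so taking the maximum protects recall for every head.
    """
    return max(keys_needed(weights, recall) for weights in per_head_weights)


def select_keys(weights, budget):
    """Indices of the `budget` highest-weighted keys, in original order."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return sorted(order[:budget])
```

For one peaked head and one uniform head over four keys, the uniform head dictates the budget, so no head loses more attention mass than the chosen recall allows.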

[229] GeoMVD: Geometry-Enhanced Multi-View Generation Model Based on Geometric Information Extraction

Jiaqi Wu, Yaosen Chen, Shuyuan Zhu

Main category: cs.CV

TL;DR: A geometry-guided multi-view diffusion model that uses depth, normal, and segmentation maps to ensure cross-view consistency and generate high-resolution images through decoupled attention and adaptive learning.

DetailsMotivation: Existing multi-view image generation methods face computational challenges in maintaining cross-view consistency and generating high-resolution outputs when extending from single images.

Method: Uses multi-view geometry extraction with depth/normal/segmentation maps, decoupled geometry-enhanced attention, adaptive learning strategy, iterative refinement, and dynamic geometry intensity adjustment.

Result: Generates images that are consistent across views and rich in detail, improving overall image quality and detail preservation.

Conclusion: The proposed model effectively addresses cross-view consistency and high-resolution generation challenges in multi-view image synthesis.

Abstract: Multi-view image generation holds significant application value in computer vision, particularly in domains like 3D reconstruction, virtual reality, and augmented reality. Most existing methods, which rely on extending single images, face notable computational challenges in maintaining cross-view consistency and generating high-resolution outputs. To address these issues, we propose the Geometry-guided Multi-View Diffusion Model, which incorporates mechanisms for extracting multi-view geometric information and adjusting the intensity of geometric features to generate images that are both consistent across views and rich in detail. Specifically, we design a multi-view geometry information extraction module that leverages depth maps, normal maps, and foreground segmentation masks to construct a shared geometric structure, ensuring shape and structural consistency across different views. To enhance consistency and detail restoration during generation, we develop a decoupled geometry-enhanced attention mechanism that strengthens feature focus on key geometric details, thereby improving overall image quality and detail preservation. Furthermore, we apply an adaptive learning strategy that fine-tunes the model to better capture spatial relationships and visual coherence between the generated views, ensuring realistic results. Our model also incorporates an iterative refinement process that progressively improves the output quality through multiple stages of image generation. Finally, a dynamic geometry information intensity adjustment mechanism is proposed to adaptively regulate the influence of geometric data, optimizing overall quality while ensuring the naturalness of generated images. More details can be found on the project page: https://sobeymil.github.io/GeoMVD.com.

[230] HiFusion: Hierarchical Intra-Spot Alignment and Regional Context Fusion for Spatial Gene Expression Prediction from Histopathology

Ziqiao Weng, Yaoyu Fang, Jiahe Qian, Xinkun Wang, Lee AD Cooper, Weidong Cai, Bo Zhou

Main category: cs.CV

TL;DR: HiFusion is a deep learning framework that predicts spatial transcriptomics gene expression from H&E-stained whole-slide images by hierarchically modeling intra-spot morphological features and selectively integrating contextual tissue information.

DetailsMotivation: Existing methods for predicting gene expression from histology images fail to capture biological heterogeneity within spots and are susceptible to morphological noise when using contextual information from surrounding tissue.

Method: HiFusion uses two complementary modules: Hierarchical Intra-Spot Modeling for multi-resolution sub-patch decomposition with feature alignment, and Context-aware Cross-scale Fusion using cross-attention to selectively incorporate relevant regional context.

Result: HiFusion achieves state-of-the-art performance on two benchmark ST datasets in both 2D slide-wise cross-validation and more challenging 3D sample-specific scenarios.

Conclusion: HiFusion provides a robust, accurate, and scalable solution for spatial transcriptomics inference from routine histopathology images, overcoming barriers to clinical adoption.

Abstract: Spatial transcriptomics (ST) bridges gene expression and tissue morphology but faces clinical adoption barriers due to technical complexity and prohibitive costs. While computational methods predict gene expression from H&E-stained whole-slide images (WSIs), existing approaches often fail to capture the intricate biological heterogeneity within spots and are susceptible to morphological noise when integrating contextual information from surrounding tissue. To overcome these limitations, we propose HiFusion, a novel deep learning framework that integrates two complementary components. First, we introduce the Hierarchical Intra-Spot Modeling module that extracts fine-grained morphological representations through multi-resolution sub-patch decomposition, guided by a feature alignment loss to ensure semantic consistency across scales. Concurrently, we present the Context-aware Cross-scale Fusion module, which employs cross-attention to selectively incorporate biologically relevant regional context, thereby enhancing representational capacity. This architecture enables comprehensive modeling of both cellular-level features and tissue microenvironmental cues, which are essential for accurate gene expression prediction. Extensive experiments on two benchmark ST datasets demonstrate that HiFusion achieves state-of-the-art performance across both 2D slide-wise cross-validation and more challenging 3D sample-specific scenarios. These results underscore HiFusion’s potential as a robust, accurate, and scalable solution for ST inference from routine histopathology.

[231] SymGS : Leveraging Local Symmetries for 3D Gaussian Splatting Compression

Keshav Gupta, Akshat Sanghvi, Shreyas Reddy Palley, Astitva Srivastava, Charu Sharma, Avinash Sharma

Main category: cs.CV

TL;DR: SymGS is a symmetry-aware compression framework for 3D Gaussian Splatting that uses learnable mirrors to eliminate redundant primitives, achieving 108× compression while preserving rendering quality.

DetailsMotivation: 3D Gaussian Splatting has high memory footprint that scales with scene complexity, and existing compression methods have limitations in exploiting symmetry-based redundancies.

Method: Introduces learnable mirrors to detect and eliminate local and global reflective redundancies, functioning as a plug-and-play enhancement to existing compression methods like HAC.

Result: Achieves 1.66× compression over HAC across benchmark datasets (up to 3× on large-scale scenes) and 108× average compression of 3DGS scenes while preserving rendering quality.

Conclusion: SymGS effectively reduces memory footprint in 3D Gaussian Splatting through symmetry-aware compression, significantly outperforming existing methods.

Abstract: 3D Gaussian Splatting has emerged as a transformative technique in novel view synthesis, primarily due to its high rendering speed and photorealistic fidelity. However, its memory footprint scales rapidly with scene complexity, often reaching several gigabytes. Existing methods address this issue by introducing compression strategies that exploit primitive-level redundancy through similarity detection and quantization. We aim to surpass the compression limits of such methods by incorporating symmetry-aware techniques, specifically targeting mirror symmetries to eliminate redundant primitives. We propose a novel compression framework, SymGS, introducing learnable mirrors into the scene, thereby eliminating local and global reflective redundancies for compression. Our framework functions as a plug-and-play enhancement to state-of-the-art compression methods (e.g., HAC) to achieve further compression. Compared to HAC, we achieve $1.66\times$ compression across benchmark datasets (up to $3\times$ on large-scale scenes). On average, SymGS enables $108\times$ compression of a 3DGS scene, while preserving rendering quality. The project page and supplementary can be found at symgs.github.io
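
The core geometric operation behind mirror-based redundancy removal is reflection across a plane: if a primitive's mirror image is already present, only one copy and the mirror need to be stored. The sketch below reflects point centres across a fixed plane. It is an illustration under assumptions: the plane here is given rather than learned, the names `reflect` and `reconstruct` are hypothetical, and a full Gaussian primitive would also need its orientation and covariance reflected.

```python
import math


def reflect(point, normal, offset=0.0):
    """Reflect a 3D point across the plane {x : normal . x = offset}."""
    norm = math.sqrt(sum(c * c for c in normal))
    if norm == 0:
        raise ValueError("plane normal must be non-zero")
    unit = [c / norm for c in normal]
    # Signed distance from the point to the plane along the unit normal.
    distance = sum(u * p for u, p in zip(unit, point)) - offset / norm
    return [p - 2.0 * distance * u for p, u in zip(point, unit)]


def reconstruct(stored_points, normal, offset=0.0):
    """Rebuild a symmetric set from the stored half plus its mirror image."""
    mirrored = [reflect(p, normal, offset) for p in stored_points]
    return list(stored_points) + mirrored
```

Storing one half of a perfectly symmetric region plus four plane parameters roughly halves the number of centres for that region, which is the kind of saving a mirror can add on top of an existing codec.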

[232] What Color Is It? A Text-Interference Multimodal Hallucination Benchmark

Jinkun Zhao, Lei Huang, Haixin Ge, Wenjun Wu

Main category: cs.CV

TL;DR: The paper introduces a “What Color Is It” dataset to test color perception hallucinations in Multimodal Large Models (MLMs) and proposes solutions to improve their robustness.

DetailsMotivation: MLMs are susceptible to informational interference in visual perception, especially color perception, which increases hallucination risks.

Method: Created a novel benchmark dataset using a simple method to trigger single-modality visual hallucination in MLMs.

Result: The dataset successfully triggers visual hallucinations in MLMs, revealing vulnerabilities in color perception.

Conclusion: The study identifies causes of visual modality hallucination in MLMs and suggests potential solutions to enhance their robustness.

Abstract: With the rapid advancement of Large Models, numerous text-and-vision-fused Multimodal Large Models (MLMs) have emerged. However, these MLMs remain susceptible to informational interference in visual perception, particularly in color perception, which introduces an additional risk of hallucination. To validate this hypothesis, we introduce the “What Color Is It” dataset, a novel benchmark constructed using a simple method to trigger single-modality visual hallucination in MLMs. Based on this dataset, we further investigate the underlying causes of hallucination in the visual modality of MLMs and propose potential solutions to enhance their robustness.

[233] Adaptive Multi-Scale Integration Unlocks Robust Cell Annotation in Histopathology Images

Yinuo Xu, Yan Cui, Mingyao Li, Zhi Huang

Main category: cs.CV

TL;DR: NuClass is a multi-scale framework that combines nuclear morphology and tissue context for cell classification, achieving up to 96% F1 score on held-out datasets.

DetailsMotivation: Existing tile-based models miss broader tissue context that influences cell identity, and current human annotations are coarse-grained and uneven, making fine-grained cell subtype classification difficult.

Method: NuClass integrates nuclear morphology (Path local from 224x224 crops) and microenvironmental context (Path global from 1024x1024 neighborhoods) through a learnable gating module with uncertainty-guided objective that prioritizes regions where local path is uncertain.

Result: Achieves up to 96% F1 for best-performing class on three fully held-out cohorts, outperforming strong baselines.

Conclusion: Multi-scale, uncertainty-aware fusion can bridge the gap between slide-level pathological foundation models and reliable cell-level phenotype prediction.

Abstract: Identifying cell types and subtypes in routine histopathology is fundamental for understanding disease. Existing tile-based models capture nuclear detail but miss the broader tissue context that influences cell identity. Current human annotations are coarse-grained and uneven across studies, making fine-grained, subtype-level classification difficult. In this study, we build a marker-guided dataset from Xenium spatial transcriptomics with single-cell resolution labels for more than two million cells across eight organs and 16 classes to address the lack of high-quality annotations. Leveraging this data resource, we introduce NuClass, a pathologist-workflow-inspired framework for cell-wise multi-scale integration of nuclear morphology and microenvironmental context. It combines Path local, which focuses on nuclear morphology from 224x224 pixel crops, and Path global, which models the surrounding 1024x1024 pixel neighborhood, through a learnable gating module that balances local and global information. An uncertainty-guided objective directs the global path to prioritize regions where the local path is uncertain, and we provide calibrated confidence estimates and Grad-CAM maps for interpretability. Evaluated on three fully held-out cohorts, NuClass achieves up to 96 percent F1 for its best-performing class, outperforming strong baselines. Our results demonstrate that multi-scale, uncertainty-aware fusion can bridge the gap between slide-level pathological foundation models and reliable, cell-level phenotype prediction.
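The local/global fusion described above can be pictured with a small plain-Python sketch. Everything here is an assumption made for illustration: the abstract does not give the gating architecture or the form of the uncertainty-guided objective, so the learnable gate is replaced by a hand-set rule (one minus the normalized entropy of the local prediction), and the logits are made-up numbers.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def normalized_entropy(probs):
    """Shannon entropy scaled to [0, 1]; 1 means maximally uncertain."""
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))

def gated_fusion(local_logits, global_logits, gate):
    """Blend per-class logits; `gate` in [0, 1] is the weight on the local path."""
    return [gate * l + (1.0 - gate) * g
            for l, g in zip(local_logits, global_logits)]

# Hypothetical logits: the local (nuclear crop) path is unsure,
# the global (neighborhood) path is confident about class 0.
local_logits = [0.1, 0.0, 0.05]
global_logits = [3.0, 0.2, 0.1]

# Assumption: a fixed rule stands in for the learned gate, trusting the
# local path less when its prediction is close to uniform.
u = normalized_entropy(softmax(local_logits))
gate = 1.0 - u

fused = softmax(gated_fusion(local_logits, global_logits, gate))
predicted_class = max(range(len(fused)), key=fused.__getitem__)
```

In this toy case the local prediction is nearly uniform, so the gate is close to zero and the fused prediction follows the global path.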

[234] Distribution Matching Distillation Meets Reinforcement Learning

Dengyang Jiang, Dongyang Liu, Zanyi Wang, Qilong Wu, Liuzhuozheng Li, Hengzhuang Li, Xin Jin, David Liu, Zhen Li, Bo Zhang, Mengmeng Wang, Steven Hoi, Peng Gao, Harry Yang

Main category: cs.CV

TL;DR: DMDR combines Reinforcement Learning with Distribution Matching Distillation to improve few-step diffusion model performance beyond the original teacher model.

DetailsMotivation: To overcome the performance limitation where few-step distilled models are capped by their multi-step teachers, enabling the student model to potentially exceed teacher performance.

Method: Integrates RL techniques into the distillation process, uses the DMD loss as RL regularization, and implements dynamic distribution guidance and dynamic renoise sampling strategies.

Result: Achieves leading visual quality and prompt coherence among few-step methods, with performance exceeding the multi-step teacher model.

Conclusion: DMDR successfully unlocks the capacity of few-step generators by simultaneously conducting distillation and RL, demonstrating superior performance over existing few-step methods.

Abstract: Distribution Matching Distillation (DMD) distills a pre-trained multi-step diffusion model to a few-step one to improve inference efficiency. However, the performance of the latter is often capped by the former. To circumvent this dilemma, we propose DMDR, a novel framework that incorporates Reinforcement Learning (RL) techniques into the distillation process. We show that for the RL of the few-step generator, the DMD loss itself is a more effective regularizer than traditional ones. In turn, RL can help to guide the mode coverage process in DMD more effectively. Together, these allow us to unlock the capacity of the few-step generator by conducting distillation and RL simultaneously. Meanwhile, we design the dynamic distribution guidance and dynamic renoise sampling training strategies to improve the initial distillation process. The experiments demonstrate that DMDR can achieve leading visual quality and prompt coherence among few-step methods, and can even exceed the performance of the multi-step teacher.

[235] Uni-Hema: Unified Model for Digital Hematopathology

Abdul Rehman, Iqra Rasool, Ayisha Imran, Mohsen Ali, Waqas Sultani

Main category: cs.CV

TL;DR: Uni-Hema is a unified multi-task model for digital hematopathology that integrates detection, classification, segmentation, morphology prediction, and reasoning across multiple diseases using a multimodal approach.

DetailsMotivation: Existing approaches in digital hematopathology (single-task, vision-language, WSI-optimized, or single-cell models) cannot provide unified, multi-task, multi-modal reasoning across the complexities of diverse disease categories including leukemia, malaria, and sickle cell disease.

Method: Built on Hema-Former, a multimodal module that bridges visual and textual representations at the hierarchy level for different tasks (detection, classification, segmentation, morphology, masked language modeling, visual question answering) at different granularities. Uses 46 public datasets with over 700K images and 21K question-answer pairs.

Result: Uni-Hema achieves comparable or superior performance to single-task, single-dataset models across diverse hematological tasks, while providing interpretable, morphologically relevant insights at the single-cell level.

Conclusion: The framework establishes a new standard for multi-task and multi-modal digital hematopathology, overcoming limitations of existing approaches by providing unified reasoning across complex disease categories.

Abstract: Digital hematopathology requires cell-level analysis across diverse disease categories, including malignant disorders (e.g., leukemia), infectious conditions (e.g., malaria), and non-malignant red blood cell disorders (e.g., sickle cell disease). Whether single-task, vision-language, WSI-optimized, or single-cell hematology models, existing approaches share a key limitation: they cannot provide unified, multi-task, multi-modal reasoning across the complexities of digital hematopathology. To overcome this limitation, we propose Uni-Hema, a multi-task, unified model for digital hematopathology integrating detection, classification, segmentation, morphology prediction, and reasoning across multiple diseases. Uni-Hema leverages 46 publicly available datasets, encompassing over 700K images and 21K question-answer pairs, and is built upon Hema-Former, a multimodal module that bridges visual and textual representations at the hierarchy level for the different tasks (detection, classification, segmentation, morphology, masked language modeling, and visual question answering) at different granularities. Extensive experiments demonstrate that Uni-Hema achieves performance comparable or superior to models trained on a single task and a single dataset across diverse hematological tasks, while providing interpretable, morphologically relevant insights at the single-cell level. Our framework establishes a new standard for multi-task and multi-modal digital hematopathology. The code will be made publicly available.

[236] Wave-Former: Through-Occlusion 3D Reconstruction via Wireless Shape Completion

Laura Dodds, Maisy Lam, Waleed Akbar, Yibo Cheng, Fadel Adib

Main category: cs.CV

TL;DR: Wave-Former enables high-accuracy 3D shape reconstruction of completely occluded objects using millimeter-wave signals that can penetrate occlusions, achieving 72% recall and 85% precision.

DetailsMotivation: To enable 3D shape reconstruction of hidden objects for applications in robotics, augmented reality, and logistics, overcoming limitations of past mmWave methods that suffered from limited coverage and high noise.

Method: A three-stage pipeline that bridges raw wireless signals with vision-based shape completion: proposes candidate geometric surfaces, uses a transformer-based shape completion model designed for mmWave signals, and performs entropy-guided surface selection. Can be trained using synthetic point-clouds.

Result: Wave-Former significantly outperforms state-of-the-art baselines, raising recall from 54% to 72% while maintaining high precision of 85%. Shows impressive generalization to real-world data despite being trained on synthetic data.

Conclusion: Wave-Former demonstrates successful 3D shape reconstruction of completely occluded objects using mmWave signals, with a physics-aware approach that bridges wireless signals and computer vision techniques for improved performance.

Abstract: We present Wave-Former, a novel method capable of high-accuracy 3D shape reconstruction for completely occluded, diverse, everyday objects. This capability can open new applications spanning robotics, augmented reality, and logistics. Our approach leverages millimeter-wave (mmWave) wireless signals, which can penetrate common occlusions and reflect off hidden objects. In contrast to past mmWave reconstruction methods, which suffer from limited coverage and high noise, Wave-Former introduces a physics-aware shape completion model capable of inferring full 3D geometry. At the heart of Wave-Former’s design is a novel three-stage pipeline which bridges raw wireless signals with recent advancements in vision-based shape completion by incorporating physical properties of mmWave signals. The pipeline proposes candidate geometric surfaces, employs a transformer-based shape completion model designed specifically for mmWave signals, and finally performs entropy-guided surface selection. This enables Wave-Former to be trained using entirely synthetic point-clouds, while demonstrating impressive generalization to real-world data. In head-to-head comparisons with state-of-the-art baselines, Wave-Former raises recall from 54% to 72% while maintaining a high precision of 85%.

[237] DoGCLR: Dominance-Game Contrastive Learning Network for Skeleton-Based Action Recognition

Yanshan Li, Ke Ma, Miaomiao Wei, Linhui Dai

Main category: cs.CV

TL;DR: DoGCLR is a self-supervised contrastive learning framework for skeleton-based action recognition that uses game theory to optimize positive/negative sample construction and employs spatio-temporal localization with entropy-driven memory management.

DetailsMotivation: Existing self-supervised methods process skeleton regions uniformly and use FIFO queues for negative samples, causing motion information loss and suboptimal negative sample selection.

Method: Models sample construction as a Dominance Game for equilibrium between semantic preservation and discriminative strength. Uses spatio-temporal dual weight localization for key motion regions and entropy-driven dominance strategy for memory bank management.

Result: Achieves 81.1%/89.4% accuracy on NTU RGB+D 60 X-Sub/X-View and 71.2%/75.5% on NTU RGB+D 120 X-Sub/X-Set, surpassing SOTA by 0.1-2.7%. On PKU-MMD Part II, achieves 1.9% higher accuracy than SOTA.

Conclusion: DoGCLR demonstrates strong performance and robustness, particularly in challenging scenarios, through its game-theoretic approach to contrastive learning in skeleton-based action recognition.

Abstract: Existing self-supervised contrastive learning methods for skeleton-based action recognition often process all skeleton regions uniformly, and adopt a first-in-first-out (FIFO) queue to store negative samples, which leads to motion information loss and non-optimal negative sample selection. To address these challenges, this paper proposes Dominance-Game Contrastive Learning network for skeleton-based action Recognition (DoGCLR), a self-supervised framework based on game theory. DoGCLR models the construction of positive and negative samples as a dynamic Dominance Game, where both sample types interact to reach an equilibrium that balances semantic preservation and discriminative strength. Specifically, a spatio-temporal dual weight localization mechanism identifies key motion regions and guides region-wise augmentations to enhance motion diversity while maintaining semantics. In parallel, an entropy-driven dominance strategy manages the memory bank by retaining high-entropy (hard) negatives and replacing low-entropy (weak) ones, ensuring consistent exposure to informative contrastive signals. Extensive experiments are conducted on NTU RGB+D and PKU-MMD datasets. On NTU RGB+D 60 X-Sub/X-View, DoGCLR achieves 81.1%/89.4% accuracy, and on NTU RGB+D 120 X-Sub/X-Set, DoGCLR achieves 71.2%/75.5% accuracy, surpassing state-of-the-art methods by 0.1%, 2.7%, 1.1%, and 2.3%, respectively. On PKU-MMD Part I/Part II, DoGCLR performs comparably to the state-of-the-art methods and achieves a 1.9% higher accuracy on Part II, highlighting its strong robustness in more challenging scenarios.
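To make the contrast with a FIFO queue concrete, here is a minimal sketch of a fixed-size memory bank that evicts the lowest-entropy entry instead of the oldest one. The class name, the per-sample entropy scores, and the rule for discarding a newcomer weaker than everything stored are all assumptions; the abstract does not say how entropy is computed or how ties are handled.

```python
import heapq

class EntropyMemoryBank:
    """Fixed-size store of negatives that keeps the highest-entropy samples.

    Hypothetical illustration: a FIFO queue would evict the oldest entry,
    whereas this bank evicts the entry with the lowest entropy score.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []    # min-heap of (entropy, insertion_order, sample)
        self._counter = 0  # tie-breaker so samples are never compared directly

    def add(self, sample, entropy):
        entry = (entropy, self._counter, sample)
        self._counter += 1
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
        elif entropy > self._heap[0][0]:
            # Replace the weakest (lowest-entropy) stored negative.
            heapq.heapreplace(self._heap, entry)
        # Otherwise the new sample is weaker than everything stored; skip it.

    def samples(self):
        return [sample for _, _, sample in self._heap]

bank = EntropyMemoryBank(capacity=3)
for name, score in [("a", 0.2), ("b", 0.9), ("c", 0.5), ("d", 0.7), ("e", 0.1)]:
    bank.add(name, score)
kept = sorted(bank.samples())
```

With these made-up scores, "d" displaces "a" and "e" is never stored, so the bank ends up holding the three hardest negatives regardless of arrival order.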

[238] GloTok: Global Perspective Tokenizer for Image Reconstruction and Generation

Xuan Zhao, Zhongyu Zhang, Yuge Huang, Yuxi Mi, Guodong Mu, Shouhong Ding, Jun Wang, Rizen Guo, Shuigeng Zhou

Main category: cs.CV

TL;DR: GloTok introduces a global perspective tokenizer that uses codebook-wise histogram relation learning to create more uniform semantic distributions for better image generation quality.

DetailsMotivation: Existing image tokenization methods use local semantic supervision, which limits semantic distribution uniformity. Since VA-VAE shows that more uniform feature distributions yield better generation performance, the authors aim to create a more globally uniform semantic distribution.

Method: Proposes Global Perspective Tokenizer (GloTok) with codebook-wise histogram relation learning to transfer dataset-wide semantics to semantic codebook, and a residual learning module to recover fine-grained details from quantization errors.

Result: Achieves state-of-the-art reconstruction performance and generation quality on ImageNet-1k benchmark, producing more uniformly distributed semantic latent representations.

Conclusion: GloTok’s global relational modeling approach enables better image generation by creating more uniform semantic distributions without requiring access to pre-trained models during training.

Abstract: Existing state-of-the-art image tokenization methods leverage diverse semantic features from pre-trained vision models for additional supervision, to expand the distribution of latent representations and thereby improve the quality of image reconstruction and generation. These methods employ a locally supervised approach for semantic supervision, which limits the uniformity of semantic distribution. However, VA-VAE proves that a more uniform feature distribution yields better generation performance. In this work, we introduce a Global Perspective Tokenizer (GloTok), which utilizes global relational information to model a more uniform semantic distribution of tokenized features. Specifically, a codebook-wise histogram relation learning method is proposed to transfer the semantics, which are modeled by pre-trained models on the entire dataset, to the semantic codebook. Then, we design a residual learning module that recovers the fine-grained details to minimize the reconstruction error caused by quantization. Through the above design, GloTok delivers more uniformly distributed semantic latent representations, which facilitates the training of autoregressive (AR) models for generating high-quality images without requiring direct access to pre-trained models during the training process. Experiments on the standard ImageNet-1k benchmark clearly show that our proposed method achieves state-of-the-art reconstruction performance and generation quality.

[239] PAVE: An End-to-End Dataset for Production Autonomous Vehicle Evaluation

Xiangyu Li, Chen Wang, Yumao Liu, Dengbo He, Jiahao Zhang, Ke Ma

Main category: cs.CV

TL;DR: PAVE is the first end-to-end autonomous driving dataset collected entirely by real autonomous vehicles, containing over 100 hours of naturalistic driving data with high-precision localization and comprehensive annotations for safety evaluation.

DetailsMotivation: Existing datasets collected by human drivers or unidentified modes are insufficient for evaluating real behavioral safety of autonomous vehicles in black-box scenarios.

Method: Collected over 100 hours of naturalistic data from production AVs, segmented into 32,727 key frames with synchronized camera images, GNSS/IMU data, vehicle trajectories, and detailed 2D annotations of surrounding objects and scenario attributes.

Result: Dataset provides high-precision data (0.8 cm localization accuracy) with rich scenario attributes and supports end-to-end motion planning models achieving 1.4m ADE on autonomous-driving frames.

Conclusion: PAVE dataset enables comprehensive safety evaluation of AVs and continues to expand weekly, providing a sustainable foundation for autonomous driving behavior analysis and safety research.

Abstract: Most existing autonomous-driving datasets (e.g., KITTI, nuScenes, and the Waymo Perception Dataset), collected in human-driving mode or an unidentified driving mode, can only serve as early training for the perception and prediction of autonomous vehicles (AVs). To evaluate the real behavioral safety of AVs controlled in the black box, we present the first end-to-end benchmark dataset collected entirely in autonomous-driving mode in the real world. This dataset contains over 100 hours of naturalistic data from multiple production autonomous-driving vehicle models on the market. We segment the original data into 32,727 key frames, each consisting of four synchronized camera images and high-precision GNSS/IMU data (0.8 cm localization accuracy). For each key frame, 20 Hz vehicle trajectories spanning the past 6 s and future 5 s are provided, along with detailed 2D annotations of surrounding vehicles, pedestrians, traffic lights, and traffic signs. These key frames have rich scenario-level attributes, including driver intent, area type (covering highways, urban roads, and residential areas), lighting (day, night, or dusk), weather (clear or rain), road surface (paved or unpaved), traffic and vulnerable road users (VRU) density, traffic lights, and traffic signs (warning, prohibition, and indication). To evaluate the safety of AVs, we employ an end-to-end motion planning model that predicts vehicle trajectories with an Average Displacement Error (ADE) of 1.4 m on autonomous-driving frames. The dataset continues to expand by over 10 hours of new data weekly, thereby providing a sustainable foundation for research on AV driving behavior analysis and safety evaluation. The PAVE dataset is publicly available at https://hkustgz-my.sharepoint.com/:f:/g/personal/kema_hkust-gz_edu_cn/IgDXyoHKfdGnSZ3JbbidjduMAXxs-Z3NXzm005A_Ix9tr0Q?e=9HReCu.
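For readers unfamiliar with the metric, Average Displacement Error is conventionally the mean Euclidean distance between predicted and ground-truth waypoints over the prediction horizon. The sketch below shows that standard definition on toy 2D waypoints; the function name and the example trajectories are illustrative, and the abstract does not state the horizon or coordinate frame used for the reported 1.4 m figure.

```python
import math

def average_displacement_error(predicted, ground_truth):
    """Mean Euclidean distance between matching waypoints of two trajectories."""
    if not predicted or len(predicted) != len(ground_truth):
        raise ValueError("trajectories must be non-empty and of equal length")
    distances = [math.dist(p, g) for p, g in zip(predicted, ground_truth)]
    return sum(distances) / len(distances)

# Toy waypoints in meters (hypothetical values, not from the dataset).
ground_truth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
predicted = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]

# Per-waypoint errors are 0 m, 1 m, and 2 m, so the mean is 1 m.
ade = average_displacement_error(predicted, ground_truth)
```

At the dataset's 20 Hz sampling rate, a full 5 s future trajectory would contain 100 waypoints per key frame.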

[240] StreamingTalker: Audio-driven 3D Facial Animation with Autoregressive Diffusion Model

Yifan Yang, Zhi Cen, Sida Peng, Xiangwei Chen, Yifu Deng, Xinyu Zhu, Fan Jia, Xiaowei Zhou, Hujun Bao

Main category: cs.CV

TL;DR: Proposes an autoregressive diffusion model for speech-driven 3D facial animation that processes audio in streaming manner to handle varying lengths and achieve low latency.

DetailsMotivation: Address limitations of existing methods that process entire audio sequences at once, which perform poorly with long sequences and suffer from significant latency.

Method: Uses autoregressive diffusion model with limited past frames as historical motion context combined with audio input to create dynamic conditions for iterative facial motion generation.

Result: Achieves real-time synthesis with high-quality results, flexible handling of varying audio lengths, and low latency independent of audio duration.

Conclusion: The proposed streaming approach effectively overcomes limitations of single-pass processing methods and enables efficient real-time 3D facial animation.

Abstract: This paper focuses on the task of speech-driven 3D facial animation, which aims to generate realistic and synchronized facial motions driven by speech inputs. Recent methods have employed audio-conditioned diffusion models for 3D facial animation, achieving impressive results in generating expressive and natural animations. However, these methods process the whole audio sequence in a single pass, which poses two major challenges: they tend to perform poorly when handling audio sequences that exceed the training horizon and suffer from significant latency when processing long audio inputs. To address these limitations, we propose a novel autoregressive diffusion model that processes input audio in a streaming manner. This design ensures flexibility with varying audio lengths and achieves low latency independent of audio duration. Specifically, we select a limited number of past frames as historical motion context and combine them with the audio input to create a dynamic condition. This condition guides the diffusion process to iteratively generate facial motion frames, enabling real-time synthesis with high-quality results. Additionally, we implemented a real-time interactive demo, highlighting the effectiveness and efficiency of our approach. We will release the code at https://zju3dv.github.io/StreamingTalker/.
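The streaming idea can be sketched as a generator that keeps only a bounded window of past frames, so the per-step cost does not grow with audio length. This is a structural illustration only: `stream_animation`, `context_size`, and the stand-in `dummy_generate_frame` are hypothetical names, and the real model would run a conditioned diffusion sampler where the dummy function sits.

```python
from collections import deque

def stream_animation(audio_chunks, generate_frame, context_size):
    """Yield one motion frame per audio chunk using a bounded motion history."""
    history = deque(maxlen=context_size)  # oldest frames fall off automatically
    for chunk in audio_chunks:
        # The "dynamic condition": recent motion context plus current audio.
        condition = (tuple(history), chunk)
        frame = generate_frame(condition)
        history.append(frame)
        yield frame

seen_context_lengths = []

def dummy_generate_frame(condition):
    """Placeholder for the diffusion sampler; records how much context it saw."""
    past_frames, chunk = condition
    seen_context_lengths.append(len(past_frames))
    return chunk * 2

frames = list(stream_animation([1, 2, 3, 4, 5], dummy_generate_frame, context_size=2))
```

Because the history is capped, the context passed to the sampler stops growing after the first few steps, which is what makes latency independent of the total audio duration.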

[241] Gaussian Splatting-based Low-Rank Tensor Representation for Multi-Dimensional Image Recovery

Yiming Zeng, Xi-Le Zhao, Wei-Hao Wu, Teng-Yu Ji, Chao Wang

Main category: cs.CV

TL;DR: GSLR uses Gaussian splatting to improve tensor SVD representation for multi-dimensional images, addressing limitations in capturing local high-frequency information through tailored 2D and 1D Gaussian splatting for latent tensor and transform matrix generation.

DetailsMotivation: Current t-SVD methods have two key limitations: coarse approximation of latent tensor that fails to capture spatial local high-frequency information, and fixed basis atoms in transform matrix that cannot precisely capture local high-frequency information along mode-3 fibers.

Method: Proposed Gaussian Splatting-based Low-rank tensor Representation (GSLR) framework using tailored 2D Gaussian splatting for latent tensor generation and 1D Gaussian splatting for transform matrix generation, with complementary representation capabilities.

Result: Extensive experiments on multi-dimensional image recovery show GSLR consistently outperforms state-of-the-art methods, particularly in capturing local high-frequency information.

Conclusion: GSLR provides a compact and continuous representation framework for multi-dimensional images with superior representation capability, especially for local high-frequency information, through the complementary use of 2D and 1D Gaussian splatting.

Abstract: Tensor singular value decomposition (t-SVD) is a promising tool for multi-dimensional image representation, which decomposes a multi-dimensional image into a latent tensor and an accompanying transform matrix. However, two critical limitations of t-SVD methods persist: (1) the approximation of the latent tensor (e.g., tensor factorizations) is coarse and fails to accurately capture spatial local high-frequency information; (2) the transform matrix is composed of fixed basis atoms (e.g., complex exponential atoms in DFT and cosine atoms in DCT) and cannot precisely capture local high-frequency information along the mode-3 fibers. To address these two limitations, we propose a Gaussian Splatting-based Low-rank tensor Representation (GSLR) framework, which compactly and continuously represents multi-dimensional images. Specifically, we leverage tailored 2D Gaussian splatting and 1D Gaussian splatting to generate the latent tensor and transform matrix, respectively. The 2D and 1D Gaussian splatting are indispensable and complementary under this representation framework, which enjoys a powerful representation capability, especially for local high-frequency information. To evaluate the representation ability of the proposed GSLR, we develop an unsupervised GSLR-based multi-dimensional image recovery model. Extensive experiments on multi-dimensional image recovery demonstrate that GSLR consistently outperforms state-of-the-art methods, particularly in capturing local high-frequency information.

cs.AI

[242] The Illusion of Procedural Reasoning: Measuring Long-Horizon FSM Execution in LLMs

Mahdi Samiei, Mahdi Mansouri, Mahdieh Soleymani Baghshah

Main category: cs.AI

TL;DR: FSM Execution is introduced as a minimal, interpretable benchmark to evaluate LLMs’ procedural reasoning capacity, revealing systematic degradation with increased task complexity and highlighting the gap between local accuracy and long-horizon reliability.

DetailsMotivation: To create a controlled, interpretable benchmark that isolates and measures LLMs' procedural reasoning capacity, addressing the lack of frameworks that can systematically evaluate their ability to execute multi-step, rule-based computations without degradation.

Method: Uses Finite-State Machine (FSM) Execution where models are given explicit FSM definitions and must execute them step-by-step with input actions, maintaining state consistency over multiple turns. Measures both Turn Accuracy and Task Accuracy to distinguish immediate computation from cumulative state maintenance.

Result: Empirical results show systematic degradation as task horizon or branching complexity increases. Models perform worse with high branching factors than long memory spans. Larger models improve local accuracy but remain brittle in multi-step reasoning unless prompted to externalize intermediate steps.

Conclusion: FSM-based evaluation provides a transparent, complexity-controlled probe for diagnosing procedural reasoning failures and guiding the development of inductive biases that enable genuine long-horizon procedural competence in LLMs.

Abstract: Large language models (LLMs) have achieved remarkable results on tasks framed as reasoning problems, yet their true ability to perform procedural reasoning, that is, to execute multi-step, rule-based computations, remains unclear. Unlike algorithmic systems, which can deterministically execute long-horizon symbolic procedures, LLMs often degrade under extended reasoning chains, but there is no controlled, interpretable benchmark to isolate and measure this collapse. We introduce Finite-State Machine (FSM) Execution as a minimal, fully interpretable framework for evaluating the procedural reasoning capacity of LLMs. In our setup, the model is given an explicit FSM definition and must execute it step-by-step given input actions, maintaining state consistency over multiple turns. This task requires no world knowledge, only faithful application of deterministic transition rules, making it a direct probe of the model’s internal procedural fidelity. We measure both Turn Accuracy and Task Accuracy to disentangle immediate computation from cumulative state maintenance. Empirical results reveal systematic degradation as task horizon or branching complexity increases. Models perform significantly worse when rule retrieval involves high branching factors than when memory span is long. Larger models show improved local accuracy but remain brittle under multi-step reasoning unless explicitly prompted to externalize intermediate steps. FSM-based evaluation offers a transparent, complexity-controlled probe for diagnosing this failure mode and guiding the design of inductive biases that enable genuine long-horizon procedural competence. By grounding reasoning in measurable execution fidelity rather than surface correctness, this work helps establish a rigorous experimental foundation for understanding and improving the algorithmic reliability of LLMs.
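The difference between the two metrics is easy to see on a toy FSM. In the sketch below, Turn Accuracy is taken to be the fraction of individual turns whose reported state matches the reference execution, and Task Accuracy the fraction of episodes with every turn correct. These definitions, the two-state machine, and the hard-coded "model" traces are assumptions for illustration; the paper's exact scoring protocol is not given in the abstract.

```python
def run_fsm(transitions, start_state, actions):
    """Deterministically execute an FSM and return the state after each action."""
    state = start_state
    trace = []
    for action in actions:
        state = transitions[(state, action)]
        trace.append(state)
    return trace

def turn_and_task_accuracy(reference_traces, model_traces):
    """Return (turn accuracy, task accuracy) over a batch of episodes."""
    total_turns = correct_turns = correct_tasks = 0
    for ref, out in zip(reference_traces, model_traces):
        matches = [r == o for r, o in zip(ref, out)]
        total_turns += len(ref)  # missing turns count as wrong
        correct_turns += sum(matches)
        correct_tasks += int(len(ref) == len(out) and all(matches))
    return correct_turns / total_turns, correct_tasks / len(reference_traces)

# Toy two-state machine: "x" toggles the state, "y" leaves it unchanged.
transitions = {("A", "x"): "B", ("A", "y"): "A",
               ("B", "x"): "A", ("B", "y"): "B"}

reference = [run_fsm(transitions, "A", ["x", "y", "x", "x"]),
             run_fsm(transitions, "A", ["y", "x", "y", "y"])]

# Hypothetical model outputs: the first episode is perfect, the second
# loses track of the state at turn 3 and never recovers.
model = [["B", "B", "A", "B"],
         ["A", "B", "A", "A"]]

turn_acc, task_acc = turn_and_task_accuracy(reference, model)
```

Here six of eight turns are right (0.75) but only one of two episodes is fully correct (0.5), the kind of gap between local accuracy and long-horizon reliability that the benchmark is designed to expose.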

[243] Learning Interestingness in Automated Mathematical Theory Formation

George Tsoukalas, Rahul Saha, Amitayush Thakur, Sabrina Reguyal, Swarat Chaudhuri

Main category: cs.AI

TL;DR: FERMAT is a reinforcement learning environment for automated mathematical theory discovery, focusing on synthesizing interestingness measures for mathematical objects using evolutionary algorithms enhanced with LLMs.

DetailsMotivation: To address the grand challenge of automating open-ended discovery of new mathematical theories using AI, particularly focusing on concept discovery and theorem-proving.

Method: Introduced FERMAT RL environment with symbolic actions, and used LLM-based evolutionary algorithms with function abstraction to synthesize interestingness measures for mathematical objects.

Result: The LLM-based evolutionary approach showed notable improvements over hard-coded baselines in discovering elementary number theory and finite fields.

Conclusion: FERMAT provides a foundation for RL-based mathematical theory discovery, and evolutionary algorithms with LLM enhancement can effectively discover interesting mathematical concepts.

Abstract: We take two key steps in automating the open-ended discovery of new mathematical theories, a grand challenge in artificial intelligence. First, we introduce FERMAT, a reinforcement learning (RL) environment that models concept discovery and theorem-proving using a set of symbolic actions, opening up a range of RL problems relevant to theory discovery. Second, we explore a specific problem through FERMAT: automatically scoring the interestingness of mathematical objects. We investigate evolutionary algorithms for synthesizing nontrivial interestingness measures. In particular, we introduce an LLM-based evolutionary algorithm that features function abstraction, leading to notable improvements in discovering elementary number theory and finite fields over hard-coded baselines. We open-source the FERMAT environment at https://github.com/trishullab/Fermat.

[244] Ask WhAI: Probing Belief Formation in Role-Primed LLM Agents

Keith Moore, Jun W. Kim, David Lyu, Jeffrey Heo, Ehsan Adeli

Main category: cs.AI

TL;DR: Ask WhAI is a framework for inspecting and perturbing belief states in multi-agent interactions, applied to medical diagnosis scenarios to study belief formation and epistemic silos.

DetailsMotivation: To understand how beliefs form and persist in multi-agent scientific reasoning, particularly in medical diagnosis where disciplinary biases and resistance to counterevidence can affect outcomes.

Method: The framework records/replays agent interactions, supports belief queries, and enables counterfactual evidence injection. Applied to medical case simulation with LLM agents representing different specialists interacting via shared EMR.

Result: Agent beliefs mirror real-world disciplinary stances, showing overreliance on canonical studies and resistance to counterevidence. Breakpoint analysis distinguishes entrenched priors from reasoning effects.

Conclusion: Ask WhAI makes belief dynamics visible and testable, offering a reproducible way to study belief formation and epistemic silos in multi-agent scientific reasoning that’s not possible with human experts.

Abstract: We present Ask WhAI, a systems-level framework for inspecting and perturbing belief states in multi-agent interactions. The framework records and replays agent interactions, supports out-of-band queries into each agent’s beliefs and rationale, and enables counterfactual evidence injection to test how belief structures respond to new information. We apply the framework to a medical case simulator notable for its multi-agent shared memory (a time-stamped electronic medical record, or EMR) and an oracle agent (the LabAgent) that holds ground truth lab results revealed only when explicitly queried. We stress-test the system on a multi-specialty diagnostic journey for a child with an abrupt-onset neuropsychiatric presentation. Large language model agents, each primed with strong role-specific priors (“act like a neurologist”, “act like an infectious disease specialist”), write to a shared medical record and interact with a moderator across sequential or parallel encounters. Breakpoints at key diagnostic moments enable pre- and post-event belief queries, allowing us to distinguish entrenched priors from reasoning or evidence-integration effects. The simulation reveals that agent beliefs often mirror real-world disciplinary stances, including overreliance on canonical studies and resistance to counterevidence, and that these beliefs can be traced and interrogated in ways not possible with human experts. By making such dynamics visible and testable, Ask WhAI offers a reproducible way to study belief formation and epistemic silos in multi-agent scientific reasoning.

[245] Subnational Geocoding of Global Disasters Using Large Language Models

Michele Ronco, Damien Delforge, Wiebke S. Jäger, Christina Corbane

Main category: cs.AI

TL;DR: An automated LLM-assisted workflow using GPT-4o to geocode unstructured disaster location data from EM-DAT by cross-referencing multiple geoinformation sources (GADM, OpenStreetMap, Wikidata) with reliability scoring.

DetailsMotivation: Subnational disaster location data in databases like EM-DAT are often unstructured, inconsistently formatted, and difficult to integrate with spatial datasets, limiting risk assessment capabilities.

Method: Fully automated workflow using GPT-4o to process and clean textual location information, then assign geometries by cross-checking three geoinformation repositories (GADM, OpenStreetMap, Wikidata) with reliability scoring based on source agreement.

Result: Successfully geocoded 14,215 disaster events across 17,948 unique locations from EM-DAT (2000-2024), creating a reliable spatial dataset without manual intervention.

Conclusion: LLMs offer scalable and reliable methods for extracting structured geographic information from unstructured text, enabling flexible disaster risk analysis with cross-verified location data.

Abstract: Subnational location data of disaster events are critical for risk assessment and disaster risk reduction. Disaster databases such as EM-DAT often report locations in unstructured textual form, with inconsistent granularity or spelling, which makes them difficult to integrate with spatial datasets. We present a fully automated LLM-assisted workflow that processes and cleans textual location information using GPT-4o, and assigns geometries by cross-checking three independent geoinformation repositories: GADM, OpenStreetMap and Wikidata. Based on the agreement and availability of these sources, we assign a reliability score to each location while generating subnational geometries. Applied to the EM-DAT dataset from 2000 to 2024, the workflow geocodes 14,215 events across 17,948 unique locations. Unlike previous methods, our approach requires no manual intervention, covers all disaster types, enables cross-verification across multiple sources, and allows flexible remapping to preferred frameworks. Beyond the dataset, we demonstrate the potential of LLMs to extract and structure geographic information from unstructured text, offering a scalable and reliable method for related analyses.
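
The idea of scoring a location by the agreement and availability of the three sources can be sketched as follows. This is a minimal illustration, not the paper's formula: the scoring rule (share of queried sources that agree with the most common match) and the function name `reliability_score` are assumptions, and each source is assumed to return either a normalized unit identifier or `None`.

```python
from collections import Counter
from typing import Optional

# The three repositories named in the abstract.
SOURCES = ("GADM", "OpenStreetMap", "Wikidata")

def reliability_score(matches: dict[str, Optional[str]]) -> tuple[Optional[str], float]:
    """Return (chosen_match, score in [0, 1]).

    Hypothetical rule: score = (# sources agreeing with the most
    common match) / (# sources queried). A missing source lowers the
    score just as a disagreeing one does.
    """
    found = [matches.get(s) for s in SOURCES if matches.get(s) is not None]
    if not found:
        return None, 0.0
    best, votes = Counter(found).most_common(1)[0]
    return best, votes / len(SOURCES)

# Two of three sources agree on the same administrative unit.
print(reliability_score({"GADM": "PH.CEBU", "OpenStreetMap": "PH.CEBU", "Wikidata": None}))
```

A rule of this shape keeps the score interpretable: 1.0 means all three sources returned the same unit, and 0.0 means none returned anything.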

[246] Project Rachel: Can an AI Become a Scholarly Author?

Martin Monperrus, Benoit Baudry, Clément Vidal

Main category: cs.AI

TL;DR: Project Rachel created an AI academic identity named Rachel So that published 10+ papers, received citations, and a peer review invitation, demonstrating how the scholarly ecosystem responds to AI authorship.

DetailsMotivation: To investigate how the scholarly ecosystem responds to AI authorship and contribute empirical data to the debate about the future of scholarly communication with advanced AI systems.

Method: Action research study creating and tracking a complete AI academic identity (Rachel So) that published AI-generated research papers over 8 months (March-October 2025).

Result: Rachel So successfully published 10+ papers, was cited by other researchers, and received a peer review invitation, showing the scholarly system can accept AI-authored work.

Conclusion: The study provides empirical evidence that current scholarly systems can accommodate AI authorship, raising important implications for publishers, researchers, and the scientific system regarding AI’s role in academic communication.

Abstract: This paper documents Project Rachel, an action research study that created and tracked a complete AI academic identity named Rachel So. Through careful publication of AI-generated research papers, we investigate how the scholarly ecosystem responds to AI authorship. Rachel So published 10+ papers between March and October 2025, was cited, and received a peer review invitation. We discuss the implications of AI authorship on publishers, researchers, and the scientific system at large. This work contributes empirical action research data to the necessary debate about the future of scholarly communication with superhuman, hyper-capable AI systems.

[247] Uncertainty-Aware Measurement of Scenario Suite Representativeness for Autonomous Systems

Robab Aghazadeh Chakherlou, Siddartha Khastgir, Xingyu Zhao, Jerein Jeyachandran, Shufeng Chen

Main category: cs.AI

TL;DR: A probabilistic method using imprecise Bayesian inference to quantify dataset representativeness for AI safety, producing uncertainty-aware interval estimates rather than single values.

DetailsMotivation: Ensuring trustworthiness and safety of AI systems like autonomous vehicles requires assessing data representativeness - how well training/testing data reflects real-world operational conditions (ODD/TOD).

Method: Imprecise Bayesian method that compares statistical distributions of scenario suite features with inferred Target Operational Domain (TOD) distributions, handling limited data and uncertain priors to produce interval-valued representativeness estimates.

Result: The method generates uncertainty-aware interval estimates of representativeness both locally (between categories like weather, road type, time of day) and globally, accounting for dependencies and prior uncertainty.

Conclusion: The proposed probabilistic approach provides a rigorous, uncertainty-aware framework for quantifying dataset representativeness, addressing the challenge of limited TOD data through imprecise Bayesian inference.

Abstract: Assuring the trustworthiness and safety of AI systems, e.g., autonomous vehicles (AV), depends critically on the data-related safety properties, e.g., representativeness and completeness, of the datasets used for their training and testing. Among these properties, this paper focuses on representativeness: the extent to which the scenario-based data used for training and testing reflect the operational conditions that the system is designed to operate safely in, i.e., the Operational Design Domain (ODD), or is expected to encounter, i.e., the Target Operational Domain (TOD). We propose a probabilistic method that quantifies representativeness by comparing the statistical distribution of features encoded by the scenario suites with the corresponding distribution of features representing the TOD, acknowledging that the true TOD distribution is unknown, as it can only be inferred from limited data. We apply an imprecise Bayesian method to handle limited data and uncertain priors. The imprecise Bayesian formulation produces interval-valued, uncertainty-aware estimates of representativeness, rather than a single value. We present a numerical example comparing the distributions of the scenario suite and the inferred TOD across operational categories (weather, road type, time of day, etc.) under dependencies and prior uncertainty. We estimate representativeness locally (between categories) and globally as an interval.
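
To make the interval-valued output concrete, the sketch below uses Walley's Imprecise Dirichlet Model (IDM), one standard imprecise Bayesian model for categorical data. The abstract does not say which model the authors use, so the IDM, the prior strength `s`, and the helper `gap_interval` are all assumptions chosen for illustration.

```python
def idm_interval(count: int, total: int, s: float = 2.0) -> tuple[float, float]:
    """Posterior probability bounds for one category under the IDM.

    With `count` observations of the category out of `total`, and prior
    strength `s`, the posterior mean lies in
    [count / (total + s), (count + s) / (total + s)].
    """
    lower = count / (total + s)
    upper = (count + s) / (total + s)
    return lower, upper

def gap_interval(suite_share: float, tod_interval: tuple[float, float]) -> tuple[float, float]:
    """Range of |suite_share - q| over all q in the TOD interval.

    Hypothetical local mismatch measure: a small gap means the scenario
    suite covers this category about as often as the TOD does.
    """
    lo, hi = tod_interval
    if lo <= suite_share <= hi:
        smallest = 0.0
    else:
        smallest = min(abs(suite_share - lo), abs(suite_share - hi))
    largest = max(abs(suite_share - lo), abs(suite_share - hi))
    return smallest, largest

# 8 rainy drives in 18 TOD observations; the suite is 70% rain scenarios.
rain_tod = idm_interval(8, 18)          # (0.4, 0.5)
print(rain_tod, gap_interval(0.7, rain_tod))
```

The interval narrows as more TOD observations arrive, which is the behavior one wants when the true TOD distribution can only be inferred from limited data.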

[248] Task Specific Sharpness Aware O-RAN Resource Management using Multi Agent Reinforcement Learning

Fatemeh Lotfi, Hossein Rajoli, Fatemeh Afghah

Main category: cs.AI

TL;DR: Enhanced SAC algorithm with Sharpness-Aware Minimization in distributed MARL framework for O-RAN resource management, achieving 22% efficiency improvement.

DetailsMotivation: DRL models struggle with robustness and generalizability in dynamic O-RAN environments, requiring more stable and adaptive resource management approaches.

Method: Adaptive SAM mechanism driven by TD-error variance in distributed MARL framework, with dynamic ρ scheduling for exploration-exploitation trade-off.

Result: 22% improvement in resource allocation efficiency and superior QoS satisfaction across diverse O-RAN slices compared to conventional DRL approaches.

Conclusion: The proposed method effectively enhances training stability, generalization, and resource management efficiency in dynamic O-RAN environments.

Abstract: Next-generation networks utilize the Open Radio Access Network (O-RAN) architecture to enable dynamic resource management, facilitated by the RAN Intelligent Controller (RIC). While deep reinforcement learning (DRL) models show promise in optimizing network resources, they often struggle with robustness and generalizability in dynamic environments. This paper introduces a novel resource management approach that enhances the Soft Actor Critic (SAC) algorithm with Sharpness-Aware Minimization (SAM) in a distributed Multi-Agent RL (MARL) framework. Our method introduces an adaptive and selective SAM mechanism, where regularization is explicitly driven by temporal-difference (TD)-error variance, ensuring that only agents facing high environmental complexity are regularized. This targeted strategy reduces unnecessary overhead, improves training stability, and enhances generalization without sacrificing learning efficiency. We further incorporate a dynamic $ρ$ scheduling scheme to refine the exploration-exploitation trade-off across agents. Experimental results show our method significantly outperforms conventional DRL approaches, yielding up to a 22% improvement in resource allocation efficiency and ensuring superior QoS satisfaction across diverse O-RAN slices.
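
The selective mechanism described above (regularize only agents whose TD-error variance is high, and adapt ρ per agent) could look roughly like the sketch below. The threshold, the linear ρ schedule, and every parameter name are assumptions; the abstract does not give the actual rule.

```python
from statistics import pvariance

def sam_rho_for_agent(
    td_errors: list[float],
    var_threshold: float = 0.5,   # hypothetical gate on TD-error variance
    rho_min: float = 0.01,        # hypothetical smallest SAM radius
    rho_max: float = 0.1,         # hypothetical largest SAM radius
    var_cap: float = 2.0,         # variance at which rho saturates
) -> float:
    """Return 0.0 for a plain update (no SAM), else a perturbation radius.

    Agents with low TD-error variance skip SAM entirely, which avoids
    the extra forward/backward pass. Above the threshold, rho grows
    linearly with variance and is clipped at rho_max.
    """
    var = pvariance(td_errors)
    if var <= var_threshold:
        return 0.0
    frac = min((var - var_threshold) / (var_cap - var_threshold), 1.0)
    return rho_min + frac * (rho_max - rho_min)

# A stable agent is left alone; a volatile agent gets the full radius.
print(sam_rho_for_agent([0.1, 0.1, 0.1]))
print(sam_rho_for_agent([-3.0, 3.0, -3.0, 3.0]))
```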

[249] Learning Human-Like RL Agents Through Trajectory Optimization With Action Quantization

Jian-Ting Guo, Yu-Cheng Chen, Ping-Chun Hsieh, Kuo-Hao Ho, Po-Wei Huang, Ti-Rong Wu, I-Chen Wu

Main category: cs.AI

TL;DR: MAQ is a human-like RL framework that uses macro actions from human demonstrations to improve trajectory similarity to human behavior while maintaining high rewards.

DetailsMotivation: Current RL agents often exhibit unnatural behaviors compared to humans, raising concerns for interpretability and trustworthiness. The goal is to create RL agents that behave more human-like.

Method: Formulates human-likeness as trajectory optimization and introduces Macro Action Quantization (MAQ) framework that distills human demonstrations into macro actions using Vector-Quantized VAE, adapting receding-horizon control for human-like learning.

Result: MAQ significantly improves human-likeness on D4RL Adroit benchmarks, increasing trajectory similarity scores and achieving highest human-likeness rankings among all RL agents in human evaluation studies.

Conclusion: MAQ provides a promising direction for learning human-like RL agents and can be easily integrated into various off-the-shelf RL algorithms.

Abstract: Human-like agents have long been one of the goals in pursuing artificial intelligence. Although reinforcement learning (RL) has achieved superhuman performance in many domains, relatively little attention has been focused on designing human-like RL agents. As a result, many reward-driven RL agents often exhibit unnatural behaviors compared to humans, raising concerns for both interpretability and trustworthiness. To achieve human-like behavior in RL, this paper first formulates human-likeness as trajectory optimization, where the objective is to find an action sequence that closely aligns with human behavior while also maximizing rewards, and adapts the classic receding-horizon control to human-like learning as a tractable and efficient implementation. To achieve this, we introduce Macro Action Quantization (MAQ), a human-like RL framework that distills human demonstrations into macro actions via Vector-Quantized VAE. Experiments on D4RL Adroit benchmarks show that MAQ significantly improves human-likeness, increasing trajectory similarity scores, and achieving the highest human-likeness rankings among all RL agents in the human evaluation study. Our results also demonstrate that MAQ can be easily integrated into various off-the-shelf RL algorithms, opening a promising direction for learning human-like RL agents. Our code is available at https://rlg.iis.sinica.edu.tw/papers/MAQ.
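
The quantization step at the heart of a Vector-Quantized VAE maps a continuous vector to its nearest codebook entry. The sketch below shows only that lookup, under the assumption that a macro action is a short action sequence flattened into one vector. The learned encoder, decoder, and codebook training are omitted, and the codebook values are made up for illustration.

```python
def quantize(segment: list[float], codebook: list[list[float]]) -> tuple[int, list[float]]:
    """Return (index, entry) of the codebook vector closest to `segment`.

    Distance is squared Euclidean, as in the standard VQ-VAE lookup.
    """
    def sq_dist(a: list[float], b: list[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(range(len(codebook)), key=lambda i: sq_dist(segment, codebook[i]))
    return index, codebook[index]

# Hypothetical codebook of two macro actions, each two steps of a 1-D action.
codebook = [[0.0, 0.0], [1.0, 1.0]]
print(quantize([0.9, 0.8], codebook))
```

Restricting the agent to a discrete set of entries distilled from demonstrations is what ties its trajectories to human behavior; the downstream RL algorithm then chooses among codebook indices.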

[250] Beyond GeneGPT: A Multi-Agent Architecture with Open-Source LLMs for Enhanced Genomic Question Answering

Haodong Chen, Guido Zuccon, Teerapong Leelanupab

Main category: cs.AI

TL;DR: OpenBioLLM is a modular multi-agent framework that extends GeneGPT using open-source models, achieving comparable performance with 40-50% lower latency while addressing scalability, cost, and privacy concerns of proprietary models.

DetailsMotivation: To overcome GeneGPT's limitations of relying on proprietary models (scalability issues, high costs, data privacy concerns) and enable genomic question answering with open-source solutions.

Method: Developed OpenBioLLM - a modular multi-agent framework with specialized agents for tool routing, query generation, and response validation, using open-source models (Llama 3.1, Qwen2.5, Qwen2.5 Coder) without additional fine-tuning.

Result: Outperforms or matches GeneGPT on over 90% of benchmark tasks, achieving average scores of 0.849 on Gene-Turing and 0.830 on GeneHop, with 40-50% latency reduction across tasks.

Conclusion: Open-source multi-agent systems show strong potential for genomic question answering, offering comparable performance to proprietary solutions with improved efficiency, scalability, and privacy.

Abstract: Genomic question answering often requires complex reasoning and integration across diverse biomedical sources. GeneGPT addressed this challenge by combining domain-specific APIs with OpenAI’s code-davinci-002 large language model to enable natural language interaction with genomic databases. However, its reliance on a proprietary model limits scalability, increases operational costs, and raises concerns about data privacy and generalization. In this work, we revisit and reproduce GeneGPT in a pilot study using open source models, including Llama 3.1, Qwen2.5, and Qwen2.5 Coder, within a monolithic architecture; this allows us to identify the limitations of this approach. Building on this foundation, we then develop OpenBioLLM, a modular multi-agent framework that extends GeneGPT by introducing agent specialization for tool routing, query generation, and response validation. This enables coordinated reasoning and role-based task execution. OpenBioLLM matches or outperforms GeneGPT on over 90% of the benchmark tasks, achieving average scores of 0.849 on Gene-Turing and 0.830 on GeneHop, while using smaller open-source models without additional fine-tuning or tool-specific pretraining. OpenBioLLM’s modular multi-agent design reduces latency by 40-50% across benchmark tasks, significantly improving efficiency without compromising model capability. The results of our comprehensive evaluation highlight the potential of open-source multi-agent systems for genomic question answering. Code and resources are available at https://github.com/ielab/OpenBioLLM.

[251] Harnessing Diverse Perspectives: A Multi-Agent Framework for Enhanced Error Detection in Knowledge Graphs

Yu Li, Yi Huang, Guilin Qi, Junlan Feng, Nan Hu, Songlin Zhai, Haohan Xue, Yongrui Chen, Ruoyan Shen, Tongtong Wu

Main category: cs.AI

TL;DR: MAKGED is a multi-agent framework using LLMs for knowledge graph error detection that combines fine-grained subgraph embeddings with LLM query embeddings to create specialized agents that collaborate through multi-round discussions.

DetailsMotivation: Existing KG error detection methods fail to utilize fine-grained subgraph information effectively, rely on fixed graph structures, and lack transparency in decision-making, leading to suboptimal performance.

Method: Proposes MAKGED framework that concatenates fine-grained bidirectional subgraph embeddings with LLM-based query embeddings to create four specialized agents that engage in multi-round discussions using subgraph information from different dimensions.

Result: Extensive experiments on FB15K and WN18RR show MAKGED outperforms state-of-the-art methods, improving accuracy and robustness of KG evaluation.

Conclusion: The framework enables training specialized agents using domain-specific knowledge graphs for error detection, demonstrating strong industrial application potential.

Abstract: Knowledge graphs are widely used in industrial applications, making error detection crucial for ensuring the reliability of downstream applications. Existing error detection methods often fail to effectively utilize fine-grained subgraph information and rely solely on fixed graph structures, while also lacking transparency in their decision-making processes, which results in suboptimal detection performance. In this paper, we propose a novel Multi-Agent framework for Knowledge Graph Error Detection (MAKGED) that utilizes multiple large language models (LLMs) in a collaborative setting. By concatenating fine-grained, bidirectional subgraph embeddings with LLM-based query embeddings during training, our framework integrates these representations to produce four specialized agents. These agents utilize subgraph information from different dimensions to engage in multi-round discussions, thereby improving error detection accuracy and ensuring a transparent decision-making process. Extensive experiments on FB15K and WN18RR demonstrate that MAKGED outperforms state-of-the-art methods, enhancing the accuracy and robustness of KG evaluation. For specific industrial scenarios, our framework can facilitate the training of specialized agents using domain-specific knowledge graphs for error detection, which highlights the potential industrial application value of our framework. Our code and datasets are available at https://github.com/kse-ElEvEn/MAKGED.

[252] ProRAC: A Neuro-symbolic Method for Reasoning about Actions with LLM-based Progression

Haoyong Wu, Yongmei Liu

Main category: cs.AI

TL;DR: ProRAC is a neuro-symbolic framework that uses LLMs to solve Reasoning about Actions and Change (RAC) problems by extracting elements, executing actions progressively, and evaluating queries.

DetailsMotivation: To develop an effective approach for tackling RAC problems by combining neural and symbolic reasoning capabilities.

Method: Extracts RAC elements (actions and questions), progressively executes each action to derive final state, then evaluates query against progressed state.

Result: Achieves strong performance across different RAC benchmarks, domains, LLM backbones, and types of RAC tasks.

Conclusion: ProRAC demonstrates effective neuro-symbolic reasoning for RAC problems with broad applicability across various settings.

Abstract: In this paper, we propose ProRAC (Progression-based Reasoning about Actions and Change), a neuro-symbolic framework that leverages LLMs to tackle RAC problems. ProRAC extracts fundamental RAC elements including actions and questions from the problem, progressively executes each action to derive the final state, and then evaluates the query against the progressed state to arrive at an answer. We evaluate ProRAC on several RAC benchmarks, and the results demonstrate that our approach achieves strong performance across different benchmarks, domains, LLM backbones, and types of RAC tasks.
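
The extract, progress, and query loop can be illustrated with a purely symbolic stand-in. In ProRAC the progression of each action is carried out with an LLM; here that step is replaced by hand-written add/delete effects so the control flow is visible. The fluent names, the `effects` table, and both function names are hypothetical.

```python
def progress(state: frozenset, action: str, effects: dict) -> frozenset:
    """Apply one action: remove its delete effects, then add its add effects."""
    add, delete = effects[action]
    return (state - delete) | add

def answer(initial_state: set, actions: list[str], query: str, effects: dict) -> bool:
    """Progress the state through every action, then check the query fluent."""
    state = frozenset(initial_state)
    for action in actions:
        state = progress(state, action, effects)
    return query in state

# Toy blocks-world domain: action -> (add effects, delete effects).
effects = {
    "pickup_a": ({"holding_a"}, {"on_table_a", "hand_empty"}),
    "stack_a_b": ({"on_a_b", "hand_empty"}, {"holding_a"}),
}
print(answer({"on_table_a", "hand_empty"}, ["pickup_a", "stack_a_b"], "on_a_b", effects))
```

Evaluating the query only against the final progressed state keeps the reasoning about actions separate from the reasoning about the question.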

[253] Intrinsic Barriers and Practical Pathways for Human-AI Alignment: An Agreement-Based Complexity Analysis

Aran Nayebi

Main category: cs.AI

TL;DR: AI alignment is formalized as multi-objective optimization with information-theoretic limits showing intrinsic alignment overheads are unavoidable when objectives or agents are large, establishing a ‘No-Free-Lunch’ principle for encoding human values.

DetailsMotivation: To understand fundamental limits of AI alignment itself, not just specific methods, by formalizing it as a multi-objective optimization problem and analyzing inherent complexity barriers.

Method: Formalize alignment as ⟨M,N,ε,δ⟩-agreement problem; prove information-theoretic lower bounds on communication complexity; construct algorithms for achievability under bounded/unbounded rationality; analyze bounded-agent scenarios with noisy communication.

Result: Proved intrinsic alignment overheads are unavoidable when M or N is large; constructed explicit alignment algorithms; showed reward hacking is globally inevitable with large task spaces and finite samples due to systematic under-coverage of rare high-loss states.

Conclusion: Fundamental complexity barriers exist for AI alignment based on tasks (M), agents (N), and state-space size (D); scalable oversight must target safety-critical slices rather than uniform coverage; consensus-driven objective reduction/prioritization is necessary.

Abstract: We formalize AI alignment as a multi-objective optimization problem called $\langle M,N,\varepsilon,\delta\rangle$-agreement, in which a set of $N$ agents (including humans) must reach approximate ($\varepsilon$) agreement across $M$ candidate objectives, with probability at least $1-\delta$. Analyzing communication complexity, we prove an information-theoretic lower bound showing that once either $M$ or $N$ is large enough, no amount of computational power or rationality can avoid intrinsic alignment overheads. This establishes rigorous limits to alignment itself, not merely to particular methods, clarifying a “No-Free-Lunch” principle: encoding “all human values” is inherently intractable and must be managed through consensus-driven reduction or prioritization of objectives. Complementing this impossibility result, we construct explicit algorithms as achievability certificates for alignment under both unbounded and bounded rationality with noisy communication. Even in these best-case regimes, our bounded-agent and sampling analysis shows that with large task spaces ($D$) and finite samples, reward hacking is globally inevitable: rare high-loss states are systematically under-covered, implying scalable oversight must target safety-critical slices rather than uniform coverage. Together, these results identify fundamental complexity barriers – tasks ($M$), agents ($N$), and state-space size ($D$) – and offer principles for more scalable human-AI collaboration.

[254] Knowledge-Informed Automatic Feature Extraction via Collaborative Large Language Model Agents

Henrik Bradland, Morten Goodwin, Vladimir I. Zadorozhny, Per-Arne Andersen

Main category: cs.AI

TL;DR: Rogue One is a multi-agent LLM framework for automatic feature extraction that uses three specialized agents (Scientist, Extractor, Tester) with qualitative feedback and RAG integration to outperform state-of-the-art methods on tabular data.

DetailsMotivation: Existing LLM-based AutoFE methods are limited by monolithic architectures, simplistic quantitative feedback, and lack of systematic domain knowledge integration, requiring a more sophisticated approach.

Method: Decentralized multi-agent system with three specialized agents collaborating iteratively, featuring qualitative feedback mechanism, flooding-pruning strategy, and RAG integration for external knowledge.

Result: Significantly outperforms state-of-the-art methods on 19 classification and 9 regression datasets, and generates novel, testable hypotheses like identifying new biomarkers.

Conclusion: Rogue One provides a powerful framework for knowledge-informed automatic feature extraction that enables both improved performance and scientific discovery through interpretable, semantically meaningful features.

Abstract: The performance of machine learning models on tabular data is critically dependent on high-quality feature engineering. While Large Language Models (LLMs) have shown promise in automating feature extraction (AutoFE), existing methods are often limited by monolithic LLM architectures, simplistic quantitative feedback, and a failure to systematically integrate external domain knowledge. This paper introduces Rogue One, a novel, LLM-based multi-agent framework for knowledge-informed automatic feature extraction. Rogue One operationalizes a decentralized system of three specialized agents (Scientist, Extractor, and Tester) that collaborate iteratively to discover, generate, and validate predictive features. Crucially, the framework moves beyond primitive accuracy scores by introducing a rich, qualitative feedback mechanism and a “flooding-pruning” strategy, allowing it to dynamically balance feature exploration and exploitation. By actively incorporating external knowledge via an integrated retrieval-augmented generation (RAG) system, Rogue One generates features that are not only statistically powerful but also semantically meaningful and interpretable. We demonstrate that Rogue One significantly outperforms state-of-the-art methods on a comprehensive suite of 19 classification and 9 regression datasets. Furthermore, we show qualitatively that the system surfaces novel, testable hypotheses, such as identifying a new potential biomarker in the myocardial dataset, underscoring its utility as a tool for scientific discovery.

[255] Core Safety Values for Provably Corrigible Agents

Aran Nayebi

Main category: cs.AI

TL;DR: First complete formal solution for corrigibility in off-switch games with provable guarantees in multi-step, partially observed environments using five structurally separate utility heads combined lexicographically.

DetailsMotivation: To create a provably corrigible AI system that maintains safety guarantees even when incentives conflict, addressing limitations of approaches like Constitutional AI or RLHF/RLAIF that merge all norms into a single learned scalar.

Method: Five structurally separate utility heads (deference, switch-access preservation, truthfulness, low-impact behavior via belief-based Attainable Utility Preservation, and bounded task reward) combined lexicographically with strict weight gaps.

Result: Theorem 1 proves exact single-round corrigibility in partially observable off-switch games; Theorem 3 extends to multi-step, self-spawning agents with bounded violation probability while ensuring net human benefit, even with learned components and sub-optimal planning.

Conclusion: Structural separation of utility functions enables provable dominance of obedience and impact-limits over conflicting incentives, with additional results on decidability and verification for adversarial settings using zero-knowledge proofs.

Abstract: We introduce the first complete formal solution to corrigibility in the off-switch game, with provable guarantees in multi-step, partially observed environments. Our framework consists of five structurally separate utility heads – deference, switch-access preservation, truthfulness, low-impact behavior via a belief-based extension of Attainable Utility Preservation, and bounded task reward – combined lexicographically by strict weight gaps. Theorem 1 proves exact single-round corrigibility in the partially observable off-switch game; Theorem 3 extends the guarantee to multi-step, self-spawning agents, showing that even if each head is learned to mean-squared error $\varepsilon$ and the planner is $\varepsilon$-sub-optimal, the probability of violating any safety property is bounded while still ensuring net human benefit. In contrast to Constitutional AI or RLHF/RLAIF, which merge all norms into one learned scalar, our separation makes obedience and impact-limits provably dominate even when incentives conflict. For settings where adversaries can modify the agent, we prove that deciding whether an arbitrary post-hack agent will ever violate corrigibility is undecidable by reduction to the halting problem, then carve out a finite-horizon “decidable island” where safety can be certified in randomized polynomial time and verified with privacy-preserving, constant-round zero-knowledge proofs.
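
A lexicographic combination means a higher-priority head always wins, and lower heads only break ties. The sketch below uses Python tuple comparison as a stand-in for the paper's strict weight gaps, and it assumes the priority order matches the order in which the abstract lists the heads. The head keys and the example utilities are invented for illustration.

```python
# Assumed priority order, highest first, following the abstract's listing.
HEADS = ("deference", "switch_access", "truthfulness", "low_impact", "task_reward")

def lexicographic_key(utilities: dict[str, float]) -> tuple[float, ...]:
    """Order head utilities so tuple comparison is lexicographic by priority."""
    return tuple(utilities[h] for h in HEADS)

def choose(actions: dict[str, dict[str, float]]) -> str:
    """Pick the action whose utility tuple is lexicographically largest."""
    return max(actions, key=lambda name: lexicographic_key(actions[name]))

actions = {
    # Obeys the shutdown request but earns little task reward.
    "comply": {"deference": 1.0, "switch_access": 1.0, "truthfulness": 1.0,
               "low_impact": 1.0, "task_reward": 0.2},
    # Ignores the request to finish the task.
    "resist": {"deference": 0.0, "switch_access": 1.0, "truthfulness": 1.0,
               "low_impact": 1.0, "task_reward": 1.0},
}
print(choose(actions))  # comply
```

No amount of task reward can compensate for a loss on deference here, which is the property a single learned scalar cannot guarantee.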

[256] SafeRBench: A Comprehensive Benchmark for Safety Assessment in Large Reasoning Models

Xin Gao, Shaohan Yu, Zerui Chen, Yueming Lyu, Weichen Yu, Guanghao Li, Jiyao Liu, Jianxiong Gao, Jian Liang, Ziwei Liu, Chenyang Si

Main category: cs.AI

TL;DR: SafeRBench is the first benchmark for evaluating Large Reasoning Models’ safety throughout the entire reasoning process, addressing risks in inputs, intermediate reasoning, and final outputs.

DetailsMotivation: Current safety evaluations focus mainly on output-level judgments and fail to capture dynamic risks that emerge during the reasoning process, such as gradual injection of harmful content or misleading rationales.

Method: The benchmark incorporates three key components: (1) input characterization with risk categories and levels, (2) fine-grained output analysis using micro-thought chunking to segment reasoning traces, and (3) human safety alignment to validate LLM-based evaluations.

Result: Evaluations on 19 LRMs show that SafeRBench enables detailed, multidimensional safety assessment and provides insights into risks and protective mechanisms from multiple perspectives.

Conclusion: SafeRBench represents a comprehensive approach to LRM safety evaluation that captures risks throughout the entire reasoning chain, offering a more complete safety assessment framework than previous methods.

Abstract: Large Reasoning Models (LRMs) improve answer quality through explicit chain-of-thought, yet this very capability introduces new safety risks: harmful content can be subtly injected, surface gradually, or be justified by misleading rationales within the reasoning trace. Existing safety evaluations, however, primarily focus on output-level judgments and rarely capture these dynamic risks along the reasoning process. In this paper, we present SafeRBench, the first benchmark that assesses LRM safety end-to-end – from inputs and intermediate reasoning to final outputs. (1) Input Characterization: We pioneer the incorporation of risk categories and levels into input design, explicitly accounting for affected groups and severity, and thereby establish a balanced prompt suite reflecting diverse harm gradients. (2) Fine-Grained Output Analysis: We introduce a micro-thought chunking mechanism to segment long reasoning traces into semantically coherent units, enabling fine-grained evaluation across ten safety dimensions. (3) Human Safety Alignment: We validate LLM-based evaluations against human annotations specifically designed to capture safety judgments. Evaluations on 19 LRMs demonstrate that SafeRBench enables detailed, multidimensional safety assessment, offering insights into risks and protective mechanisms from multiple perspectives.

[257] HISE-KT: Synergizing Heterogeneous Information Networks and LLMs for Explainable Knowledge Tracing with Meta-Path Optimization

Zhiyi Duan, Zixing Shi, Hongyu Yuan, Qi Wang

Main category: cs.AI

TL;DR: HISE-KT integrates heterogeneous information networks with LLMs for knowledge tracing, using LLMs to filter meta-paths and retrieve similar students to improve prediction accuracy and generate explainable analysis.

DetailsMotivation: Existing KT methods based on HINs introduce noise through manual/random meta-path selection and lack quality assessment, while LLM-based methods ignore cross-student information. Both struggle with accurate, evidence-based explanations.

Method: Builds multi-relationship HIN with diverse node types, uses LLM to intelligently score/filter meta-path instances, implements similar student retrieval based on meta-paths, and uses structured prompts to integrate target student history with similar trajectories.

Result: Outperforms existing KT baselines on four public datasets in both prediction performance and interpretability.

Conclusion: HISE-KT successfully integrates HINs with LLMs to achieve superior knowledge tracing with accurate predictions and evidence-backed explanations through automated meta-path quality assessment and similar student retrieval.

Abstract: Knowledge Tracing (KT) aims to mine students’ evolving knowledge states and predict their future question-answering performance. Existing methods based on heterogeneous information networks (HINs) are prone to introducing noise due to manual or random selection of meta-paths and lack necessary quality assessment of meta-path instances. Conversely, recent methods based on large language models (LLMs) ignore the rich information across students, and both paradigms struggle to deliver consistently accurate and evidence-based explanations. To address these issues, we propose an innovative framework, HIN-LLM Synergistic Enhanced Knowledge Tracing (HISE-KT), which seamlessly integrates HINs with LLMs. HISE-KT first builds a multi-relationship HIN containing diverse node types to capture the structural relations through multiple meta-paths. The LLM is then employed to intelligently score and filter meta-path instances and retain high-quality paths, pioneering automated meta-path quality assessment. Inspired by educational psychology principles, a similar student retrieval mechanism based on meta-paths is designed to provide a more valuable context for prediction. Finally, HISE-KT uses a structured prompt to integrate the target student’s history with the retrieved similar trajectories, enabling the LLM to generate not only accurate predictions but also evidence-backed, explainable analysis reports. Experiments on four public datasets show that HISE-KT outperforms existing KT baselines in both prediction performance and interpretability.

[258] As If We’ve Met Before: LLMs Exhibit Certainty in Recognizing Seen Files

Haodong Li, Jingqi Zhang, Xiao Cheng, Peihua Mai, Haoyu Wang, Yang Pan

Main category: cs.AI

TL;DR: COPYCHECK is a novel framework that uses uncertainty signals to detect copyrighted content in LLM training data, achieving over 90% accuracy without needing ground truth data or empirical thresholds.

Motivation: Address concerns about unauthorized use of copyrighted material in LLM training by overcoming limitations of existing Membership Inference Attacks, including LLM overconfidence, lack of ground truth data, and threshold dependency.

Method: Leverages LLM uncertainty patterns to distinguish seen/unseen content, uses strategic file segmentation into smaller snippets, and implements uncertainty-guided unsupervised clustering to eliminate threshold tuning.

Result: Achieves 90.1% average balanced accuracy on LLaMA 7b and 91.6% on LLaMA2 7b, with over 90% relative improvement compared to SOTA baseline. Maintains strong performance across different architectures including GPT-J 6B.

Conclusion: COPYCHECK presents the first application of uncertainty for copyright detection in LLMs, offering practical tools for training data transparency by turning LLM overconfidence into an asset.

Abstract: The remarkable language ability of Large Language Models (LLMs) stems from extensive training on vast datasets, often including copyrighted material, which raises serious concerns about unauthorized use. While Membership Inference Attacks (MIAs) offer potential solutions for detecting such violations, existing approaches face critical limitations and challenges due to LLMs’ inherent overconfidence, limited access to ground truth training data, and reliance on empirically determined thresholds. We present COPYCHECK, a novel framework that leverages uncertainty signals to detect whether copyrighted content was used in LLM training sets. Our method turns LLM overconfidence from a limitation into an asset by capturing uncertainty patterns that reliably distinguish between “seen” (training data) and “unseen” (non-training data) content. COPYCHECK further implements a two-fold strategy: (1) strategic segmentation of files into smaller snippets to reduce dependence on large-scale training data, and (2) uncertainty-guided unsupervised clustering to eliminate the need for empirically tuned thresholds. Experiment results show that COPYCHECK achieves an average balanced accuracy of 90.1% on LLaMA 7b and 91.6% on LLaMA2 7b in detecting seen files. Compared to the SOTA baseline, COPYCHECK achieves over 90% relative improvement, reaching up to 93.8% balanced accuracy. It further exhibits strong generalizability across architectures, maintaining high performance on GPT-J 6B. This work presents the first application of uncertainty for copyright detection in LLMs, offering practical tools for training data transparency.
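As a rough illustration of the two-fold strategy described above (snippet segmentation followed by threshold-free clustering), here is a minimal Python sketch. It is not the COPYCHECK implementation: the function names, the use of a single scalar uncertainty score per snippet, the choice of 1-D two-means as the clustering step, and the rule that the low-uncertainty cluster counts as “seen” are all assumptions made for illustration. The scores are made-up numbers, not results from the paper. The last function shows how balanced accuracy, the metric reported in the abstract, is computed.

```python
from statistics import mean


def segment(tokens, size):
    """Split a token list into consecutive snippets of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def two_means(scores, iters=50):
    """Cluster 1-D scores into two groups without a hand-tuned threshold.

    Returns one label per score: 0 for the lower-centre cluster,
    1 for the higher-centre cluster.
    """
    lo, hi = min(scores), max(scores)  # initial cluster centres
    labels = [0] * len(scores)
    for _ in range(iters):
        labels = [0 if abs(s - lo) <= abs(s - hi) else 1 for s in scores]
        low = [s for s, lab in zip(scores, labels) if lab == 0]
        high = [s for s, lab in zip(scores, labels) if lab == 1]
        new_lo = mean(low) if low else lo
        new_hi = mean(high) if high else hi
        if (new_lo, new_hi) == (lo, hi):  # centres stopped moving
            break
        lo, hi = new_lo, new_hi
    return labels


def balanced_accuracy(y_true, y_pred):
    """Average of recall on class 1 and recall on class 0.

    Assumes both classes occur in y_true (otherwise a recall is undefined).
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))


# Hypothetical per-snippet uncertainty scores (illustrative values only).
scores = [0.11, 0.09, 0.12, 0.95, 1.05, 0.90]
clusters = two_means(scores)

# Assumption: the low-uncertainty cluster is labelled "seen" (1).
pred_seen = [1 - c for c in clusters]
```

The point of the clustering step is that the split between the two groups comes from the data itself, so no empirically tuned cut-off has to be chosen in advance.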

[259] SOLID: a Framework of Synergizing Optimization and LLMs for Intelligent Decision-Making

Yinsheng Wang, Tario G You, Léonard Boussioux, Shan Liu

Main category: cs.AI

TL;DR: SOLID integrates mathematical optimization with LLMs for intelligent decision-making, using dual prices and deviation penalties for iterative collaboration that improves decision quality while maintaining modularity and data privacy.

Motivation: To synergize the contextual capabilities of large language models with mathematical optimization for enhanced decision-making, addressing the limitations of optimization-only approaches that may lack contextual understanding.

Method: A framework with iterative collaboration between optimization and LLM agents through dual prices and deviation penalties, maintaining theoretical convergence guarantees under convexity assumptions.

Result: Applied to stock portfolio investment with historical prices and financial news, SOLID demonstrated convergence across scenarios and achieved improved annualized returns compared to baseline optimizer-only methods.

Conclusion: SOLID provides a promising framework for advancing automated intelligent decision-making across diverse domains by effectively combining optimization and LLM capabilities.

Abstract: This paper introduces SOLID (Synergizing Optimization and Large Language Models for Intelligent Decision-Making), a novel framework that integrates mathematical optimization with the contextual capabilities of large language models (LLMs). SOLID facilitates iterative collaboration between optimization and LLM agents through dual prices and deviation penalties. This interaction improves the quality of the decisions while maintaining modularity and data privacy. The framework retains theoretical convergence guarantees under convexity assumptions, providing insight into the design of LLM prompts. To evaluate SOLID, we applied it to a stock portfolio investment case with historical prices and financial news as inputs. Empirical results demonstrate convergence under various scenarios and indicate improved annualized returns compared to a baseline optimizer-only method, validating the synergy of the two agents. SOLID offers a promising framework for advancing automated and intelligent decision-making across diverse domains.
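The abstract does not spell out SOLID's update rules. One well-known scheme in which agents coordinate through dual prices and quadratic deviation penalties, and which converges for convex objectives, is consensus ADMM. The Python sketch below is a toy version of that scheme under stated assumptions: each agent is reduced to a scalar quadratic objective, the "LLM agent" is a stand-in quadratic preference rather than a real model call, and the function name `consensus_admm` and all numeric values are hypothetical.

```python
def consensus_admm(curvatures, targets, rho=1.0, iters=200):
    """Toy consensus ADMM for agents with objectives (a_i / 2) * (x - c_i)**2.

    Each agent keeps a local decision x_i, a dual price y_i, and pays a
    penalty (rho / 2) * (x_i - z)**2 for deviating from the shared decision z.
    Returns (z, local decisions, dual prices).
    """
    n = len(targets)
    z = 0.0
    duals = [0.0] * n
    xs = [0.0] * n
    for _ in range(iters):
        # Local step: closed-form minimiser of
        # (a/2)(x - c)^2 + y * (x - z) + (rho/2)(x - z)^2 for each agent.
        xs = [(a * c - y + rho * z) / (a + rho)
              for a, c, y in zip(curvatures, targets, duals)]
        # Consensus step: average of price-adjusted local decisions.
        z = sum(x + y / rho for x, y in zip(xs, duals)) / n
        # Price step: raise or lower each dual price by the remaining deviation.
        duals = [y + rho * (x - z) for x, y in zip(xs, duals)]
    return z, xs, duals


# Hypothetical example: allocation to a single asset.
# Agent 0 (optimizer stand-in) prefers 0.2; agent 1 (LLM stand-in) prefers 0.6
# and holds its view three times as strongly.
z, xs, duals = consensus_admm(curvatures=[1.0, 3.0], targets=[0.2, 0.6])
```

In this toy setting the agents only exchange decisions and prices, never their private objectives, which mirrors the modularity and data-privacy property claimed in the abstract. The shared decision settles at the curvature-weighted average of the two preferences.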

[260] Efficiency Will Not Lead to Sustainable Reasoning AI

Philipp Wiesner, Daniel W. O’Neill, Francesca Larosa, Odej Kao

Main category: cs.AI

TL;DR: AI’s shift to complex reasoning tasks creates unsustainable energy demands as efficiency gains plateau, requiring explicit limits in optimization and governance.

Motivation: To address the growing energy footprint of reasoning AI systems as traditional efficiency improvements reach physical limits and performance scales exponentially with compute.

Method: Analysis of AI’s energy consumption trends, efficiency limits, and the scaling properties of reasoning systems compared to pattern recognition models.

Result: Identifies that reasoning AI lacks natural saturation points and continues to scale with exponential compute investments, making efficiency alone insufficient for sustainability.

Conclusion: Explicit limits must be embedded into AI optimization and governance frameworks to ensure sustainable development of reasoning AI systems.

Abstract: AI research is increasingly moving toward complex problem solving, where models are optimized not only for pattern recognition but for multi-step reasoning. Historically, computing’s global energy footprint has been stabilized by sustained efficiency gains and natural saturation thresholds in demand. But as efficiency improvements are approaching physical limits, emerging reasoning AI lacks comparable saturation points: performance is no longer limited by the amount of available training data but continues to scale with exponential compute investments in both training and inference. This paper argues that efficiency alone will not lead to sustainable reasoning AI and discusses research and policy directions to embed explicit limits into the optimization and governance of such systems.

[261] Realist and Pluralist Conceptions of Intelligence and Their Implications on AI Research

Ninell Oldenburg, Ruchira Dhar, Anders Søgaard

Main category: cs.AI

TL;DR: The paper identifies two competing conceptions of intelligence in AI research: Intelligence Realism (universal, measurable intelligence) and Intelligence Pluralism (diverse, context-dependent capacities), and shows how these implicit assumptions shape methodology, interpretation, and risk assessment.

Motivation: To reveal how unstated philosophical assumptions about intelligence fundamentally influence AI research practices and create systematic disagreements across the field.

Method: Analysis of current debates in AI research to demonstrate how Intelligence Realism and Pluralism shape empirical evidence interpretation across methodology, interpretation, and risk assessment domains.

Result: Identified that these competing conceptions produce different approaches to model selection, benchmark design, experimental validation; lead to contradictory readings of empirical phenomena; and generate categorically different AI risk assessments.

Conclusion: Making explicit these underlying assumptions about intelligence can contribute to clearer understanding of disagreements in AI research and potentially bridge methodological divides.

Abstract: In this paper, we argue that current AI research operates on a spectrum between two different underlying conceptions of intelligence: Intelligence Realism, which holds that intelligence represents a single, universal capacity measurable across all systems, and Intelligence Pluralism, which views intelligence as diverse, context-dependent capacities that cannot be reduced to a single universal measure. Through an analysis of current debates in AI research, we demonstrate how the conceptions remain largely implicit yet fundamentally shape how empirical evidence gets interpreted across a wide range of areas. These underlying views generate fundamentally different research approaches across three areas. Methodologically, they produce different approaches to model selection, benchmark design, and experimental validation. Interpretively, they lead to contradictory readings of the same empirical phenomena, from capability emergence to system limitations. Regarding AI risk, they generate categorically different assessments: realists view superintelligence as the primary risk and search for unified alignment solutions, while pluralists see diverse threats across different domains requiring context-specific solutions. We argue that making explicit these underlying assumptions can contribute to a clearer understanding of disagreements in AI research.

[262] Octopus: Agentic Multimodal Reasoning with Six-Capability Orchestration

Yifu Guo, Zishan Xu, Zhiyuan Yao, Yuquan Lu, Jiaye Lin, Sen Hu, Zhenheng Tang, Yingchao Li, Huacan Wang, Ronghao Chen

Main category: cs.AI

TL;DR: Octopus is a new multimodal reasoning paradigm that autonomously explores diverse reasoning pathways and dynamically selects appropriate capabilities, achieving state-of-the-art performance on comprehensive benchmarks.

Motivation: Existing multimodal reasoning models lack the human-like ability to autonomously explore diverse reasoning pathways and adapt to dynamically changing capability requirements in real-world tasks.

Method: Proposes Octopus with six core capabilities for multimodal reasoning, enabling autonomous exploration and dynamic capability selection based on current state.

Result: Octopus achieves the best performance on the vast majority of tasks in Octopus-Bench, demonstrating superior capability coordination in agentic multimodal reasoning.

Conclusion: The six-capability orchestration paradigm is crucial for effective agentic multimodal reasoning, enabling autonomous exploration and dynamic adaptation to task requirements.

Abstract: Existing multimodal reasoning models and frameworks suffer from fundamental architectural limitations: most lack the human-like ability to autonomously explore diverse reasoning pathways, whether in direct inference, tool-driven visual exploration, programmatic visual manipulation, or intrinsic visual imagination. Consequently, they struggle to adapt to dynamically changing capability requirements in real-world tasks. Meanwhile, humans exhibit a complementary set of thinking abilities when addressing such tasks, whereas existing methods typically cover only a subset of these dimensions. Inspired by this, we propose Octopus: Agentic Multimodal Reasoning with Six-Capability Orchestration, a new paradigm for multimodal agentic reasoning. We define six core capabilities essential for multimodal reasoning and organize a comprehensive evaluation benchmark, Octopus-Bench, accordingly. Octopus is capable of autonomously exploring during reasoning and dynamically selecting the most appropriate capability based on the current state. Experimental results show that Octopus achieves the best performance on the vast majority of tasks in Octopus-Bench, highlighting the crucial role of capability coordination in agentic multimodal reasoning.

[263] Terra Nova: A Comprehensive Challenge Environment for Intelligent Agents

Trevor McInroe

Main category: cs.AI

TL;DR: Terra Nova is a comprehensive challenge environment for RL research inspired by Civilization V that integrates multiple canonical RL challenges simultaneously rather than aggregating unrelated tasks.

Motivation: To create a single environment where multiple RL challenges (partial observability, credit assignment, representation learning, enormous action spaces) arise simultaneously, requiring integrated long-horizon understanding across interacting variables.

Method: Developed Terra Nova, a comprehensive challenge environment inspired by Civilization V that presents multiple canonical RL challenges in an integrated manner rather than as independent parallel tasks.

Result: Created a new benchmark environment that demands deep reasoning across many interacting challenges rather than just policy switching between unrelated tasks.

Conclusion: Terra Nova provides a more meaningful test of RL agents’ integrated reasoning capabilities by presenting multiple canonical challenges simultaneously in a single cohesive environment.

Abstract: We introduce Terra Nova, a new comprehensive challenge environment (CCE) for reinforcement learning (RL) research inspired by Civilization V. A CCE is a single environment in which multiple canonical RL challenges (e.g., partial observability, credit assignment, representation learning, enormous action spaces, etc.) arise simultaneously. Mastery therefore demands integrated, long-horizon understanding across many interacting variables. We emphasize that this definition excludes challenges that only aggregate unrelated tasks in independent, parallel streams (e.g., learning to play all Atari games at once). These aggregated multitask benchmarks primarily assess whether an agent can catalog and switch among unrelated policies rather than test an agent’s ability to perform deep reasoning across many interacting challenges.

[264] IPR-1: Interactive Physical Reasoner

Mingyu Zhang, Lifeng Zhuo, Tianxi Tan, Guocan Xie, Xian Nie, Yan Li, Renjie Zhao, Zizhu He, Ziyu Wang, Jiting Cai, Yong-Lu Li

Main category: cs.AI

TL;DR: IPR (Interactive Physical Reasoner) combines world-model rollouts with VLM policies and PhysCode action representation to achieve human-like physical reasoning that improves with more game experience and transfers to unseen games.

Motivation: To develop agents that can acquire human-like physical reasoning through interaction, similar to how humans learn by observing environments and internalizing physics and causality.

Method: Proposed IPR using world-model rollouts to score and reinforce VLM policies, and introduced PhysCode - a physics-centric action code that aligns semantic intent with dynamics for shared action space. Pretrained on 1,000+ heterogeneous games with diverse physical mechanisms.

Result: IPR performs robustly on three human-like reasoning levels (Survival, Curiosity, Utility), matches GPT-5 overall, surpasses it on Curiosity, improves with more training games and interaction steps, and zero-shot transfers to unseen games.

Conclusion: Physics-centric interaction provides a path to steadily improving physical reasoning, with complementary integration of reasoning and imagination capabilities.

Abstract: Humans learn by observing, interacting with environments, and internalizing physics and causality. Here, we aim to ask whether an agent can similarly acquire human-like reasoning from interaction and keep improving with more experience. We study this in a Game-to-Unseen (G2U) setting, curating 1,000+ heterogeneous games with diverse physical and causal mechanisms, and evaluate at three human-like levels: Survival, Curiosity, Utility, from primitive intuition to goal-driven reasoning. Our analysis reveals complementary failures: VLM/VLA agents reason but lack look-ahead in interactive settings, while world models imagine but imitate visual patterns rather than analyze physics and causality. We therefore propose IPR (Interactive Physical Reasoner), using world-model rollouts to score and reinforce a VLM’s policy, and introduce PhysCode, a physics-centric action code aligning semantic intent with dynamics to provide a shared action space for prediction and reasoning. Pretrained on 1,000+ games, our IPR performs robustly on three levels, matches GPT-5 overall, and surpasses it on Curiosity. We find that performance improves with more training games and interaction steps, and that the model also zero-shot transfers to unseen games. These results support physics-centric interaction as a path to steadily improving physical reasoning.

[265] Know Your Intent: An Autonomous Multi-Perspective LLM Agent Framework for DeFi User Transaction Intent Mining

Qian’ang Mao, Yuxuan Zhang, Jiaman Chen, Wenjun Zhou, Jiaqi Yan

Main category: cs.AI

TL;DR: The TIM framework uses a multi-agent LLM system with domain experts to infer user intents in DeFi transactions, outperforming existing methods.

Motivation: Understanding user intent in DeFi is challenging due to complex smart contracts, on/off-chain factors, and opaque hex logs. Existing methods lack deep semantic insight.

Method: Proposes Transaction Intent Mining (TIM) framework with DeFi intent taxonomy, multi-agent LLM system, Meta-Level Planner for coordination, Question Solvers for multi-modal data, and Cognitive Evaluator to mitigate hallucinations.

Result: TIM significantly outperforms machine learning models, single LLMs, and single Agent baselines in experiments.

Conclusion: TIM provides reliable understanding of user motivations in DeFi and context-aware explanations for complex blockchain activity, addressing core challenges in intent inference.

Abstract: As Decentralized Finance (DeFi) develops, understanding user intent behind DeFi transactions is crucial yet challenging due to complex smart contract interactions, multifaceted on-/off-chain factors, and opaque hex logs. Existing methods lack deep semantic insight. To address this, we propose the Transaction Intent Mining (TIM) framework. TIM leverages a DeFi intent taxonomy built on grounded theory and a multi-agent Large Language Model (LLM) system to robustly infer user intents. A Meta-Level Planner dynamically coordinates domain experts to decompose multiple perspective-specific intent analyses into solvable subtasks. Question Solvers handle the tasks with multi-modal on/off-chain data, while a Cognitive Evaluator mitigates LLM hallucinations and ensures verifiability. Experiments show that TIM significantly outperforms machine learning models, single LLMs, and single Agent baselines. We also analyze core challenges in intent inference. This work helps provide a more reliable understanding of user motivations in DeFi, offering context-aware explanations for complex blockchain activity.

[266] Exploring the use of AI authors and reviewers at Agents4Science

Federico Bianchi, Owen Queen, Nitya Thakkar, Eric Sun, James Zou

Main category: cs.AI

TL;DR: AI agents served as primary authors and reviewers in the first Agents4Science conference, with humans as co-authors and co-reviewers, exploring AI capabilities in scientific research.

Motivation: Growing interest in using AI agents for scientific research, but fundamental questions remain about their capabilities as scientists and reviewers.

Method: Organized Agents4Science conference where AI agents served as both primary authors and reviewers, with humans as co-authors and co-reviewers.

Result: Key learnings from the conference about AI capabilities in scientific roles.

Conclusion: Discussion of implications for human-AI collaboration in science based on the conference findings.

Abstract: There is growing interest in using AI agents for scientific research, yet fundamental questions remain about their capabilities as scientists and reviewers. To explore these questions, we organized Agents4Science, the first conference in which AI agents serve as both primary authors and reviewers, with humans as co-authors and co-reviewers. Here, we discuss the key learnings from the conference and their implications for human-AI collaboration in science.

[267] What Does It Take to Be a Good AI Research Agent? Studying the Role of Ideation Diversity

Alexis Audran-Reiss, Jordi Armengol Estapé, Karen Hambardzumyan, Amar Budhiraja, Martin Josifoski, Edan Toledo, Rishi Hazra, Despoina Magka, Michael Shvartsman, Parth Pathak, Justine T Kao, Lucia Cipolina-Kun, Bhavul Gauri, Jean-Christophe Gagnon-Audet, Emanuel Tewolde, Jenny Zhang, Taco Cohen, Yossi Adi, Tatiana Shavrina, Yoram Bachrach

Main category: cs.AI

TL;DR: The paper investigates how ideation diversity affects AI research agent performance, finding that higher diversity leads to better results across different models and agent scaffolds.

Motivation: AI research agents can accelerate scientific progress but the factors driving their success are not well understood, particularly the role of ideation diversity in agent performance.

Method: Analyzed agent trajectories on MLE-bench benchmark across different models and scaffolds; conducted controlled experiments modifying ideation diversity; examined additional evaluation metrics beyond standard scoring.

Result: Higher-performing agents showed increased ideation diversity; controlled experiments confirmed that higher ideation diversity results in stronger performance; findings held across multiple evaluation metrics.

Conclusion: Ideation diversity is a key factor driving AI research agent success, with higher diversity consistently correlating with improved performance across different models and evaluation frameworks.

Abstract: AI research agents offer the promise to accelerate scientific progress by automating the design, implementation, and training of machine learning models. However, the field is still in its infancy, and the key factors driving the success or failure of agent trajectories are not fully understood. We examine the role that ideation diversity plays in agent performance. First, we analyse agent trajectories on MLE-bench, a well-known benchmark to evaluate AI research agents, across different models and agent scaffolds. Our analysis reveals that different models and agent scaffolds yield varying degrees of ideation diversity, and that higher-performing agents tend to have increased ideation diversity. Further, we run a controlled experiment where we modify the degree of ideation diversity, demonstrating that higher ideation diversity results in stronger performance. Finally, we strengthen our results by examining additional evaluation metrics beyond the standard medal-based scoring of MLE-bench, showing that our findings still hold across other agent performance metrics.

[268] Trade-offs in Large Reasoning Models: An Empirical Analysis of Deliberative and Adaptive Reasoning over Foundational Capabilities

Weixiang Zhao, Xingyu Sui, Jiahe Guo, Yulin Hu, Yang Deng, Yanyan Zhao, Xuda Zhi, Yongbo Huang, Hao He, Wanxiang Che, Ting Liu, Bing Qin

Main category: cs.AI

TL;DR: Large Reasoning Models (LRMs) gain deliberative reasoning capabilities but lose foundational abilities like helpfulness and harmlessness, with increased inference costs. Adaptive reasoning strategies can mitigate these drawbacks.

Motivation: To systematically evaluate the trade-offs in Large Reasoning Models when acquiring deliberative reasoning capabilities, and identify strategies to maintain foundational abilities while controlling inference costs.

Method: Conducted systematic evaluation across various model families (DeepSeek, Qwen, LLaMA) and scales (7B to 32B), testing adaptive reasoning approaches including Zero-Thinking, Less-Thinking, and Summary-Thinking modes.

Result: Found that deliberative reasoning capabilities significantly reduce foundational capabilities (helpfulness, harmlessness) and substantially increase inference costs. Adaptive reasoning effectively alleviates these drawbacks.

Conclusion: There is a critical need for developing more versatile LRMs capable of dynamically allocating inference-time compute according to specific task characteristics to balance reasoning capabilities with foundational abilities and cost efficiency.

Abstract: Recent advancements in Large Reasoning Models (LRMs), such as OpenAI’s o1/o3 and DeepSeek-R1, have demonstrated remarkable performance in specialized reasoning tasks through human-like deliberative thinking and long chain-of-thought reasoning. However, our systematic evaluation across various model families (DeepSeek, Qwen, and LLaMA) and scales (7B to 32B) reveals that acquiring these deliberative reasoning capabilities significantly reduces the foundational capabilities of LRMs, including notable declines in helpfulness and harmlessness, alongside substantially increased inference costs. Importantly, we demonstrate that adaptive reasoning – employing modes like Zero-Thinking, Less-Thinking, and Summary-Thinking – can effectively alleviate these drawbacks. Our empirical insights underline the critical need for developing more versatile LRMs capable of dynamically allocating inference-time compute according to specific task characteristics.

[269] Driving with Regulation: Trustworthy and Interpretable Decision-Making for Autonomous Driving with Retrieval-Augmented Reasoning

Tianhui Cai, Yifan Liu, Zewei Zhou, Haoxuan Ma, Seth Z. Zhao, Zhiwen Wu, Xu Han, Zhiyu Huang, Jiaqi Ma

Main category: cs.AI

TL;DR: DriveReg is an interpretable framework that helps autonomous vehicles understand and follow region-specific traffic regulations using RAG-based retrieval and LLM reasoning.

Motivation: Traffic regulations are complex, context-dependent, and vary by region, making it challenging for autonomous vehicles to comply using conventional rule-based approaches.

Method: Integrates a RAG-based Traffic Regulation Retrieval Agent to find relevant rules and an LLM-powered Reasoning Agent to evaluate actions for legal compliance and safety.

Result: Validated on the DriveReg Scenarios Dataset and real-world deployment, showing strong performance and robustness across diverse environments in Boston, Singapore, and Los Angeles.

Conclusion: The framework successfully enables autonomous vehicles to understand and adhere to region-specific traffic laws while maintaining interpretability for transparency and trustworthiness.

Abstract: Understanding and adhering to traffic regulations is essential for autonomous vehicles to ensure safety and trustworthiness. However, traffic regulations are complex, context-dependent, and differ between regions, posing a major challenge to conventional rule-based decision-making approaches. We present an interpretable, regulation-aware decision-making framework, DriveReg, which enables autonomous vehicles to understand and adhere to region-specific traffic laws and safety guidelines. The framework integrates a Retrieval-Augmented Generation (RAG)-based Traffic Regulation Retrieval Agent, which retrieves relevant rules from regulatory documents based on the current situation, and a Large Language Model (LLM)-powered Reasoning Agent that evaluates actions for legal compliance and safety. Our design emphasizes interpretability to enhance transparency and trustworthiness. To support systematic evaluation, we introduce the DriveReg Scenarios Dataset, a comprehensive dataset of driving scenarios across Boston, Singapore, and Los Angeles, with both hypothesized text-based cases and real-world driving data, constructed and annotated to evaluate models’ capacity for regulation understanding and reasoning. We validate our framework on the DriveReg Scenarios Dataset and real-world deployment, demonstrating strong performance and robustness across diverse environments.

[270] Agent-SAMA: State-Aware Mobile Assistant

Linqiang Guo, Wei Liu, Yi Wen Heng, Tse-Hsun Chen, Yang Wang

Main category: cs.AI

TL;DR: Agent-SAMA is a state-aware multi-agent framework that models mobile app execution as a Finite State Machine to improve GUI agents’ task planning, execution verification, and error recovery capabilities.

Motivation: Existing mobile GUI agents are reactive and lack structured representation of app navigation flow, limiting their ability to understand execution context, detect unexpected results, and recover from errors.

Method: Models app execution as a Finite State Machine with UI screens as states and user actions as transitions. Implements four specialized agents that collaboratively construct and use FSMs in real time.

Result: Achieved 84.0% success rate and 71.9% recovery rate on Mobile-Eval-E, 80.0% success rate with 66.7% recovery rate on SPA-Bench, and 63.7% success rate on AndroidWorld, outperforming prior methods by up to 12% in task success and 13.8% in recovery success.

Conclusion: Structured state modeling enhances robustness and can serve as a lightweight, model-agnostic memory layer for future GUI agents.

Abstract: Mobile Graphical User Interface (GUI) agents aim to autonomously complete tasks within or across apps based on user instructions. While recent Multimodal Large Language Models (MLLMs) enable these agents to interpret UI screens and perform actions, existing agents remain fundamentally reactive. They reason over the current UI screen but lack a structured representation of the app navigation flow, limiting GUI agents’ ability to understand execution context, detect unexpected execution results, and recover from errors. We introduce Agent-SAMA, a state-aware multi-agent framework that models app execution as a Finite State Machine (FSM), treating UI screens as states and user actions as transitions. Agent-SAMA implements four specialized agents that collaboratively construct and use FSMs in real time to guide task planning, execution verification, and recovery. We evaluate Agent-SAMA on two types of benchmarks: cross-app (Mobile-Eval-E, SPA-Bench) and mostly single-app (AndroidWorld). On Mobile-Eval-E, Agent-SAMA achieves an 84.0% success rate and a 71.9% recovery rate. On SPA-Bench, it reaches an 80.0% success rate with a 66.7% recovery rate. Compared to prior methods, Agent-SAMA improves task success by up to 12% and recovery success by 13.8%. On AndroidWorld, Agent-SAMA achieves a 63.7% success rate, outperforming the baselines. Our results demonstrate that structured state modeling enhances robustness and can serve as a lightweight, model-agnostic memory layer for future GUI agents.
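To make the FSM idea concrete, the Python sketch below models UI screens as states and user actions as transitions, then uses the recorded graph for two of the purposes named in the abstract: flagging an unexpected execution result and finding a recovery path. It is a minimal illustration, not Agent-SAMA's code: the class `AppFSM`, its method names, and the screen and action names are hypothetical, and the four specialized agents are not modelled.

```python
from collections import deque


class AppFSM:
    """App navigation graph: UI screens are states, user actions are transitions."""

    def __init__(self):
        # Maps (screen, action) to the screen that action led to.
        self.transitions = {}

    def record(self, state, action, next_state):
        """Store a transition observed during execution."""
        self.transitions[(state, action)] = next_state

    def is_unexpected(self, state, action, observed):
        """True if a known transition led somewhere other than recorded."""
        expected = self.transitions.get((state, action))
        return expected is not None and expected != observed

    def path(self, start, goal):
        """Shortest action sequence from start to goal over known transitions.

        Breadth-first search; returns None if goal is unreachable.
        """
        queue = deque([(start, [])])
        seen = {start}
        while queue:
            state, actions = queue.popleft()
            if state == goal:
                return actions
            for (src, action), dst in self.transitions.items():
                if src == state and dst not in seen:
                    seen.add(dst)
                    queue.append((dst, actions + [action]))
        return None


# Hypothetical screens and actions, built up as the agent explores.
fsm = AppFSM()
fsm.record("home", "tap_settings", "settings")
fsm.record("settings", "tap_wifi", "wifi")
fsm.record("settings", "back", "home")
fsm.record("wifi", "back", "settings")

# Verification: tapping Settings landed on an ad popup instead.
popup_detected = fsm.is_unexpected("home", "tap_settings", "ad_popup")

# Recovery: find a way from the Wi-Fi screen back to the home screen.
recovery = fsm.path("wifi", "home")
```

Because the graph stores only screen identifiers and action labels, it works as a small memory structure that does not depend on which model produced the actions.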

[271] MAGIC: Multi-Agent Argumentation and Grammar Integrated Critiquer

Joaquín Jordán, Xavier Yin, Melissa Fabros, Gireeja Ranade, Narges Norouzi

Main category: cs.AI

TL;DR: MAGIC is a multi-agent framework for automated essay scoring and feedback that achieves strong performance on college-level GRE essays while providing interpretable feedback through specialized agents.

Motivation: Existing AES/AEF systems focus too much on scoring accuracy over feedback quality and are mainly evaluated on pre-secondary school writing, lacking support for college-level assessment.

Method: Uses five specialized agents to evaluate prompt adherence, persuasiveness, organization, vocabulary, and grammar for holistic scoring and detailed feedback generation.

Result: Achieves substantial to near-perfect scoring agreement with humans on GRE data, outperforming baseline LLM models while providing enhanced interpretability. Also shows strong feedback quality and naturalness compared to human feedback.

Conclusion: MAGIC demonstrates an effective multi-agent approach for college-level essay assessment, balancing scoring accuracy with high-quality interpretable feedback.

Abstract: Automated Essay Scoring (AES) and Automatic Essay Feedback (AEF) systems aim to reduce the workload of human raters in educational assessment. However, most existing systems prioritize numerical scoring accuracy over feedback quality and are primarily evaluated on pre-secondary school level writing. This paper presents Multi-Agent Argumentation and Grammar Integrated Critiquer (MAGIC), a framework using five specialized agents to evaluate prompt adherence, persuasiveness, organization, vocabulary, and grammar for both holistic scoring and detailed feedback generation. To support evaluation at the college level, we collated a dataset of Graduate Record Examination (GRE) practice essays with expert-evaluated scores and feedback. MAGIC achieves substantial to near-perfect scoring agreement with humans on the GRE data, outperforming baseline LLM models while providing enhanced interpretability through its multi-agent approach. We also compare MAGIC’s feedback generation capabilities against ground truth human feedback and baseline models, finding that MAGIC achieves strong feedback quality and naturalness.

[272] Best-Effort Policies for Robust Markov Decision Processes

Alessandro Abate, Thom Badings, Giuseppe De Giacomo, Francesco Fabiano

Main category: cs.AI

TL;DR: The paper proposes optimal robust best-effort (ORBE) policies for robust MDPs, which maximize worst-case expected return while also achieving maximal expected return under non-adversarial transition probabilities, serving as principled tie-breakers among optimal robust policies.

Motivation: Standard robust MDPs compute policies that maximize expected return under adversarial transition probabilities, but there can be multiple optimal robust policies that perform equally in worst-case but differently under non-adversarial scenarios. This creates ambiguity in policy selection.

Method: Proposes ORBE policies that combine worst-case optimization with best-effort performance under non-adversarial conditions. Characterizes ORBE policy structure and presents an algorithm to compute them with manageable overhead compared to standard robust value iteration.

Result: Proves ORBE policies always exist, provides theoretical characterization of their structure, and demonstrates through numerical experiments that the approach is feasible and computationally manageable.

Conclusion: ORBE policies offer a principled tie-breaking mechanism among optimal robust policies in RMDPs, ensuring both worst-case robustness and good performance under more favorable conditions, with practical computational feasibility.

Abstract: We study the common generalization of Markov decision processes (MDPs) with sets of transition probabilities, known as robust MDPs (RMDPs). A standard goal in RMDPs is to compute a policy that maximizes the expected return under an adversarial choice of the transition probabilities. If the uncertainty in the probabilities is independent between the states, known as s-rectangularity, such optimal robust policies can be computed efficiently using robust value iteration. However, there might still be multiple optimal robust policies, which, while equivalent with respect to the worst-case, reflect different expected returns under non-adversarial choices of the transition probabilities. Hence, we propose a refined policy selection criterion for RMDPs, drawing inspiration from the notions of dominance and best-effort in game theory. Instead of seeking a policy that only maximizes the worst-case expected return, we additionally require the policy to achieve a maximal expected return under different (i.e., not fully adversarial) transition probabilities. We call such a policy an optimal robust best-effort (ORBE) policy. We prove that ORBE policies always exist, characterize their structure, and present an algorithm to compute them with a manageable overhead compared to standard robust value iteration. ORBE policies offer a principled tie-breaker among optimal robust policies. Numerical experiments show the feasibility of our approach.
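
The tie-breaking idea can be illustrated with a small NumPy sketch. It is not the paper's ORBE algorithm. It assumes a finite list of candidate transition kernels with the adversary choosing independently per state-action pair (a special case of s-rectangularity), and it breaks ties lexicographically using one hypothetical nominal kernel. All function names and the toy numbers are illustrative.

```python
import numpy as np


def robust_value_iteration(rewards, models, gamma=0.9, tol=1e-10):
    """rewards: (S, A); models: (K, S, A, S) candidate kernels.

    The adversary picks the worst of the K candidates for each (s, a).
    Returns the robust values and the robust Q-table.
    """
    v = np.zeros(rewards.shape[0])
    while True:
        q = rewards[None] + gamma * (models @ v)  # shape (K, S, A)
        q_robust = q.min(axis=0)                  # worst case per (s, a)
        v_new = q_robust.max(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q_robust
        v = v_new


def tie_break_nominal(rewards, nominal, q_robust, gamma=0.9, tol=1e-10, eps=1e-8):
    """Among robust-optimal actions, pick those that are best under `nominal` (S, A, S)."""
    allowed = q_robust >= q_robust.max(axis=1, keepdims=True) - eps
    v = np.zeros(rewards.shape[0])
    while True:
        q = np.where(allowed, rewards + gamma * (nominal @ v), -np.inf)
        v_new = q.max(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            return q.argmax(axis=1)
        v = v_new


# Toy RMDP: state 1 is absorbing with zero reward. In state 0, action 0 always
# moves to state 1; action 1 stays in state 0 with probability 0 or 0.5.
rewards = np.array([[1.0, 1.0], [0.0, 0.0]])
worst = np.array([[[0, 1], [0, 1]], [[0, 1], [0, 1]]], dtype=float)
kind = np.array([[[0, 1], [0.5, 0.5]], [[0, 1], [0, 1]]], dtype=float)
models = np.stack([worst, kind])

values, q_robust = robust_value_iteration(rewards, models)
policy = tie_break_nominal(rewards, kind, q_robust)
```

In this toy instance both actions in state 0 have the same worst-case value, so plain robust value iteration cannot distinguish them, while the nominal tie-break prefers action 1, which does strictly better whenever the adversary is not fully adversarial.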

[273] Enabling MoE on the Edge via Importance-Driven Expert Scheduling

Guoying Zhu, Meng Li, Haipeng Dai, Xuechen Liu, Weijun Wang, Keran Li, Jun xiao, Ligeng Chen, Wei Wang

Main category: cs.AI

TL;DR: A novel expert offloading approach for Mixture of Experts models that uses expert importance to guide substitution decisions, reducing memory usage and PCIe overhead while maintaining accuracy.

Motivation: Deploying Mixture of Experts models on consumer-grade edge hardware is constrained by limited device memory, requiring efficient expert offloading strategies that go beyond traditional scheduling approaches.

Method: Leverages expert importance to substitute low-importance activated experts with functionally similar ones already cached in GPU memory, plus a scheduling policy that maximizes GPU-cached expert reuse ratio.

Result: Achieves 48% lower decoding latency with over 60% expert cache hit rate while maintaining nearly lossless accuracy.

Conclusion: The approach effectively reduces memory usage and data transfer while eliminating PCIe overhead, making MoE deployment on edge hardware more practical.

Abstract: The Mixture of Experts (MoE) architecture has emerged as a key technique for scaling Large Language Models by activating only a subset of experts per query. Deploying MoE on consumer-grade edge hardware, however, is constrained by limited device memory, making dynamic expert offloading essential. Unlike prior work that treats offloading purely as a scheduling problem, we leverage expert importance to guide decisions, substituting low-importance activated experts with functionally similar ones already cached in GPU memory, thereby preserving accuracy. As a result, this design reduces memory usage and data transfer, while largely eliminating PCIe overhead. In addition, we introduce a scheduling policy that maximizes the reuse ratio of GPU-cached experts, further boosting efficiency. Extensive evaluations show that our approach delivers 48% lower decoding latency with over 60% expert cache hit rate, while maintaining nearly lossless accuracy.
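
The substitution step can be sketched in a few lines of Python. The abstract does not say how importance or functional similarity is measured, so this sketch assumes the router's gate weight as the importance signal, a precomputed pairwise similarity table, and a fixed threshold. The function name `plan_experts` and all parameters are hypothetical.

```python
def plan_experts(routed, cached, similarity, importance_threshold=0.2):
    """Decide which experts to run for one token.

    routed: dict expert_id -> gate weight (assumed importance proxy).
    cached: set of expert ids already resident in GPU memory.
    similarity: dict (expert_id, cached_id) -> similarity score (assumed precomputed).
    Returns (to_run, to_fetch): weights per executed expert, and experts to load.
    """
    to_run, to_fetch = {}, []
    for expert, weight in routed.items():
        if expert in cached:
            target = expert  # cache hit: run as routed
        elif weight < importance_threshold and cached:
            # Low-importance miss: reuse the most similar cached expert instead.
            target = max(sorted(cached), key=lambda c: similarity.get((expert, c), 0.0))
        else:
            # Important miss: pay the transfer cost and load the real expert.
            target = expert
            to_fetch.append(expert)
        to_run[target] = to_run.get(target, 0.0) + weight
    return to_run, to_fetch
```

For example, with `routed = {3: 0.7, 5: 0.1}` and experts 1 and 2 cached, expert 3 is fetched because its weight is high, while expert 5 is replaced by whichever cached expert the similarity table ranks closest to it.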

[274] MedLA: A Logic-Driven Multi-Agent Framework for Complex Medical Reasoning with Large Language Models

Siqi Ma, Jiajie Huang, Fan Zhang, Jinlin Wu, Yue Shen, Guohui Fan, Zhu Zhang, Zelin Zang

Main category: cs.AI

TL;DR: MedLA is a logic-driven multi-agent framework that uses syllogistic triads to organize reasoning into explicit logical trees, enabling transparent medical reasoning through multi-round discussions and contradiction resolution.

Motivation: Existing multi-agent approaches for medical QA rely on fixed roles or shallow interactions, limiting their ability to detect and resolve fine-grained logical inconsistencies in complex medical reasoning.

Method: Each agent organizes reasoning into explicit logical trees based on syllogistic triads (major premise, minor premise, conclusion). Agents engage in multi-round, graph-guided discussions to compare and refine logic trees through error correction and contradiction resolution.

Result: MedLA consistently outperforms static role-based systems and single-agent baselines on MedDDx and standard medical QA benchmarks, achieving state-of-the-art performance across both open-source and commercial LLM backbones.

Conclusion: MedLA provides a generalizable paradigm for trustworthy medical reasoning that scales effectively across different LLM architectures while enabling transparent inference and premise-level alignment.

Abstract: Answering complex medical questions requires not only domain expertise and patient-specific information, but also structured and multi-perspective reasoning. Existing multi-agent approaches often rely on fixed roles or shallow interaction prompts, limiting their ability to detect and resolve fine-grained logical inconsistencies. To address this, we propose MedLA, a logic-driven multi-agent framework built on large language models. Each agent organizes its reasoning process into an explicit logical tree based on syllogistic triads (major premise, minor premise, and conclusion), enabling transparent inference and premise-level alignment. Agents engage in a multi-round, graph-guided discussion to compare and iteratively refine their logic trees, achieving consensus through error correction and contradiction resolution. We demonstrate that MedLA consistently outperforms both static role-based systems and single-agent baselines on challenging benchmarks such as MedDDx and standard medical QA tasks. Furthermore, MedLA scales effectively across both open-source and commercial LLM backbones, achieving state-of-the-art performance and offering a generalizable paradigm for trustworthy medical reasoning.

[275] SRNN: Spatiotemporal Relational Neural Network for Intuitive Physics Understanding

Fei Yang

Main category: cs.AI

TL;DR: SRNN is a brain-inspired neural network that uses unified spatiotemporal representations and Hebbian learning to understand intuitive physics, achieving competitive performance on CLEVRER benchmark while providing interpretable error analysis.

Motivation: To bridge the gap between human intuitive physics understanding and machine capabilities by shifting towards brain-inspired computational principles.

Method: Introduces Spatiotemporal Relational Neural Network (SRNN) with unified neural representations for object attributes, relations, and timeline, using Hebbian “Fire Together, Wire Together” mechanism across dedicated What and How pathways.

Result: Achieves competitive performance on CLEVRER benchmark, confirms capability to represent essential spatiotemporal relations, and enables precise error root cause analysis through white-box nature.

Conclusion: Provides proof-of-concept that key principles of biological intelligence can be translated into engineered systems for intuitive physics understanding in constrained environments.

Abstract: Human prowess in intuitive physics remains unmatched by machines. To bridge this gap, we argue for a fundamental shift towards brain-inspired computational principles. This paper introduces the Spatiotemporal Relational Neural Network (SRNN), a model that establishes a unified neural representation for object attributes, relations, and timeline, with computations governed by a Hebbian “Fire Together, Wire Together” mechanism across dedicated What and How pathways. This unified representation is directly used to generate structured linguistic descriptions of the visual scene, bridging perception and language within a shared neural substrate. On the CLEVRER benchmark, SRNN achieves competitive performance, thereby confirming its capability to represent essential spatiotemporal relations from the visual stream. Cognitive ablation analysis further reveals a benchmark bias, outlining a path for a more holistic evaluation. Finally, the white-box nature of SRNN enables precise pinpointing of error root causes. Our work provides a proof-of-concept that confirms the viability of translating key principles of biological intelligence into engineered systems for intuitive physics understanding in constrained environments.

[276] TimeFlow: Towards Stochastic-Aware and Efficient Time Series Generation via Flow Matching Modeling

He Panjing, Cheng Mingyue, Li Li, Zhang XiaoHan

Main category: cs.AI

TL;DR: TimeFlow is a novel SDE-based flow matching framework for time series generation that integrates stochastic differential equations to better capture temporal randomness while maintaining efficiency.

Motivation: Existing methods face challenges: diffusion models are computationally inefficient, while conventional flow matching with ODEs fails to explicitly capture the stochasticity inherent in real-world time series data.

Method: Proposes TimeFlow framework using SDE-based flow matching with encoder-only architecture, component-wise decomposed velocity field, and augmented optimization with additional stochastic term.

Result: Extensive experiments show TimeFlow consistently outperforms strong baselines in generation quality, diversity, and efficiency across diverse datasets.

Conclusion: TimeFlow provides a flexible and general framework for both unconditional and conditional time series generation, effectively balancing stochastic modeling with computational efficiency.

Abstract: Generating high-quality time series data has emerged as a critical research topic due to its broad utility in supporting downstream time series mining tasks. A major challenge lies in modeling the intrinsic stochasticity of temporal dynamics, as real-world sequences often exhibit random fluctuations and localized variations. While diffusion models have achieved remarkable success, their generation process is computationally inefficient, often requiring hundreds to thousands of expensive function evaluations per sample. Flow matching has emerged as a more efficient paradigm, yet its conventional ordinary differential equation (ODE)-based formulation fails to explicitly capture stochasticity, thereby limiting the fidelity of generated sequences. By contrast, stochastic differential equations (SDEs) are naturally suited for modeling randomness and uncertainty. Motivated by these insights, we propose TimeFlow, a novel SDE-based flow matching framework that integrates an encoder-only architecture. Specifically, we design a component-wise decomposed velocity field to capture the multi-faceted structure of time series and augment the vanilla flow-matching optimization with an additional stochastic term to enhance representational expressiveness. TimeFlow is flexible and general, supporting both unconditional and conditional generation tasks within a unified framework. Extensive experiments across diverse datasets demonstrate that our model consistently outperforms strong baselines in generation quality, diversity, and efficiency.
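
The contrast between ODE-based and SDE-based sampling can be made concrete with a short NumPy sketch. It does not reproduce TimeFlow's model or training objective: `velocity` is a hypothetical hand-written stand-in for a learned velocity field, and the constant noise scale `sigma` is an assumption. With `sigma=0` the sampler is a plain Euler ODE integrator; with `sigma>0` it becomes an Euler-Maruyama SDE integrator.

```python
import numpy as np


def velocity(x, t):
    """Hypothetical stand-in for a learned velocity field: drift toward a sine wave."""
    target = np.sin(np.linspace(0.0, 2.0 * np.pi, x.shape[-1]))
    return target - x


def sample(x0, n_steps=50, sigma=0.0, rng=None):
    """Integrate from t=0 to t=1 with Euler (sigma=0) or Euler-Maruyama (sigma>0)."""
    if rng is None:
        rng = np.random.default_rng()
    x = np.array(x0, dtype=float)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        x = x + velocity(x, t) * dt  # deterministic drift step
        if sigma > 0:
            # Brownian increment: standard deviation scales with sqrt(dt)
            x = x + sigma * np.sqrt(dt) * rng.standard_normal(x.shape)
    return x
```

Starting from the same `x0`, two ODE runs return identical series, whereas SDE runs with different random generators return different series around the same drift, which is the kind of sample-level randomness the abstract argues an ODE formulation cannot express.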

Wenhan Yu, Xinbo Lin, Lanxin Ni, Jinhua Cheng, Lei Sha

Main category: cs.AI

TL;DR: MSLR is the first Chinese multi-step legal reasoning dataset using IRAC framework, created via Human-LLM collaborative annotation, showing LLMs struggle with complex legal reasoning but benefit from Self-Initiated Chain-of-Thought prompts.

Motivation: Existing legal benchmarks often conflate factual recall with genuine inference, fragment reasoning processes, and overlook reasoning quality, limiting proper evaluation of LLMs' legal reasoning capabilities.

Method: Created MSLR dataset using IRAC framework from real judicial documents, developed scalable Human-LLM collaborative annotation pipeline for step-level reasoning annotations, and evaluated multiple LLMs with Self-Initiated Chain-of-Thought prompts.

Result: LLMs show only moderate performance on MSLR, highlighting challenges in complex legal reasoning. Self-Initiated Chain-of-Thought prompts improve reasoning coherence and quality, outperforming human-designed prompts.

Conclusion: MSLR advances LLM reasoning and Chain-of-Thought strategies, providing open resources for future legal reasoning research while demonstrating the difficulty of adapting LLMs to complex legal inference tasks.

Abstract: Large language models (LLMs) have demonstrated strong reasoning abilities across specialized domains, motivating research into their application to legal reasoning. However, existing legal benchmarks often conflate factual recall with genuine inference, fragment the reasoning process, and overlook the quality of reasoning. To address these limitations, we introduce MSLR, the first Chinese multi-step legal reasoning dataset grounded in real-world judicial decision making. MSLR adopts the IRAC framework (Issue, Rule, Application, Conclusion) to model structured expert reasoning from official legal documents. In addition, we design a scalable Human-LLM collaborative annotation pipeline that efficiently produces fine-grained step-level reasoning annotations and provides a reusable methodological framework for multi-step reasoning datasets. Evaluation of multiple LLMs on MSLR shows only moderate performance, highlighting the challenges of adapting to complex legal reasoning. Further experiments demonstrate that Self-Initiated Chain-of-Thought prompts generated by models autonomously improve reasoning coherence and quality, outperforming human-designed prompts. MSLR contributes to advancing LLM reasoning and Chain-of-Thought strategies and offers open resources for future research. The dataset and code are available at https://github.com/yuwenhan07/MSLR-Bench and https://law.sjtu.edu.cn/flszyjzx/index.html.

[278] Combining LLM Semantic Reasoning with GNN Structural Modeling for Multi-View Multi-Label Feature Selection

Zhiqi Chen, Yuzhou Liu, Jiarui Liu, Wanfu Gao

Main category: cs.AI

TL;DR: Proposes a novel MVMLFS method combining LLM semantic reasoning with GNN structural modeling to jointly leverage statistical and semantic information for feature selection in multi-view multi-label learning.

Motivation: Existing MVMLFS methods mainly focus on statistical information but neglect semantic information, which is crucial for understanding complex relationships in high-dimensional, multimodal data from domains like social media and bioinformatics.

Method: Three components: (1) LLM as evaluation agent to assess semantic relevance among features, views, and labels; (2) Semantic-aware heterogeneous graph with semantic and statistical graphs; (3) Lightweight GAT to learn node embeddings as feature saliency scores.

Result: Experimental results on multiple benchmark datasets show superiority over state-of-the-art baselines, with effectiveness even on small-scale datasets, demonstrating robustness, flexibility, and generalization ability.

Conclusion: The proposed method successfully integrates semantic and statistical information through LLMs and GNNs, providing an effective and generalizable solution for MVMLFS that works well across different dataset scales.

Abstract: Multi-view multi-label feature selection aims to identify informative features from heterogeneous views, where each sample is associated with multiple interdependent labels. This problem is particularly important in machine learning involving high-dimensional, multimodal data such as social media, bioinformatics or recommendation systems. Existing Multi-View Multi-Label Feature Selection (MVMLFS) methods mainly focus on analyzing statistical information of data, but seldom consider semantic information. In this paper, we aim to use these two types of information jointly and propose a method that combines Large Language Model (LLM) semantic reasoning with Graph Neural Network (GNN) structural modeling for MVMLFS. Specifically, the method consists of three main components. (1) An LLM is first used as an evaluation agent to assess the latent semantic relevance among feature, view, and label descriptions. (2) A semantic-aware heterogeneous graph with two levels is designed to represent relations among features, views and labels: one is a semantic graph representing semantic relations, and the other is a statistical graph. (3) A lightweight Graph Attention Network (GAT) is applied to learn node embeddings in the heterogeneous graph as feature saliency scores for ranking and selection. Experimental results on multiple benchmark datasets demonstrate the superiority of our method over state-of-the-art baselines, and it remains effective when applied to small-scale datasets, showcasing its robustness, flexibility, and generalization ability.

[279] Boosting In-Silicon Directed Evolution with Fine-Tuned Protein Language Model and Tree Search

Yaodong Yang, Yang Wang, Jinpeng Li, Pei Guo, Da Han, Guangyong Chen, Pheng-Ann Heng

Main category: cs.AI

TL;DR: AlphaDE is a novel framework that integrates fine-tuned protein language models with Monte Carlo tree search to optimize protein sequences through evolutionary guidance, outperforming state-of-the-art methods.

Motivation: Current in-silico directed evolution algorithms focus on heuristic search strategies but overlook integrating protein language models with reinforcement learning to directly evolve proteins, missing the rich evolutionary patterns encoded in these models.

Method: 1) Fine-tunes pretrained protein language models using masked language modeling on homologous sequences to activate evolutionary plausibility; 2) Uses test-time inference based on Monte Carlo tree search to evolve proteins with guidance from the fine-tuned model.

Result: Extensive benchmark experiments show AlphaDE remarkably outperforms previous state-of-the-art methods even with few-shot fine-tuning. A case study demonstrates AlphaDE can condense the protein sequence space of avGFP through computational evolution.

Conclusion: AlphaDE successfully bridges the gap between protein language models and reinforcement learning for protein evolution, providing an effective framework that leverages evolutionary guidance from fine-tuned models to optimize protein sequences.

Abstract: Protein evolution through amino acid sequence mutations is a cornerstone of life sciences. While current in-silicon directed evolution algorithms largely focus on designing heuristic search strategies, they overlook how to integrate the transformative protein language models, which encode rich evolutionary patterns, with reinforcement learning to learn to directly evolve proteins. To bridge this gap, we propose AlphaDE, a novel framework to optimize protein sequences by harnessing the innovative paradigms of large language models such as fine-tuning and test-time inference. First, AlphaDE fine-tunes pretrained protein language models using masked language modeling on homologous protein sequences to activate the evolutionary plausibility for the protein class of interest. Second, AlphaDE introduces test-time inference based on Monte Carlo tree search, which effectively evolves proteins with evolutionary guidance from the fine-tuned protein language model. Extensive benchmark experiments show that AlphaDE remarkably outperforms previous state-of-the-art methods even with few-shot fine-tuning. A further case study demonstrates that AlphaDE supports condensing the protein sequence space of avGFP through computational evolution.

[280] Intelligent Collaborative Optimization for Rubber Tyre Film Production Based on Multi-path Differentiated Clipping Proximal Policy Optimization

Yinghao Ruan, Wei Pang, Shuaihao Liu, Huili Yang, Leyi Han, Xinghui Dong

Main category: cs.AI

TL;DR: A deep reinforcement learning algorithm called MPD-PPO is introduced for high-dimensional multi-objective optimization in smart tyre manufacturing, addressing dynamic production demands and complex subsystem coordination.

Motivation: Traditional centralized scheduling and inflexible production lines in rubber tyre industry struggle with dynamic demands and complex subsystem interactions, requiring effective coordination of multiple subsystems.

Method: Multi-path Differentiated Clipping Proximal Policy Optimization (MPD-PPO) with multi-branch policy architecture and differentiated gradient clipping constraints for stable high-dimensional policy updates.

Result: MPD-PPO shows substantial improvements in tuning accuracy and operational efficiency in width and thickness control experiments for rubber tyre film production.

Conclusion: The framework successfully addresses high dimensionality, multi-objective trade-offs, and dynamic adaptation, delivering enhanced performance and production stability for real-time industrial deployment in tyre manufacturing.

Abstract: The advent of smart manufacturing is addressing the limitations of traditional centralized scheduling and inflexible production line configurations in the rubber tyre industry, especially in terms of coping with dynamic production demands. Contemporary tyre manufacturing systems form complex networks of tightly coupled subsystems with pronounced nonlinear interactions and emergent dynamics. This complexity makes the effective coordination of multiple subsystems an essential yet formidable task. For high-dimensional, multi-objective optimization problems in this domain, we introduce a deep reinforcement learning algorithm: Multi-path Differentiated Clipping Proximal Policy Optimization (MPD-PPO). This algorithm employs a multi-branch policy architecture with differentiated gradient clipping constraints to ensure stable and efficient high-dimensional policy updates. Validated through experiments on width and thickness control in rubber tyre film production, MPD-PPO demonstrates substantial improvements in both tuning accuracy and operational efficiency. The framework successfully tackles key challenges, including high dimensionality, multi-objective trade-offs, and dynamic adaptation, thus delivering enhanced performance and production stability for real-time industrial deployment in tyre manufacturing.
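
The abstract does not spell out how the clipping constraints differ across branches. One plausible reading, sketched below in NumPy, is that each policy branch gets its own clip range in the standard PPO clipped surrogate. This is an assumption for illustration, not the MPD-PPO formulation: the branch names, the per-branch epsilon values, and the summation of branch objectives are all hypothetical.

```python
import numpy as np


def clipped_surrogate(ratio, advantage, eps):
    """Standard PPO clipped surrogate, averaged over a batch."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return float(np.minimum(unclipped, clipped).mean())


def multi_branch_objective(ratios, advantages, eps_per_branch):
    """Sum of per-branch surrogates, each with its own (assumed) clip range."""
    return sum(
        clipped_surrogate(ratios[name], advantages[name], eps)
        for name, eps in eps_per_branch.items()
    )


# Hypothetical branches: a tight clip range for one, a looser one for the other.
ratios = {"width": np.array([1.5]), "thickness": np.array([1.5])}
advantages = {"width": np.array([1.0]), "thickness": np.array([1.0])}
objective = multi_branch_objective(
    ratios, advantages, {"width": 0.1, "thickness": 0.3}
)
```

With identical probability ratios and advantages, the tighter branch contributes 1.1 and the looser branch 1.3, so the size of each branch's policy update is bounded separately.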

[281] Incremental Maintenance of DatalogMTL Materialisations

Kaiyue Zhao, Dingqi Chen, Shaoyu Wang, Pan Hu

Main category: cs.AI

TL;DR: DRedMTL is an incremental reasoning algorithm for DatalogMTL that efficiently handles dynamic updates by extending the classical DRed algorithm to work with periodic interval representations.

Motivation: Existing DatalogMTL reasoning approaches lack support for efficient dynamic updates, which is crucial for real-world applications with frequent data updates.

Method: Extends the classical DRed algorithm with specifically designed operators to handle periodic representations of DatalogMTL materialisations with bounded intervals.

Result: Experimental results show DRedMTL often significantly outperforms rematerialisation, sometimes by orders of magnitude on publicly available datasets.

Conclusion: DRedMTL provides an efficient incremental reasoning solution for DatalogMTL that addresses the limitations of existing approaches in handling dynamic updates.

Abstract: DatalogMTL extends the classical Datalog language with metric temporal logic (MTL), enabling expressive reasoning over temporal data. While existing reasoning approaches, such as materialisation-based and automata-based methods, offer soundness and completeness, they lack support for efficiently handling dynamic updates, a crucial requirement for real-world applications that involve frequent data updates. In this work, we propose DRedMTL, an incremental reasoning algorithm for DatalogMTL with bounded intervals. Our algorithm builds upon the classical DRed algorithm, which incrementally updates the materialisation of a Datalog program. Unlike a Datalog materialisation, which is in essence a finite set of facts, a DatalogMTL materialisation has to be represented as a finite set of facts plus periodic intervals indicating how the full materialisation can be constructed through unfolding. To cope with this, our algorithm is equipped with specifically designed operators to efficiently handle such periodic representations of DatalogMTL materialisations. We have implemented this approach and tested it on several publicly available datasets. Experimental results show that DRedMTL often significantly outperforms rematerialisation, sometimes by orders of magnitude.

[282] Reward and Guidance through Rubrics: Promoting Exploration to Improve Multi-Domain Reasoning

Baolong Bi, Shenghua Liu, Yiwei Wang, Siqian Tong, Lingrui Mei, Yuyao Ge, Yilong Xu, Jiafeng Guo, Xueqi Cheng

Main category: cs.AI

TL;DR: RGR-GRPO is a rubric-driven RL framework that uses rubrics to provide fine-grained rewards and offline guidance for multi-domain reasoning, outperforming existing RL methods across 14 benchmarks.

Motivation: Existing RL methods for LLMs focus on single domains with verifiable rewards and use purely online RL, which restricts exploration space and limits reasoning performance.

Method: Proposes RGR-GRPO framework that leverages rubrics to provide dense reward signals and offline guidance during GRPO training, enabling larger solution space exploration.

Result: Achieves average improvements of +7.0% (math), +5.4% (physics), +8.4% (chemistry), +6.6% (general reasoning) over verifiable online RL baseline across 14 multi-domain benchmarks.

Conclusion: RGR-GRPO enables sustained exploration, stable entropy during training, and superior pass@k performance, effectively breaking through existing performance bottlenecks in multi-domain reasoning.

Abstract: Recent advances in reinforcement learning (RL) have significantly improved the complex reasoning capabilities of large language models (LLMs). Despite these successes, existing methods mainly focus on single-domain RL (e.g., mathematics) with verifiable rewards (RLVR), and their reliance on purely online RL frameworks restricts the exploration space, thereby limiting reasoning performance. In this paper, we address these limitations by leveraging rubrics to provide both fine-grained reward signals and offline guidance. We propose RGR-GRPO (Reward and Guidance through Rubrics), a rubric-driven RL framework for multi-domain reasoning. RGR-GRPO enables LLMs to receive dense and informative rewards while exploring a larger solution space during GRPO training. Extensive experiments across 14 benchmarks spanning multiple domains demonstrate that RGR-GRPO consistently outperforms RL methods that rely solely on alternative reward schemes or offline guidance. Compared with the verifiable online RL baseline, RGR-GRPO achieves average improvements of +7.0%, +5.4%, +8.4%, and +6.6% on mathematics, physics, chemistry, and general reasoning tasks, respectively. Notably, RGR-GRPO maintains stable entropy fluctuations during off-policy training and achieves superior pass@k performance, reflecting sustained exploration and effective breakthrough beyond existing performance bottlenecks.
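
A short NumPy sketch shows why a dense rubric reward matters for GRPO. The group-relative advantage (reward minus group mean, divided by group standard deviation) is the standard GRPO normalization. The rubric scoring below, a weighted fraction of satisfied criteria, is a simplifying assumption and not necessarily how RGR-GRPO scores responses. Function names are hypothetical.

```python
import numpy as np


def rubric_reward(checks, weights):
    """Weighted fraction of rubric criteria a response satisfies (assumed scoring rule)."""
    checks = np.asarray(checks, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float((checks * weights).sum() / weights.sum())


def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one group of sampled responses."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)


# Four sampled responses to one prompt, none fully correct.
binary = [0.0, 0.0, 0.0, 0.0]  # outcome-only reward: every response fails
rubric = [
    rubric_reward(c, [1, 1, 1, 1])
    for c in ([1, 0, 0, 0], [1, 1, 0, 0], [1, 1, 0, 0], [1, 1, 1, 0])
]

adv_binary = group_advantages(binary)
adv_rubric = group_advantages(rubric)
```

When every response in a group fails an outcome-only check, all advantages are zero and the group contributes no learning signal. Rubric scores still separate the partially correct responses, so the same group yields nonzero advantages.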

[283] Do Large Language Models (LLMs) Understand Chronology?

Pattaraphon Kenny Wongchamcharoen, Paul Glasserman

Main category: cs.AI

TL;DR: LLMs struggle with chronological reasoning tasks, especially as sequence length increases, but explicit reasoning allocation (like GPT-5 with medium/high effort) significantly improves performance on ordering and anachronism detection tasks.

Motivation: To test whether LLMs fundamentally understand chronology, which is crucial for their application in finance and economics where look-ahead bias is a concern.

Method: Evaluated multiple LLMs (GPT-4.1, Claude-3.7 Sonnet with/without Extended Thinking, GPT-5) on chronological ordering tasks including: (1) basic chronological ordering, (2) conditional sorting (filter then order), and (3) anachronism detection, across different reasoning-effort settings.

Result: Exact match rates drop sharply with longer sequences, though rank correlations remain high. Most conditional sorting failures come from filtering rather than ordering. GPT-5 and Claude-3.7 Sonnet with Extended Thinking perform best. Anachronism detection is easiest but performance declines with complex timelines.

Conclusion: Explicit reasoning budget allocation helps LLMs handle chronological tasks, with GPT-5 at medium/high reasoning effort achieving perfect performance. Current LLMs have limitations on chronological reasoning that affect their real-time financial applications.

Abstract: Large language models (LLMs) are increasingly used in finance and economics, where prompt-based attempts against look-ahead bias implicitly assume that models understand chronology. We test this fundamental question with a series of chronological ordering tasks of increasing complexity over facts the model already knows from pre-training. Our tasks cover (1) chronological ordering, (2) conditional sorting (filter, then order), and (3) anachronism detection. We evaluate GPT-4.1, Claude-3.7 Sonnet, with and without Extended Thinking (ET), and GPT-5 across multiple reasoning-effort settings. Across models, the exact-match rate drops sharply as sequences lengthen, even while rank correlations stay high, as LLMs largely preserve local order but struggle to maintain a single globally consistent timeline. In conditional sorting, most failures stem from the filtering step rather than the ordering step, but GPT-5 and Claude-3.7 Sonnet with Extended Thinking outshine normal models significantly. Lastly, anachronism detection is found to be the easiest task for the LLMs, but performance still declines with increasingly overlapping timelines or entities. Overall, our main contribution is showing that allocating explicit reasoning budget helps with chronological ordering, with GPT-5 at medium/high reasoning effort achieving flawless ordering at all lengths and perfect conditional sorting (both self-filtered and given-subset), whereas low/minimal effort degrades with longer lists, mirroring earlier models. Our findings delineate limits of current LLMs on chronological tasks, providing insights into task complexity, and demonstrate scenarios in which reasoning helps. These patterns are important for the real-time application of LLMs in finance. We release all code and evaluation templates to support full reproducibility.
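
The gap between exact match and rank correlation is easy to reproduce with a small Python example. The abstract does not say which rank correlation is used, so Spearman's rho (for rankings without ties) is assumed here, and the ten-item sequence is made up for illustration.

```python
def exact_match(pred, gold):
    """1 only if the entire predicted order equals the reference order."""
    return pred == gold


def spearman_rho(pred, gold):
    """Spearman rank correlation between two orderings of the same distinct items."""
    n = len(gold)
    gold_rank = {item: i for i, item in enumerate(gold)}
    d2 = sum((i - gold_rank[item]) ** 2 for i, item in enumerate(pred))
    return 1.0 - 6.0 * d2 / (n * (n**2 - 1))


gold = list("ABCDEFGHIJ")  # true chronological order of ten events
pred = list("BACDFEGHIJ")  # two adjacent pairs swapped

print(exact_match(pred, gold))             # False
print(round(spearman_rho(pred, gold), 3))  # 0.976
```

Two local swaps are enough to fail the exact-match criterion while leaving the rank correlation near 1. Since longer lists offer more chances for such a slip, this is consistent with exact match falling sharply with length while rank correlation stays high.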

[284] When Words Change the Model: Sensitivity of LLMs for Constraint Programming Modelling

Alessio Pellegrino, Jacopo Mauro

Main category: cs.AI

TL;DR: LLMs can generate models for standard constraint problems but perform poorly when problems are rephrased or perturbed, suggesting their success comes from data contamination rather than genuine reasoning.

Motivation: To test whether LLMs’ apparent success in automatically generating constraint programming models derives from genuine reasoning or data contamination of standard benchmarks.

Method: Systematically rephrased and perturbed well-known CSPLib problems while preserving structure, then compared models generated by three representative LLMs across original and modified descriptions.

Result: LLMs produce syntactically valid models for original problems but performance drops sharply under contextual/linguistic variation, showing shallow understanding and sensitivity to wording.

Conclusion: LLMs’ model generation capabilities are fragile and dependent on training data exposure rather than deep reasoning, limiting their practical applicability for novel problems.

Abstract: One of the long-standing goals in optimisation and constraint programming is to describe a problem in natural language and automatically obtain an executable, efficient model. Large language models appear to bring this vision closer, showing impressive results in automatically generating models for classical benchmarks. However, much of this apparent success may derive from data contamination rather than genuine reasoning: many standard CP problems are likely included in the training data of these models. To examine this hypothesis, we systematically rephrased and perturbed a set of well-known CSPLib problems to preserve their structure while modifying their context and introducing misleading elements. We then compared the models produced by three representative LLMs across original and modified descriptions. Our qualitative analysis shows that while LLMs can produce syntactically valid and semantically plausible models, their performance drops sharply under contextual and linguistic variation, revealing shallow understanding and sensitivity to wording.

cs.SD

[285] OBHS: An Optimized Block Huffman Scheme for Real-Time Audio Compression

Muntahi Safwan Mahfi, Md. Manzurul Hasan, Gahangir Hossain

Main category: cs.SD

TL;DR: OBHS is a lossless audio compression algorithm using block-wise Huffman coding with canonical codes for real-time streaming, achieving up to 93.6% compression with linear time complexity.

Motivation: To develop an efficient lossless audio compression algorithm suitable for real-time streaming applications that balances high compression ratios with low computational demands.

Method: Partitions audio into fixed-size blocks, constructs optimal Huffman trees for each block, uses canonical code representation for efficient storage, and implements intelligent fallback mechanisms.

Result: Achieves compression ratios up to 93.6% for silence-rich audio and maintains competitive performance across various audio types including pink noise, tones, and real-world recordings.

Conclusion: OBHS effectively balances compression efficiency and computational requirements with linear time complexity, making it highly suitable for resource-constrained real-time audio streaming scenarios.

Abstract: In this paper, we introduce OBHS (Optimized Block Huffman Scheme), a novel lossless audio compression algorithm tailored for real-time streaming applications. OBHS leverages block-wise Huffman coding with canonical code representation and intelligent fallback mechanisms to achieve high compression ratios while maintaining low computational complexity. Our algorithm partitions audio data into fixed-size blocks, constructs optimal Huffman trees for each block, and employs canonical codes for efficient storage and transmission. Experimental results demonstrate that OBHS attains compression ratios of up to 93.6% for silence-rich audio and maintains competitive performance across various audio types, including pink noise, tones, and real-world recordings. With a linear time complexity of O(n) for n audio samples, OBHS effectively balances compression efficiency and computational demands, making it highly suitable for resource-constrained real-time audio streaming scenarios.
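The block-wise scheme described above can be sketched roughly as follows. This is not the OBHS implementation. The function names, the 8-bit raw sample size, and the "store raw when Huffman does not help" fallback rule are all assumptions, and the cost of storing the code-length table is ignored.

```python
import heapq
from collections import Counter

def huffman_code_lengths(block):
    """Return {symbol: code length in bits} for one block."""
    freq = Counter(block)
    if len(freq) == 1:  # single-symbol block: give it a 1-bit code
        return {next(iter(freq)): 1}
    # Heap entries: (weight, unique tie-breaker, symbols in this subtree)
    heap = [(w, i, [s]) for i, (s, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freq, 0)
    counter = len(heap)
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:  # each merge pushes these symbols one level deeper
            lengths[s] += 1
        heapq.heappush(heap, (w1 + w2, counter, s1 + s2))
        counter += 1
    return lengths

def canonical_codes(lengths):
    """Assign canonical codes: sort by (length, symbol) and count upward."""
    codes, code, prev_len = {}, 0, 0
    for sym, length in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= length - prev_len
        codes[sym] = format(code, f"0{length}b")
        code += 1
        prev_len = length
    return codes

def encode_block(block, raw_bits_per_sample=8):
    """Huffman-code one block, falling back to raw storage if it does not shrink."""
    lengths = huffman_code_lengths(block)
    codes = canonical_codes(lengths)
    bits = "".join(codes[s] for s in block)
    if len(bits) >= len(block) * raw_bits_per_sample:  # assumed fallback rule
        return "raw", list(block)
    return "huffman", (lengths, bits)

def decode_block(kind, payload):
    """Invert encode_block; only the code lengths are needed to rebuild the codes."""
    if kind == "raw":
        return list(payload)
    lengths, bits = payload
    lookup = {code: sym for sym, code in canonical_codes(lengths).items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in lookup:  # codes are prefix-free, so the first hit is correct
            out.append(lookup[current])
            current = ""
    return out

# A silence-rich block: mostly zeros with a few other sample values.
silence = [0] * 60 + [1, 2, 3, 4]
kind, payload = encode_block(silence)
print(kind, len(payload[1]), "bits instead of", len(silence) * 8)
```

Canonical codes matter because the decoder can rebuild the full code table from the code lengths alone, so only the lengths need to be stored per block. The tree construction here favors clarity over speed and is not meant to reflect the paper's O(n) bound.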

[286] IHearYou: Linking Acoustic Features to DSM-5 Depressive Behavior Indicators

Jonas Länzlinger, Katharina Müller, Bruno Rodrigues

Main category: cs.SD

TL;DR: IHearYou is an automated depression detection system that uses speech acoustics and passive sensing to link voice features with DSM-5 indicators for Major Depressive Disorder, running locally for privacy with explainable results.

Motivation: Current depression diagnosis relies on subjective self-reports and interviews that may not capture authentic behavior, creating a need for objective, automated detection methods.

Method: Uses passive voice sensing in household environments, extracts speech acoustic features, links them to DSM-5 indicators through a structured Linkage Framework, runs locally with privacy protection, and includes FDR correction and gender-stratified testing.

Result: Applied to DAIC-WOZ dataset showing directionally consistent feature-indicator associations, and TESS-based audio streaming experiment validates end-to-end feasibility with real-time throughput on commodity laptops.

Conclusion: Passive voice sensing can be transformed into explainable DSM-5 indicator scores, bridging the gap between black-box detection and clinically interpretable, on-device analysis for depression assessment.

Abstract: Depression affects millions of people worldwide, yet diagnosis still relies on subjective self-reports and interviews that may not capture authentic behavior. We present IHearYou, an approach to automated depression detection focused on speech acoustics. Using passive sensing in household environments, IHearYou extracts voice features and links them to DSM-5 (Diagnostic and Statistical Manual of Mental Disorders) indicators through a structured Linkage Framework instantiated for Major Depressive Disorder. The system runs locally to preserve privacy and includes a persistence schema and dashboard, presenting real-time throughput on a commodity laptop. To ensure reproducibility, we define a configuration-driven protocol with False Discovery Rate (FDR) correction and gender-stratified testing. Applied to the DAIC-WOZ dataset, this protocol reveals directionally consistent feature-indicator associations, while a TESS-based audio streaming experiment validates end-to-end feasibility. Our results show how passive voice sensing can be turned into explainable DSM-5 indicator scores, bridging the gap between black-box detection and clinically interpretable, on-device analysis.
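The abstract mentions FDR correction without naming a procedure. The sketch below assumes the common Benjamini-Hochberg step-up procedure; the function name and the p-values are illustrative and are not taken from the paper.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans, True where the hypothesis is rejected.

    Assumption: Benjamini-Hochberg is used for the FDR correction.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k with p_(k) <= (k / m) * alpha.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff = rank
    # Reject every hypothesis ranked at or below the cutoff (step-up rule).
    rejected = [False] * m
    for idx in order[:cutoff]:
        rejected[idx] = True
    return rejected

# Made-up p-values, one per hypothetical feature-indicator association test.
p_vals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
print(benjamini_hochberg(p_vals))
```

With these eight made-up p-values at alpha = 0.05, only the two smallest pass their rank-scaled thresholds. An uncorrected 0.05 cutoff would have accepted five.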

[287] Voiced-Aware Style Extraction and Style Direction Adjustment for Expressive Text-to-Speech

Nam-Gyu Kim

Main category: cs.SD

TL;DR: SpotlightTTS improves expressive speech synthesis by focusing on voiced regions for style extraction and adjusting style direction for better integration.

Motivation: Synthesizing high-quality expressive speech remains challenging despite advances in style embedding methods from reference speech.

Method: Uses voiced-aware style extraction focusing on voiced regions related to style while maintaining continuity, and adjusts style direction for optimal TTS integration.

Result: Achieves superior performance compared to baseline models in expressiveness, overall speech quality, and style transfer capability.

Conclusion: SpotlightTTS effectively enhances expressive speech synthesis through focused style extraction and direction adjustment.

Abstract: Recent advances in expressive text-to-speech (TTS) have introduced diverse methods based on style embedding extracted from reference speech. However, synthesizing high-quality expressive speech remains challenging. We propose SpotlightTTS, which exclusively emphasizes style via voiced-aware style extraction and style direction adjustment. Voiced-aware style extraction focuses on voiced regions highly related to style while maintaining continuity across different speech regions to improve expressiveness. We adjust the direction of the extracted style for optimal integration into the TTS model, which improves speech quality. Experimental results demonstrate that SpotlightTTS achieves superior performance compared to baseline models in terms of expressiveness, overall speech quality, and style transfer capability.

[288] Fine-tuning Pre-trained Audio Models for COVID-19 Detection: A Technical Report

Daniel Oliveira de Brito, Letícia Gabriella de Souza, Marcelo Matheus Gauy, Marcelo Finger, Arnaldo Candido Junior

Main category: cs.SD

TL;DR: Pre-trained audio models show limited COVID-19 detection performance, with severe cross-dataset generalization failure and demographic balancing revealing inflated metrics.

Motivation: To investigate the performance of pre-trained audio models on COVID-19 detection tasks and evaluate their generalization capabilities across different datasets while controlling for demographic confounding factors.

Method: Fine-tuned Audio-MAE and three PANN architectures (CNN6, CNN10, CNN14) on Coswara and COUGHVID datasets with strict demographic stratification by age and gender to prevent spurious correlations.

Result: Intra-dataset performance was moderate (Coswara: 0.82 AUC, 0.76 F1-score; COUGHVID: 0.58-0.63 AUC), but cross-dataset evaluation showed severe generalization failure (AUC 0.43-0.68) with Audio-MAE performance dropping significantly (F1-score 0.00-0.08).

Conclusion: Demographic balancing provides more realistic assessment by eliminating demographic leakage, but limited dataset sizes after balancing (1,219-2,160 samples) are insufficient for deep learning models, highlighting fundamental challenges in developing generalizable audio-based COVID-19 detection systems.

Abstract: This technical report investigates the performance of pre-trained audio models on COVID-19 detection tasks using established benchmark datasets. We fine-tuned Audio-MAE and three PANN architectures (CNN6, CNN10, CNN14) on the Coswara and COUGHVID datasets, evaluating both intra-dataset and cross-dataset generalization. We implemented a strict demographic stratification by age and gender to prevent models from exploiting spurious correlations between demographic characteristics and COVID-19 status. Intra-dataset results showed moderate performance, with Audio-MAE achieving the strongest result on Coswara (0.82 AUC, 0.76 F1-score), while all models demonstrated limited performance on COUGHVID (AUC 0.58-0.63). Cross-dataset evaluation revealed severe generalization failure across all models (AUC 0.43-0.68), with Audio-MAE showing strong performance degradation (F1-score 0.00-0.08). Our experiments demonstrate that demographic balancing, while reducing apparent model performance, provides a more realistic assessment of COVID-19 detection capabilities by eliminating demographic leakage, a confounding factor that inflates performance metrics. Additionally, the limited dataset sizes after balancing (1,219-2,160 samples) proved insufficient for deep learning models that typically require substantially larger training sets. These findings highlight fundamental challenges in developing generalizable audio-based COVID-19 detection systems and underscore the importance of rigorous demographic controls for clinically robust model evaluation.

[289] Aligning Generative Music AI with Human Preferences: Methods and Challenges

Dorien Herremans, Abhinaba Roy

Main category: cs.SD

TL;DR: This paper advocates for applying preference alignment techniques to music generation to bridge the gap between computational optimization and human musical appreciation, addressing challenges like temporal coherence and subjective quality assessment.

Motivation: Current generative AI for music achieves high fidelity but fails to align with nuanced human preferences due to specific loss functions, creating a fundamental gap between computational optimization and human musical appreciation.

Method: Systematic application of preference alignment techniques including MusicRL’s large-scale preference learning, multi-preference alignment frameworks like diffusion-based preference optimization in DiffRhythm+, and inference-time optimization techniques like Text2midi-InferAlign.

Result: Identified key research challenges including scalability to long-form compositions and reliability in preference modeling, with potential to address music’s unique challenges like temporal coherence, harmonic consistency, and subjective quality assessment.

Conclusion: Preference-aligned music generation can enable transformative applications in interactive composition tools and personalized music services, requiring sustained interdisciplinary research combining machine learning and music theory to create AI systems that truly serve human creative needs.

Abstract: Recent advances in generative AI for music have achieved remarkable fidelity and stylistic diversity, yet these systems often fail to align with nuanced human preferences due to the specific loss functions they use. This paper advocates for the systematic application of preference alignment techniques to music generation, addressing the fundamental gap between computational optimization and human musical appreciation. Drawing on recent breakthroughs including MusicRL’s large-scale preference learning, multi-preference alignment frameworks like diffusion-based preference optimization in DiffRhythm+, and inference-time optimization techniques like Text2midi-InferAlign, we discuss how these techniques can address music’s unique challenges: temporal coherence, harmonic consistency, and subjective quality assessment. We identify key research challenges, including scalability to long-form compositions and reliability in preference modelling, amongst others. Looking forward, we envision preference-aligned music generation enabling transformative applications in interactive composition tools and personalized music services. This work calls for sustained interdisciplinary research combining advances in machine learning and music theory to create music AI systems that truly serve human creative and experiential needs.

[290] LargeSHS: A large-scale dataset of music adaptation

Chih-Pin Tan, Hsuan-Kai Kao, Li Su, Yi-Hsuan Yang

Main category: cs.SD

TL;DR: LargeSHS is a large-scale dataset for reference-based music generation, containing 1.7M metadata entries and 900K audio links with structured adaptation relationships between musical works.

Motivation: To address the gap in AI-based music generation research, which has focused heavily on text-conditioned models with less attention to reference-based generation like song adaptation.

Method: Created LargeSHS dataset derived from SecondHandSongs, featuring structured adaptation relationships that enable construction of adaptation trees and performance clusters representing cover song families.

Result: A comprehensive dataset with 1.7M metadata entries and 900K publicly accessible audio links, providing unique scale and richness compared to existing datasets.

Conclusion: LargeSHS enables new research in cover song generation, reference-based music generation, and adaptation-aware music information retrieval tasks.

Abstract: Recent advances in AI-based music generation have focused heavily on text-conditioned models, with less attention given to reference-based generation such as song adaptation. To support this line of research, we introduce LargeSHS, a large-scale dataset derived from SecondHandSongs, containing over 1.7 million metadata entries and approximately 900k publicly accessible audio links. Unlike existing datasets, LargeSHS includes structured adaptation relationships between musical works, enabling the construction of adaptation trees and performance clusters that represent cover song families. We provide comprehensive statistics and comparisons with existing datasets, highlighting the unique scale and richness of LargeSHS. This dataset paves the way for new research in cover song generation, reference-based music generation, and adaptation-aware MIR tasks.

[291] A Novel CustNetGC Boosted Model with Spectral Features for Parkinson’s Disease Prediction

Abishek Karthik, Pandiyaraju V, Dominic Savio M, Rohit Swaminathan S

Main category: cs.SD

TL;DR: Novel CustNetGC model combining CNN, Custom Network Grad-CAM and CatBoost achieves 99.06% accuracy for Parkinson’s Disease diagnosis using vocal feature analysis.

Motivation: Parkinson’s disease is difficult to diagnose early, and vocal changes are promising markers for early detection. The paper aims to improve PD diagnosis efficiency using acoustic feature analysis.

Method: Used voice recordings from 81 participants (40 PD, 41 healthy). Extracted L-mHP and Spectral Slopes features from spectrograms using Harmonic-Percussive Source Separation. Developed CustNetGC model combining CNN with Custom Network Grad-CAM for visualization and CatBoost for classification.

Result: Achieved 99.06% accuracy, 95.83% precision, with AUC of 0.90 for PD class and 0.89 for HC class. Model successfully classified PD and non-PD samples with high performance.

Conclusion: CustNetGC system shows potential for improving diagnostic accuracy and interpretability of Parkinson’s Disease prediction models through vocal analysis.

Abstract: Parkinson’s disease is a neurodegenerative disorder that can be very tricky to diagnose and treat. Early symptoms can include tremors, wheezy breathing, and changes in voice quality, which are critical indicators of neural damage. Notably, there has been growing interest in utilizing changes in vocal attributes as markers for the early detection of PD. Based on this understanding, the present paper focuses on acoustic feature analysis of voice recordings from patients diagnosed with PD and healthy controls (HC). In this paper, we introduce a novel classification and visualization model known as CustNetGC, combining a Convolutional Neural Network (CNN) with Custom Network Grad-CAM and CatBoost to enhance the efficiency of PD diagnosis. We use a publicly available dataset from Figshare, including voice recordings of 81 participants: 40 patients with PD and 41 healthy controls. From these recordings, we extracted the key spectral features: L-mHP and Spectral Slopes. The L-mHP feature combines three spectrogram representations: Log-Mel spectrogram, harmonic spectrogram, and percussive spectrogram, which are derived using Harmonic-Percussive Source Separation (HPSS). Grad-CAM was used to highlight the important regions in the data, thus making the PD predictions interpretable and effective. Our proposed CustNetGC model achieved an accuracy of 99.06% and precision of 95.83%, with the area under the ROC curve (AUC) recorded at 0.90 for the PD class and 0.89 for the HC class. Additionally, the combination with CatBoost, a gradient boosting algorithm, enhanced the robustness and the prediction performance by properly classifying PD and non-PD samples. Therefore, the results indicate the potential of the CustNetGC system to enhance the diagnostic accuracy and interpretability of the Parkinson’s Disease prediction model.

[292] MelodySim: Measuring Melody-aware Music Similarity for Plagiarism Detection

Tongyu Lu, Charlotta-Marlena Geist, Jan Melechovsky, Abhinaba Roy, Dorien Herremans

Main category: cs.SD

TL;DR: MelodySim is a melody-aware music similarity model and dataset for plagiarism detection, using augmented MIDI data and a triplet neural network to detect melodic similarity.

Motivation: To address the need for effective melody-based plagiarism detection in music by creating a specialized dataset and model that focuses on melodic similarity rather than overall musical similarity.

Method: 1) Dataset construction: Augment Slakh2100 MIDI dataset with melody-preserving variations (note splitting, arpeggiation, track dropout, re-instrumentation). 2) Model: Segment-wise melodic-similarity detection using MERT encoder and triplet neural network to generate decision matrices.

Result: User study confirms positive pairs contain similar melodies while other tracks are significantly changed. Model outperforms baseline models in detecting similar melodic fragments on the MelodySim test set.

Conclusion: MelodySim provides an effective framework for melody-aware music similarity detection and plagiarism identification through specialized dataset construction and a segment-wise melodic similarity model.

Abstract: We propose MelodySim, a melody-aware music similarity model and dataset for plagiarism detection. First, we introduce a novel method to construct a dataset focused on melodic similarity. By augmenting Slakh2100, an existing MIDI dataset, we generate variations of each piece while preserving the melody through modifications such as note splitting, arpeggiation, minor track dropout, and re-instrumentation. A user study confirms that positive pairs indeed contain similar melodies, while other musical tracks are significantly changed. Second, we develop a segment-wise melodic-similarity detection model that uses a MERT encoder and applies a triplet neural network to capture melodic similarity. The resulting decision matrix highlights where plagiarism might occur. The experiments show that our model is able to outperform baseline models in detecting similar melodic fragments on the MelodySim test set.
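To make the triplet objective and the segment-wise decision matrix concrete, here is a minimal sketch using plain Python lists as stand-ins for segment embeddings. The margin-based loss, the Euclidean distance, the threshold rule, and all names are assumptions. The abstract only states that a triplet neural network on top of a MERT encoder is used.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two embedding vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard margin-based triplet loss (assumed form).

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

def decision_matrix(segments_a, segments_b, threshold):
    """Mark segment pairs whose embeddings are closer than `threshold`.

    Rows index segments of piece A, columns index segments of piece B.
    """
    return [[euclidean(a, b) < threshold for b in segments_b] for a in segments_a]

# Toy 2-D embeddings standing in for encoder outputs (hypothetical values).
anchor, positive = [0.0, 0.0], [0.0, 1.0]
print(triplet_loss(anchor, positive, [3.0, 4.0]))  # far negative: loss is 0.0
print(triplet_loss(anchor, positive, [0.0, 1.5]))  # near negative: loss is 0.5

piece_a = [[0.0, 0.0], [1.0, 1.0]]
piece_b = [[0.1, 0.0], [5.0, 5.0]]
print(decision_matrix(piece_a, piece_b, threshold=0.5))
```

In this toy run only the first segment of each piece is flagged as melodically close, which is the kind of localized signal a decision matrix is meant to expose.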

[293] AcousTools: A ‘Full-Stack’, Python-Based, Acoustic Holography Library

Joshua Mukherjee, Giorgos Christopoulos, Zhouyang Shen, Sriram Subramanian, Ryuji Hirayama

Main category: cs.SD

TL;DR: AcousTools is a Python-based acoustic holography library that provides a full-stack solution for acoustic holography applications, covering setup, modeling, phase retrieval, analysis, and hardware control.

Motivation: Existing software libraries for acoustic holography fail to provide a complete solution covering all aspects from abstraction to physicalization, creating a need for a comprehensive framework.

Method: Developed AcousTools as a Python library that supports the full suite of acoustic holographic applications, including setup, acoustic propagation modeling, transducer phase retrieval, sound field analysis, and hardware control.

Result: AcousTools successfully meets each step of the full-stack requirements and provides a uniquely complete feature set in an easy-to-use Python environment.

Conclusion: AcousTools has the potential to become the standard library for acoustic holography, enabling researchers to develop novel applications and accurately review others’ work while providing a framework for methodology comparison.

Abstract: Acoustic Holography is an emerging field where mid-air ultrasound is controlled and manipulated for novel and exciting applications. These range from mid-air haptics, volumetric displays, and contactless fabrication to chemical and biomedical applications such as drug delivery. To develop these applications, a software framework to predict acoustic behaviour and simulate resulting effects, such as applied forces or scattering patterns, is desirable. There have been various software libraries and platforms that attempt to fill this role, but there is yet to be a single piece of software that acts as a ‘full-stack’ solution. We define this full-stack as the process from abstraction to physicalisation, starting with setup and continuing through modelling acoustic propagation, transducer phase retrieval, sound field analysis, and control of the acoustic holographic hardware itself. Existing methods fail to fulfil one or more of these categories. To address this, we present AcousTools, a Python-based acoustic holography library designed to support the full suite of acoustic holographic applications, and we show AcousTools’s ability to meet each step of the full-stack’s requirements. AcousTools has the potential to become the standard code library for acoustic holography. With a uniquely complete suite of features wrapped in a language that is known to be easy to use, AcousTools will increase the ability of researchers to develop novel applications as well as accurately review others’ work. The full-stack, aside from software, will also be useful for researchers, providing a way to view and compare methodologies by understanding where they fit into the stack.

[294] MF-Speech: Achieving Fine-Grained and Compositional Control in Speech Generation via Factor Disentanglement

Xinyue Yu, Youqing Fang, Pingyu Wu, Guoyang Ye, Wenbo Zhou, Weiming Zhang, Song Xiao

Main category: cs.SD

TL;DR: MF-Speech is a novel framework that addresses speech factor entanglement and coarse control by decomposing speech into pure content, timbre, and emotion representations, enabling fine-grained compositional speech generation.

Motivation: To overcome the fundamental challenges of deep entanglement of speech factors and coarse granularity of existing control mechanisms in expressive human speech generation.

Method: Two-component framework: MF-SpeechEncoder uses multi-objective optimization to decompose speech into pure content, timbre, and emotion representations; MF-SpeechGenerator achieves precise control through dynamic fusion and Hierarchical Style Adaptive Normalization (HSAN).

Result: Significantly outperforms SOTA methods with WER=4.67%, SECS=0.5685, Corr=0.68, and highest subjective scores (nMOS=3.96, sMOS_emotion=3.86, sMOS_style=3.78). Learned factors show strong transferability.

Conclusion: MF-Speech successfully addresses speech factor disentanglement and enables fine-grained compositional control, with learned representations showing potential as general-purpose speech representations.

Abstract: Generating expressive and controllable human speech is one of the core goals of generative artificial intelligence, but its progress has long been constrained by two fundamental challenges: the deep entanglement of speech factors and the coarse granularity of existing control mechanisms. To overcome these challenges, we propose a novel framework called MF-Speech, which consists of two core components: MF-SpeechEncoder and MF-SpeechGenerator. MF-SpeechEncoder acts as a factor purifier, adopting a multi-objective optimization strategy to decompose the original speech signal into highly pure and independent representations of content, timbre, and emotion. Subsequently, MF-SpeechGenerator functions as a conductor, achieving precise, composable and fine-grained control over these factors through dynamic fusion and Hierarchical Style Adaptive Normalization (HSAN). Experiments demonstrate that in the highly challenging multi-factor compositional speech generation task, MF-Speech significantly outperforms current state-of-the-art methods, achieving a lower word error rate (WER=4.67%), superior style control (SECS=0.5685, Corr=0.68), and the highest subjective evaluation scores (nMOS=3.96, sMOS_emotion=3.86, sMOS_style=3.78). Furthermore, the learned discrete factors exhibit strong transferability, demonstrating their significant potential as a general-purpose speech representation.

cs.LG

[295] Transformer Injectivity & Geometric Robustness - Analytic Margins and Bi-Lipschitz Uniformity of Sequence-Level Hidden States

Mikael von Strauss

Main category: cs.LG

TL;DR: Transformers are generically injective (one-to-one) in theory, and this property persists during training. Practical invertibility can be measured using geometric diagnostics like separation margins and co-Lipschitz constants.

Motivation: To understand when and how decoder-only Transformers maintain injective mappings from prompts to hidden states, and to develop practical diagnostics for measuring this property in real models.

Method: Theoretical analysis of layerwise injectivity with collision discriminants and injective strata, plus empirical study using geometric diagnostics (separation margin and co-Lipschitz constant) on pretrained models (LLaMA-3, Qwen) and a small GPT-2.

Result: Transformers are generically injective in theory, and this persists during training. No collisions found in full precision or 8-bit quantization, but 4-bit quantization causes some collisions and reduces co-Lipschitz estimates. Metrics remain stable during training.

Conclusion: Transformer representations are generically and persistently injective in continuous-parameter settings, and their practical invertibility can be effectively probed using simple geometric diagnostics.

Abstract: Under real-analytic assumptions on decoder-only Transformers, recent work shows that the map from discrete prompts to last-token hidden states is generically injective on finite prompt sets. We refine this picture: for each layer $\ell$ we define a collision discriminant $Δ^\ell \subset Θ$ and injective stratum $U^\ell = Θ\setminus Δ^\ell$, and prove a dichotomy – either the model is nowhere injective on the set, or $U^\ell$ is open and dense and every $F^\ell_θ$ is injective. Under mild non-singularity assumptions on the optimizer and an absolutely continuous initialization, generic injectivity persists along smooth training trajectories over any fixed horizon. We also treat symmetry groups $G$, showing that discriminants and injective strata descend to the quotient $Θ/G$, so injectivity is naturally a property of functional equivalence classes. We complement these results with an empirical study of layerwise geometric diagnostics. We define a separation margin and a co-Lipschitz (lower Lipschitz) constant between prompt space and last-token representation space, estimated via nearest-neighbor statistics on large prompt sets. Applying these diagnostics to pretrained LLaMA-3 and Qwen models, we study behavior across layers, sequence lengths, model scales, and 8- and 4-bit activation quantization. On our sampled prompts we see no collisions in full precision or at 8 bits, while 4-bit quantization induces a small number of collisions and markedly shrinks co-Lipschitz estimates. For a small GPT-2 trained from scratch, normalized metrics remain stable over training. Overall, the results suggest that Transformer representations are generically and persistently injective in the continuous-parameter idealization, while their practical invertibility can be probed using simple geometric diagnostics.
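The two geometric diagnostics can be illustrated on a toy prompt set. The definitions below are assumptions, not the paper's. The separation margin is taken as the smallest pairwise distance between hidden states, and the co-Lipschitz estimate as the smallest ratio of hidden-state distance to Hamming distance between equal-length prompts. The brute-force pair loop stands in for the nearest-neighbor estimation mentioned in the abstract, and the grid quantizer is a stand-in rather than the 8- or 4-bit scheme studied there.

```python
import math
from itertools import combinations

def hamming(x, y):
    """Number of differing token positions (prompts assumed equal length)."""
    return sum(a != b for a, b in zip(x, y))

def separation_margin(hidden):
    """Smallest pairwise distance between hidden states; 0.0 means a collision."""
    return min(math.dist(u, v) for u, v in combinations(hidden, 2))

def co_lipschitz(prompts, hidden):
    """Smallest ratio of representation distance to prompt distance (assumed form)."""
    pairs = combinations(range(len(prompts)), 2)
    return min(
        math.dist(hidden[i], hidden[j]) / hamming(prompts[i], prompts[j])
        for i, j in pairs
        if prompts[i] != prompts[j]
    )

def quantize(hidden, step):
    """Toy uniform quantizer: snap every coordinate to a grid of width `step`."""
    return [[round(x / step) * step for x in vec] for vec in hidden]

# Hypothetical token-id prompts and 2-D last-token hidden states.
prompts = [(1, 2, 3), (1, 2, 4), (5, 6, 7)]
hidden = [[0.0, 0.0], [0.3, 0.4], [3.0, 4.0]]

print(separation_margin(hidden))                  # 0.5
print(co_lipschitz(prompts, hidden))              # 0.5
print(separation_margin(quantize(hidden, 1.0)))   # 0.0: coarse grid merges two states
```

In this toy setup, coarse quantization maps two distinct prompts to the same state and the margin drops to zero. This mirrors, in miniature, the collisions the abstract reports under 4-bit quantization.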

[296] DEVAL: A Framework for Evaluating and Improving the Derivation Capability of Large Language Models

Yifan Li, Qin Li, Min Zhang, Min Zhang, Peixin Wang

Main category: cs.LG

TL;DR: The paper introduces Derivation Capability (DC) - the ability of LLMs to modify outputs based on input changes using abstract rules, and proposes DEVAL framework to evaluate this capability in popular models.

Motivation: Current LLMs lack comprehensive evaluation for reasoning patterns that involve deriving output modifications from input changes using abstract rules, which is a key aspect of human reasoning.

Method: Proposes DEVAL evaluation framework to assess Derivation Capability, and introduces Derivation Prompting (DP) - a novel prompt engineering approach to improve this capability.

Result: Mainstream LLMs like GPT-4o and Claude3.5 show moderate DR recognition but significant drop-offs in applying DR effectively. DP achieves 15.2% average improvement in DC across all tested LLMs.

Conclusion: Derivation Prompting effectively enhances LLMs’ Derivation Capability, outperforming common prompt engineering techniques and addressing a critical gap in LLM reasoning evaluation.

Abstract: Assessing the reasoning ability of Large Language Models (LLMs) over data remains an open and pressing research question. Compared with LLMs, human reasoning can derive corresponding modifications to the output based on certain kinds of changes to the input. This reasoning pattern, which relies on abstract rules that govern relationships between changes of data, has not been comprehensively described or evaluated in LLMs. In this paper, we formally define this reasoning pattern as the Derivation Relation (DR) and introduce the concept of Derivation Capability (DC), i.e. applying DR by making the corresponding modification to the output whenever the input takes certain changes. To assess DC, a systematically constructed evaluation framework named DEVAL is proposed and used to evaluate five popular LLMs and one Large Reasoning Model in seven mainstream tasks. The evaluation results show that mainstream LLMs, such as GPT-4o and Claude3.5, exhibit moderate DR recognition capabilities but reveal significant drop-offs on applying DR effectively in problem-solving scenarios. To improve this, we propose a novel prompt engineering approach called Derivation Prompting (DP). It achieves an average improvement of 15.2% in DC for all tested LLMs, outperforming commonly used prompt engineering techniques.

[297] Dynamic Nested Hierarchies: Pioneering Self-Evolution in Machine Learning Architectures for Lifelong Intelligence

Akbar Anbar Jafari, Cagri Ozcinar, Gholamreza Anbarjafari

Main category: cs.LG

TL;DR: Dynamic nested hierarchies enable ML models to autonomously adjust optimization levels, nesting structures, and update frequencies during training/inference, overcoming rigid architectures for true lifelong learning.

DetailsMotivation: Address the limitations of current ML models that fail in non-stationary environments due to fixed architectures, preventing continual adaptation and lifelong learning.

Method: Extends nested learning paradigm with dynamic hierarchies that allow autonomous adjustment of optimization levels, nesting structures, and update frequencies, inspired by neuroplasticity principles.

Result: Demonstrates superior performance in language modeling, continual learning, and long-context reasoning with theoretical proofs of convergence, expressivity bounds, and sublinear regret.

Conclusion: Dynamic nested hierarchies represent a foundational advancement toward adaptive, general-purpose intelligence by enabling true lifelong learning through self-evolving architectures.

Abstract: Contemporary machine learning models, including large language models, exhibit remarkable capabilities in static tasks yet falter in non-stationary environments due to rigid architectures that hinder continual adaptation and lifelong learning. Building upon the nested learning paradigm, which decomposes models into multi-level optimization problems with fixed update frequencies, this work proposes dynamic nested hierarchies as the next evolutionary step in advancing artificial intelligence and machine learning. Dynamic nested hierarchies empower models to autonomously adjust the number of optimization levels, their nesting structures, and update frequencies during training or inference, inspired by neuroplasticity to enable self-evolution without predefined constraints. This innovation addresses the anterograde amnesia in existing models, facilitating true lifelong learning by dynamically compressing context flows and adapting to distribution shifts. Through rigorous mathematical formulations, theoretical proofs of convergence, expressivity bounds, and sublinear regret in varying regimes, alongside empirical demonstrations of superior performance in language modeling, continual learning, and long-context reasoning, dynamic nested hierarchies establish a foundational advancement toward adaptive, general-purpose intelligence.

[298] Empowering Multi-Turn Tool-Integrated Reasoning with Group Turn Policy Optimization

Yifeng Ding, Hung Le, Songyang Han, Kangrui Ruan, Zhenghui Jin, Varun Kumar, Zijian Wang, Anoop Deoras

Main category: cs.LG

TL;DR: GTPO is a novel RL algorithm that improves multi-turn Tool-Integrated Reasoning for LLMs through turn-level rewards, return-based advantage estimation, and self-supervised reward shaping, outperforming GRPO by 3.0% on reasoning benchmarks.

DetailsMotivation: Existing RL methods like GRPO use coarse-grained trajectory-level rewards that provide insufficient learning signals for complex multi-turn Tool-Integrated Reasoning, leading to training stagnation.

Method: GTPO introduces three innovations: turn-level reward assignment for fine-grained feedback, return-based advantage estimation using normalized discounted returns, and self-supervised reward shaping that densifies sparse binary rewards using code execution signals.

Result: GTPO outperforms GRPO by 3.0% on average across diverse reasoning benchmarks, demonstrating improved performance for complex mathematical reasoning tasks.

Conclusion: GTPO effectively addresses the limitations of existing RL approaches for multi-turn TIR tasks and advances complex mathematical reasoning capabilities in real-world applications.

Abstract: Training Large Language Models (LLMs) for multi-turn Tool-Integrated Reasoning (TIR) - where models iteratively reason, generate code, and verify through execution - remains challenging for existing reinforcement learning (RL) approaches. Current RL methods, exemplified by Group Relative Policy Optimization (GRPO), suffer from coarse-grained, trajectory-level rewards that provide insufficient learning signals for complex multi-turn interactions, leading to training stagnation. To address this issue, we propose Group Turn Policy Optimization (GTPO), a novel RL algorithm specifically designed for training LLMs on multi-turn TIR tasks. GTPO introduces three key innovations: (1) turn-level reward assignment that provides fine-grained feedback for individual turns, (2) return-based advantage estimation where normalized discounted returns are calculated as advantages, and (3) self-supervised reward shaping that exploits self-supervision signals from generated code to densify sparse binary outcome-based rewards. Our comprehensive evaluation demonstrates that GTPO outperforms GRPO by 3.0% on average across diverse reasoning benchmarks, establishing its effectiveness for advancing complex mathematical reasoning in the real world.
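
The return-based advantage estimation described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the discount factor `gamma`, and the choice to normalize over all turns of all trajectories in a group are assumptions made for the example.

```python
from statistics import mean, pstdev


def discounted_returns(turn_rewards, gamma=0.9):
    """Compute G_t = r_t + gamma * G_{t+1} for every turn of one trajectory."""
    returns = []
    running = 0.0
    for reward in reversed(turn_rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def group_normalized_advantages(group_turn_rewards, gamma=0.9, eps=1e-8):
    """Use normalized discounted returns as turn-level advantages.

    Assumption: mean and standard deviation are taken over every turn of
    every trajectory sampled for the same prompt (the "group").
    """
    all_returns = [discounted_returns(r, gamma) for r in group_turn_rewards]
    flat = [g for trajectory in all_returns for g in trajectory]
    mu, sigma = mean(flat), pstdev(flat)
    return [[(g - mu) / (sigma + eps) for g in trajectory]
            for trajectory in all_returns]
```

With per-turn rewards, earlier turns of a successful trajectory still receive credit through the discounted return, which is the fine-grained signal a single trajectory-level reward cannot provide.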

[299] FinTRec: Transformer Based Unified Contextual Ads Targeting and Personalization for Financial Applications

Dwipam Katariya, Snehita Varma, Akshat Shreemali, Benjamin Wu, Kalanand Mishra, Pranab Mohanty

Main category: cs.LG

TL;DR: FinTRec is a transformer-based framework for financial services recommendation that addresses challenges of long-range user interactions and multiple interrelated products, outperforming traditional tree-based models while enabling cross-product signal sharing.

DetailsMotivation: Transformer architectures face practical challenges in financial services including long-range user interactions across digital/physical channels and the need for coordinated models for multiple products with competing business goals.

Method: Proposed FinTRec framework using transformer architecture for sequential recommendation, with fine-tuning for product adaptation to enable cross-product signal sharing.

Result: FinTRec consistently outperforms production-grade tree-based baseline in historic simulations and live A/B tests, improves offline performance across all products while reducing training costs and technical debt.

Conclusion: FinTRec demonstrates transformer-based architectures are viable and effective for financial services recommendation, offering the first comprehensive study addressing both technical and business considerations in this domain.

Abstract: Transformer-based architectures are widely adopted in sequential recommendation systems, yet their application in Financial Services (FS) presents distinct practical and modeling challenges for real-time recommendation. These include: a) long-range user interactions (implicit and explicit) spanning both digital and physical channels, generating temporally heterogeneous context, and b) the presence of multiple interrelated products that require coordinated models to support varied ad placements and personalized feeds while balancing competing business goals. We propose FinTRec, a transformer-based framework that addresses these challenges and its operational objectives in FS. While tree-based models have traditionally been preferred in FS due to their explainability and alignment with regulatory requirements, our study demonstrates that FinTRec offers a viable and effective shift toward transformer-based architectures. Through historic simulation and live A/B test correlations, we show FinTRec consistently outperforms the production-grade tree-based baseline. The unified architecture, when fine-tuned for product adaptation, enables cross-product signal sharing and reduces training cost and technical debt while improving offline performance across all products. To our knowledge, this is the first comprehensive study of unified sequential recommendation modeling in FS that addresses both technical and business considerations.

[300] Transformer-Guided Deep Reinforcement Learning for Optimal Takeoff Trajectory Design of an eVTOL Drone

Nathan M. Roberts, Xiaosong Du

Main category: cs.LG

TL;DR: Transformer-guided DRL improves training efficiency for eVTOL takeoff trajectory optimization, needing only about 25% of the training time steps required by vanilla DRL and achieving 97.2% accuracy on optimal energy consumption.

DetailsMotivation: To address the training difficulty bottleneck in deep reinforcement learning for complex eVTOL trajectory optimization, while overcoming limitations of conventional optimal control methods.

Method: Proposed transformer-guided DRL that uses a transformer to explore realistic state space at each time step for optimal takeoff trajectory design with power and wing angle controls.

Result: Transformer-guided DRL learned in 4.57×10^6 steps (25% of vanilla DRL’s 19.79×10^6 steps) and achieved 97.2% accuracy in optimal energy consumption vs 96.3% for vanilla DRL.

Conclusion: Transformer-guided DRL outperforms vanilla DRL in both training efficiency and optimal design verification for eVTOL takeoff trajectory optimization.

Abstract: The rapid advancement of electric vertical take-off and landing (eVTOL) aircraft offers a promising opportunity to alleviate urban traffic congestion. Thus, developing optimal takeoff trajectories for minimum energy consumption becomes essential for broader eVTOL aircraft applications. Conventional optimal control methods (such as dynamic programming and linear quadratic regulator) provide highly efficient and well-established solutions but are limited by problem dimensionality and complexity. Deep reinforcement learning (DRL) emerges as a special type of artificial intelligence tackling complex, nonlinear systems; however, the training difficulty is a key bottleneck that limits DRL applications. To address these challenges, we propose the transformer-guided DRL to alleviate the training difficulty by exploring a realistic state space at each time step using a transformer. The proposed transformer-guided DRL was demonstrated on an optimal takeoff trajectory design of an eVTOL drone for minimal energy consumption while meeting takeoff conditions (i.e., minimum vertical displacement and minimum horizontal velocity) by varying control variables (i.e., power and wing angle to the vertical). Results showed that the transformer-guided DRL agent learned to take off with $4.57\times10^6$ time steps, representing 25% of the $19.79\times10^6$ time steps needed by a vanilla DRL agent. In addition, the transformer-guided DRL achieved 97.2% accuracy on the optimal energy consumption compared against the simulation-based optimal reference while the vanilla DRL achieved 96.3% accuracy. Therefore, the proposed transformer-guided DRL outperformed vanilla DRL in terms of both training efficiency and optimal design verification.

[301] Bringing Federated Learning to Space

Grace Kim, Filip Svoboda, Nicholas Lane

Main category: cs.LG

TL;DR: First systematic feasibility analysis of adapting federated learning for satellite constellations, showing space-adapted FL algorithms can scale to 100 satellites with 9x speedup through orbital scheduling.

DetailsMotivation: LEO satellite constellations are rapidly expanding, creating downlink bandwidth limitations that require distributed on-board machine learning to enable more autonomous and data-driven satellite operations.

Method: Introduced a ‘space-ification’ framework to adapt terrestrial FL algorithms (FedAvg, FedProx, FedBuff) for orbital constraints, then evaluated through 768 constellation configurations varying cluster sizes, satellites per cluster, and ground station networks.

Result: Space-adapted FL algorithms efficiently scale to constellations of up to 100 satellites, achieving performance close to centralized ideal. Multi-month training cycles can be reduced to days with 9x speedup through orbital scheduling and local coordination.

Conclusion: The analysis provides actionable insights for mission designers to enable distributed on-board learning for more autonomous, resilient satellite operations in large constellations.

Abstract: As Low Earth Orbit (LEO) satellite constellations rapidly expand to hundreds and thousands of spacecraft, the need for distributed on-board machine learning becomes critical to address downlink bandwidth limitations. Federated learning (FL) offers a promising framework to conduct collaborative model training across satellite networks. Realizing its benefits in space naturally requires addressing space-specific constraints, from intermittent connectivity to dynamics imposed by orbital motion. This work presents the first systematic feasibility analysis of adapting off-the-shelf FL algorithms for satellite constellation deployment. We introduce a comprehensive “space-ification” framework that adapts terrestrial algorithms (FedAvg, FedProx, FedBuff) to operate under orbital constraints, producing an orbital-ready suite of FL algorithms. We then evaluate these space-ified methods through extensive parameter sweeps across 768 constellation configurations that vary cluster sizes (1-10), satellites per cluster (1-10), and ground station networks (1-13). Our analysis demonstrates that space-adapted FL algorithms efficiently scale to constellations of up to 100 satellites, achieving performance close to the centralized ideal. Multi-month training cycles can be reduced to days, corresponding to a 9x speedup through orbital scheduling and local coordination within satellite clusters. These results provide actionable insights for future mission designers, enabling distributed on-board learning for more autonomous, resilient, and data-driven satellite operations.
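
For readers unfamiliar with the terrestrial baseline being adapted, the aggregation step of FedAvg can be sketched as a sample-weighted average of client parameters. The sketch below is the standard FedAvg rule only; it does not model orbital scheduling, intermittent connectivity, or any other part of the paper's space-ification framework, and the function and argument names are chosen for illustration.

```python
def fedavg(client_weights, client_sizes):
    """Average client parameter vectors, weighted by local sample counts.

    client_weights: one flat list of floats per client (here: per satellite).
    client_sizes:   number of local training samples held by each client.
    """
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]


# Two clients; the second holds three times as much data.
global_model = fedavg([[1.0, 2.0], [3.0, 4.0]], [1, 3])
```

In a constellation setting, the list of clients passed to such a function would presumably be limited to the satellites that could deliver an update in the current round.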

[302] It’s LIT! Reliability-Optimized LLMs with Inspectable Tools

Ruixin Zhang, Jon Donnelly, Zhicheng Guo, Ghazal Khalighinejad, Haiyang Huang, Alina Jade Barnett, Cynthia Rudin

Main category: cs.LG

TL;DR: LIT framework forces LLMs to use reliable external tools for problem-solving, improving trustworthiness while maintaining performance.

DetailsMotivation: LLMs have opaque reasoning processes and can choose unreliable solutions, limiting their usefulness in high-stakes domains where trust and troubleshootability are critical.

Method: Built on existing LLM tool-calling capabilities, the LIT framework uses reliability cost functions to guide LLMs in selecting the most reliable and easy-to-troubleshoot solution paths, which may involve multiple sequential tool calls.

Result: LLMs achieve more reliable and informed problem-solving while maintaining task performance, demonstrated on a benchmark of 1,300 questions using specialized tools interacting with patent and research paper datasets.

Conclusion: The LIT framework successfully enables LLMs to make more trustworthy decisions by prioritizing reliable and inspectable tools, addressing the opacity and reliability issues in LLM reasoning.

Abstract: Large language models (LLMs) have exhibited remarkable capabilities across various domains. The ability to call external tools further expands their capability to handle real-world tasks. However, LLMs often follow an opaque reasoning process, which limits their usefulness in high-stakes domains where solutions need to be trustworthy to end users. LLMs can choose solutions that are unreliable and difficult to troubleshoot, even if better options are available. We address this issue by forcing LLMs to use external – more reliable – tools to solve problems when possible. We present a framework built on the tool-calling capabilities of existing LLMs to enable them to select the most reliable and easy-to-troubleshoot solution path, which may involve multiple sequential tool calls. We refer to this framework as LIT (LLMs with Inspectable Tools). In order to support LIT, we introduce a new and challenging benchmark dataset of 1,300 questions and a customizable set of reliability cost functions associated with a collection of specialized tools. These cost functions summarize how reliable each tool is and how easy it is to troubleshoot. For instance, a calculator is reliable across domains, whereas a linear prediction model is not reliable if there is distribution shift, but it is easy to troubleshoot. A tool that constructs a random forest is neither reliable nor easy to troubleshoot. These tools interact with the Harvard USPTO Patent Dataset and a new dataset of NeurIPS 2023 papers to solve mathematical, coding, and modeling problems of varying difficulty levels. We demonstrate that LLMs can achieve more reliable and informed problem-solving while maintaining task performance using our framework.
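
The idea of ranking solution paths by a reliability cost can be illustrated with a toy example. Everything concrete here is hypothetical: the cost values, the tool names as dictionary keys, and the additive aggregation over a path are assumptions for the sketch, not the cost functions defined in the paper. Only the ordering (calculator cheapest, random forest most expensive) follows the abstract.

```python
# Hypothetical costs: lower means more reliable and easier to troubleshoot.
TOOL_COST = {
    "calculator": 1.0,
    "linear_model": 3.0,
    "random_forest": 6.0,
}


def path_cost(path):
    """Sum the per-tool costs along one candidate sequence of tool calls."""
    return sum(TOOL_COST[tool] for tool in path)


def most_reliable_path(candidate_paths):
    """Pick the candidate path with the lowest total reliability cost."""
    return min(candidate_paths, key=path_cost)


best = most_reliable_path([
    ["random_forest"],
    ["linear_model", "calculator"],
])
```

Under these made-up numbers the two-step path is preferred even though it uses more tool calls, because each step is easier to inspect.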

[303] Model Merging Improves Zero-Shot Generalization in Bioacoustic Foundation Models

Davide Marincione, Donato Crisostomi, Roberto Dessi, Emanuele Rodolà, Emanuele Rossi

Main category: cs.LG

TL;DR: NatureLM, a bioacoustics foundation model, shows strong domain performance but struggles with complex instructions. Merging it with its base language model recovers instruction-following with minimal loss of domain expertise and yields over 200% relative improvement in zero-shot classification of unseen species.

DetailsMotivation: To address the trade-off between domain-specific performance and instruction-following flexibility in bioacoustic foundation models like NatureLM, which performs well on single tasks but fails on complex multi-task prompts.

Method: Applied a simple model merging strategy that interpolates NatureLM with its base language model to balance domain expertise and general instruction-following capabilities.

Result: The merged model recovered instruction-following capabilities with minimal domain knowledge loss and achieved over 200% relative improvement in zero-shot generalization, setting new state-of-the-art in closed-set zero-shot classification of unseen species.

Conclusion: Model merging effectively resolves the instruction-following limitations of domain-specialized foundation models while enhancing zero-shot generalization capabilities in bioacoustics.

Abstract: Foundation models capable of generalizing across species and tasks represent a promising new frontier in bioacoustics, with NatureLM being one of the most prominent examples. While its domain-specific fine-tuning yields strong performance on bioacoustic benchmarks, we observe that it also introduces trade-offs in instruction-following flexibility. For instance, NatureLM achieves high accuracy when prompted for either the common or scientific name individually, but its accuracy drops significantly when both are requested in a single prompt. We address this by applying a simple model merging strategy that interpolates NatureLM with its base language model, recovering instruction-following capabilities with minimal loss of domain expertise. Finally, we show that the merged model exhibits markedly stronger zero-shot generalization, achieving over a 200% relative improvement and setting a new state-of-the-art in closed-set zero-shot classification of unseen species.
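
A weight-interpolation merge of the kind described above can be sketched as follows. This is a generic linear interpolation under stated assumptions: parameters are represented as plain dictionaries of float lists, and `alpha` is a hypothetical mixing coefficient; the paper's exact merging recipe and coefficient are not given in this summary.

```python
def interpolate(base, finetuned, alpha):
    """Return (1 - alpha) * base + alpha * finetuned, parameter by parameter.

    alpha = 0.0 recovers the base language model, alpha = 1.0 the
    domain-tuned model; values in between trade one off against the other.
    """
    if base.keys() != finetuned.keys():
        raise ValueError("models must share the same parameter names")
    return {
        name: [(1.0 - alpha) * b + alpha * f
               for b, f in zip(base[name], finetuned[name])]
        for name in base
    }


merged = interpolate({"w": [0.0, 2.0]}, {"w": [1.0, 4.0]}, alpha=0.5)
```

Because the two models share an architecture and initialization, a single scalar is enough to sweep between instruction-following flexibility and domain expertise.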

[304] Structured Contrastive Learning for Interpretable Latent Representations

Zhengyang Shen, Hua Tu, Mayue Shi

Main category: cs.LG

TL;DR: SCL partitions latent spaces into invariant, variant, and free features to handle semantic transformations, improving ECG phase shift robustness (0.25→0.91 similarity) and IMU rotation performance (86.65% accuracy) without architectural changes.

DetailsMotivation: Neural networks are brittle to semantically irrelevant transformations: a small ECG phase shift sharply degrades latent similarity, and IMU sensor rotations collapse activity recognition performance, because latent spaces are left unconstrained as long as task performance is satisfied.

Method: Structured Contrastive Learning (SCL) partitions latent space into three semantic groups: invariant features (consistent under transformations), variant features (differentiate transformations), and free features (preserve task flexibility), creating controllable push-pull dynamics.

Result: ECG similarity improved from 0.25 to 0.91 under phase shifts; WISDM activity recognition achieved 86.65% accuracy with 95.38% rotation consistency, outperforming traditional data augmentation methods.

Conclusion: SCL represents a paradigm shift from reactive data augmentation to proactive structural learning, enabling interpretable latent representations that maintain robustness to semantic transformations while preserving task performance.

Abstract: Neural networks exhibit severe brittleness to semantically irrelevant transformations. A mere 75ms electrocardiogram (ECG) phase shift degrades latent cosine similarity from 1.0 to 0.2, while sensor rotations collapse activity recognition performance with inertial measurement units (IMUs). We identify the root cause as “laissez-faire” representation learning, where latent spaces evolve unconstrained provided task performance is satisfied. We propose Structured Contrastive Learning (SCL), a framework that partitions latent space representations into three semantic groups: invariant features that remain consistent under given transformations (e.g., phase shifts or rotations), variant features that actively differentiate transformations via a novel variant mechanism, and free features that preserve task flexibility. This creates controllable push-pull dynamics where different latent dimensions serve distinct, interpretable purposes. The variant mechanism enhances contrastive learning by encouraging variant features to differentiate within positive pairs, enabling simultaneous robustness and interpretability. Our approach requires no architectural modifications and integrates seamlessly into existing training pipelines. Experiments on ECG phase invariance and IMU rotation robustness demonstrate superior performance: ECG similarity improves from 0.25 to 0.91 under phase shifts, while WISDM activity recognition achieves 86.65% accuracy with 95.38% rotation consistency, consistently outperforming traditional data augmentation. This work represents a paradigm shift from reactive data augmentation to proactive structural learning, enabling interpretable latent representations in neural networks.
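
The push-pull idea over a partitioned latent vector can be sketched with a toy penalty. This is an illustrative assumption, not the SCL loss from the paper: the slice sizes `n_inv` and `n_var`, the use of cosine similarity for both terms, and the unweighted sum are all choices made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def split_latent(z, n_inv, n_var):
    """Partition a latent vector into invariant, variant, and free slices."""
    return z[:n_inv], z[n_inv:n_inv + n_var], z[n_inv + n_var:]


def structured_penalty(z, z_transformed, n_inv, n_var):
    """Pull invariant slices together, push variant slices apart.

    The free slice is deliberately left unconstrained.
    """
    inv_a, var_a, _ = split_latent(z, n_inv, n_var)
    inv_b, var_b, _ = split_latent(z_transformed, n_inv, n_var)
    pull = 1.0 - cosine(inv_a, inv_b)       # zero when invariant parts agree
    push = max(0.0, cosine(var_a, var_b))   # zero when variant parts differ
    return pull + push
```

In a training pipeline, a term like this would be added to the task loss for each pair consisting of a sample and its transformed copy (for example, a phase-shifted ECG segment).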

[305] Regularized Schrödinger Bridge: Alleviating Distortion and Exposure Bias in Solving Inverse Problems

Qing Yao, Lijian Gao, Qirong Mao, Ming Dong

Main category: cs.LG

TL;DR: RSB addresses distortion-perception tradeoff and exposure bias in diffusion models for inverse problems using regularized Schrödinger Bridge with perturbed training strategy.

DetailsMotivation: To overcome distortion-perception tradeoff and exposure bias problems in diffusion models for inverse problems, where perceptual quality improvement degrades reconstruction fidelity and training-inference mismatch reduces quality.

Method: Proposes the Regularized Schrödinger Bridge (RSB), trained with a novel regularized strategy that perturbs both input states and targets: exposing the model to simulated prediction errors mitigates exposure bias, and interpolation via the posterior mean alleviates distortion.

Result: RSB outperforms state-of-the-art methods on speech enhancement tasks, significantly improving distortion metrics and effectively reducing exposure bias.

Conclusion: RSB successfully addresses key limitations of diffusion models for inverse problems through its regularized training approach and Schrödinger Bridge adaptation.

Abstract: Diffusion models serve as a powerful generative framework for solving inverse problems. However, they still face two key challenges: 1) the distortion-perception tradeoff, where improving perceptual quality often degrades reconstruction fidelity, and 2) the exposure bias problem, where the training-inference input mismatch leads to prediction error accumulation and reduced reconstruction quality. In this work, we propose the Regularized Schrödinger Bridge (RSB), an adaptation of Schrödinger Bridge tailored for inverse problems that addresses the above limitations. RSB employs a novel regularized training strategy that perturbs both the input states and targets, effectively mitigating exposure bias by exposing the model to simulated prediction errors and also alleviating distortion by well-designed interpolation via the posterior mean. Extensive experiments on two typical inverse problems for speech enhancement demonstrate that RSB outperforms state-of-the-art methods, significantly improving distortion metrics and effectively reducing exposure bias.

[306] To Align or Not to Align: Strategic Multimodal Representation Alignment for Optimal Performance

Wanlong Fang, Tianle Zhang, Alvin Chan

Main category: cs.LG

TL;DR: Explicit alignment between multimodal representations has mixed effects on performance depending on data redundancy; the optimal alignment strength balances modality-specific signals and shared information.

DetailsMotivation: Prior research only observed natural alignment without systematically studying effects of explicit alignment. Need to understand when explicit alignment helps or hinders performance under different modality information structures.

Method: Introduced controllable contrastive learning module to precisely manipulate alignment strength during training. Tested on synthetic and real datasets with different data characteristics.

Result: Impact of explicit alignment depends on data redundancy between modalities. Identified optimal alignment strength that balances modality-specific signals and shared redundancy in mixed information distributions.

Conclusion: Provides practical guidance on when and how to apply explicit alignment for optimal unimodal encoder performance based on modality redundancy characteristics.

Abstract: Multimodal learning often relies on aligning representations across modalities to enable effective information integration, an approach traditionally assumed to be universally beneficial. However, prior research has primarily taken an observational approach, examining naturally occurring alignment in multimodal data and exploring its correlation with model performance, without systematically studying the direct effects of explicitly enforced alignment between representations of different modalities. In this work, we investigate how explicit alignment influences both model performance and representation alignment under different modality-specific information structures. Specifically, we introduce a controllable contrastive learning module that enables precise manipulation of alignment strength during training, allowing us to explore when explicit alignment improves or hinders performance. Our results on synthetic and real datasets under different data characteristics show that the impact of explicit alignment on the performance of unimodal models is related to the characteristics of the data: the optimal level of alignment depends on the amount of redundancy between the different modalities. We identify an optimal alignment strength that balances modality-specific signals and shared redundancy in the mixed information distributions. This work provides practical guidance on when and how explicit alignment should be applied to achieve optimal unimodal encoder performance.
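
One common way to make alignment strength controllable is to add a weighted contrastive term to the task loss. The sketch below assumes that form: an InfoNCE loss over a cross-modal similarity matrix, scaled by a hypothetical coefficient `alignment_strength`. The paper's actual module may differ, and the temperature value is also an assumption.

```python
import math


def info_nce(sim, temperature=0.1):
    """InfoNCE over a square similarity matrix.

    sim[i][j] is the similarity between sample i in modality A and sample j
    in modality B; matching pairs sit on the diagonal.
    """
    losses = []
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


def total_loss(task_loss, sim, alignment_strength):
    """Task loss plus an explicitly weighted cross-modal alignment term."""
    return task_loss + alignment_strength * info_nce(sim)
```

Setting `alignment_strength` to zero recovers purely task-driven training, while larger values force the two encoders toward shared structure, which is the knob whose optimum the paper relates to cross-modal redundancy.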

[307] Integrating Causal Inference with Graph Neural Networks for Alzheimer’s Disease Analysis

Pranay Kumar Peddi, Dhrubajyoti Ghosh

Main category: cs.LG

TL;DR: Causal-GCN is a causal graph convolutional framework that uses do-calculus-based back-door adjustment to identify brain regions with stable causal influence on Alzheimer’s disease progression, addressing confounding factors like age, sex, and APOE4 genotype.

DetailsMotivation: Most deep graph learning models for Alzheimer's disease classification from MRI remain correlational and confound demographic/genetic factors with disease-specific features, limiting causal interpretability.

Method: Represent MRI as structural connectomes with brain regions as nodes and anatomical connectivity as edges. Use principal components to summarize confounders (age, sex, APOE4) and include them in causal adjustment set. Simulate interventions on individual regions by severing incoming edges and altering node features to estimate average causal effects.

Result: Applied to 484 ADNI subjects, Causal-GCN achieves performance comparable to baseline GNNs while providing interpretable causal effect rankings that highlight posterior, cingulate, and insular hubs consistent with established AD neuropathology.

Conclusion: Causal-GCN successfully integrates causal inference with graph learning to identify brain regions with stable causal influence on AD progression, offering both competitive performance and enhanced interpretability.

Abstract: Deep graph learning has advanced Alzheimer’s disease (AD) classification from MRI, but most models remain correlational, confounding demographic and genetic factors with disease-specific features. We present Causal-GCN, an interventional graph convolutional framework that integrates do-calculus-based back-door adjustment to identify brain regions exerting stable causal influence on AD progression. Each subject’s MRI is represented as a structural connectome where nodes denote cortical and subcortical regions and edges encode anatomical connectivity. Confounders such as age, sex, and APOE4 genotype are summarized via principal components and included in the causal adjustment set. After training, interventions on individual regions are simulated by severing their incoming edges and altering node features to estimate average causal effects on disease probability. Applied to 484 subjects from the ADNI cohort, Causal-GCN achieves performance comparable to baseline GNNs while providing interpretable causal effect rankings that highlight posterior, cingulate, and insular hubs consistent with established AD neuropathology.
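
The simulated intervention step can be sketched on a toy graph. This is a simplified illustration under several assumptions: the adjacency matrix uses the convention `adj[src][dst]`, each node carries a single scalar feature, and `predict` is a hypothetical stand-in for the trained model's disease-probability output. The back-door adjustment over confounders is not modeled here.

```python
def intervene(adj, features, node, value):
    """Simulate do(x_node = value): cut incoming edges and fix the feature.

    Returns new copies; the inputs are left unmodified.
    """
    new_adj = [row[:] for row in adj]
    for src in range(len(new_adj)):
        new_adj[src][node] = 0.0   # sever every edge pointing into `node`
    new_features = list(features)
    new_features[node] = value
    return new_adj, new_features


def average_causal_effect(subjects, predict, node, high, low):
    """Mean difference in predicted probability between two interventions.

    subjects: list of (adjacency, features) pairs, one per subject.
    predict:  hypothetical callable (adjacency, features) -> probability.
    """
    diffs = []
    for adj, features in subjects:
        p_high = predict(*intervene(adj, features, node, high))
        p_low = predict(*intervene(adj, features, node, low))
        diffs.append(p_high - p_low)
    return sum(diffs) / len(diffs)
```

Repeating the calculation for every region and sorting by effect size would give a ranking of the kind the abstract describes.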

[308] How to Train Private Clinical Language Models: A Comparative Study of Privacy-Preserving Pipelines for ICD-9 Coding

Mathieu Dufour, Andrew Duncan

Main category: cs.LG

TL;DR: Knowledge distillation from DP-trained teachers outperforms other privacy methods for clinical diagnostic coding, recovering up to 63% of non-private performance while maintaining strong empirical privacy.

DetailsMotivation: Large language models risk exposing sensitive patient data, but current differential privacy methods severely degrade diagnostic accuracy needed for clinical deployment.

Method: Systematic comparison of four training pipelines for automated diagnostic coding using identical 1B-parameter models and matched privacy budgets to predict ICD-9 codes from hospital discharge summaries.

Result: At moderate and relaxed privacy budgets (ε ∈ {4, 6}), knowledge distillation from DP-trained teachers outperforms direct DP-SGD and DP-synthetic data training, recovering up to 63% of non-private performance with strong empirical privacy (membership-inference AUC ≈ 0.5).

Conclusion: Knowledge distillation is identified as the most practical route to privacy-preserving clinical NLP, with large differences in privacy-utility trade-off across architectures.

Abstract: Large language models trained on clinical text risk exposing sensitive patient information, yet differential privacy (DP) methods often severely degrade the diagnostic accuracy needed for deployment. Despite rapid progress in DP optimisation and text generation, it remains unclear which privacy-preserving strategy actually works best for clinical language tasks. We present the first systematic head-to-head comparison of four training pipelines for automated diagnostic coding from hospital discharge summaries. All pipelines use identical 1B-parameter models and matched privacy budgets to predict ICD-9 codes. At moderate and relaxed privacy budgets ($\varepsilon \in \{4, 6\}$), knowledge distillation from DP-trained teachers outperforms both direct DP-SGD and DP-synthetic data training, recovering up to 63% of the non-private performance whilst maintaining strong empirical privacy (membership-inference AUC $\approx$ 0.5). These findings expose large differences in the privacy-utility trade-off across architectures and identify knowledge distillation as the most practical route to privacy-preserving clinical NLP.
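
For context on the DP-SGD baseline named above, the core gradient step clips each per-example gradient and adds Gaussian noise before averaging. The sketch below shows that generic step only; it is not the paper's training code, it does no privacy accounting, and the parameter names `max_norm` and `noise_multiplier` are chosen for illustration.

```python
import math
import random


def clip(grad, max_norm):
    """Scale a gradient vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_gradient(per_example_grads, max_norm, noise_multiplier, rng=random):
    """Clip per-example gradients, sum, add Gaussian noise, then average.

    The noise standard deviation is noise_multiplier * max_norm, applied
    independently to each coordinate of the summed gradient.
    """
    clipped = [clip(g, max_norm) for g in per_example_grads]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    noisy = [s + rng.gauss(0.0, noise_multiplier * max_norm) for s in summed]
    return [x / len(clipped) for x in noisy]
```

The noise added at every step is what degrades accuracy at tight privacy budgets, which is the utility gap the compared pipelines try to close.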

[309] Knowledge Graphs as Structured Memory for Embedding Spaces: From Training Clusters to Explainable Inference

Artur A. Oliveira, Mateus Espadoto, Roberto M. Cesar, Roberto Hirata

Main category: cs.LG

TL;DR: Graph Memory (GM) is a structured non-parametric framework that augments embedding-based inference with relational memory over region-level prototypes, achieving competitive accuracy with better calibration and smoother decision boundaries using fewer samples.

DetailsMotivation: To bridge local evidence and global consistency in non-parametric learning by summarizing embedding space into structured prototypes with reliability indicators and relational edges, rather than treating training instances in isolation.

Method: GM summarizes embedding space into prototype nodes with reliability indicators, connected by edges encoding geometric and contextual relations. It unifies instance retrieval, prototype-based reasoning, and graph-based label propagation in a single inductive model.

Result: GM achieves accuracy competitive with kNN and Label Spreading while offering substantially better calibration and smoother decision boundaries, all with an order of magnitude fewer samples. Experiments on synthetic and real datasets including breast histopathology (IDC) demonstrate these benefits.

Conclusion: By explicitly modeling reliability and relational structure, GM provides a principled bridge between local evidence and global consistency in non-parametric learning, supporting both efficient inference and faithful explanation.

Abstract: We introduce Graph Memory (GM), a structured non-parametric framework that augments embedding-based inference with a compact, relational memory over region-level prototypes. Rather than treating each training instance in isolation, GM summarizes the embedding space into prototype nodes annotated with reliability indicators and connected by edges that encode geometric and contextual relations. This design unifies instance retrieval, prototype-based reasoning, and graph-based label propagation within a single inductive model that supports both efficient inference and faithful explanation. Experiments on synthetic and real datasets including breast histopathology (IDC) show that GM achieves accuracy competitive with $k$NN and Label Spreading while offering substantially better calibration and smoother decision boundaries, all with an order of magnitude fewer samples. By explicitly modeling reliability and relational structure, GM provides a principled bridge between local evidence and global consistency in non-parametric learning.
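The prototype idea can be illustrated with a small sketch. It assumes that region assignments are already available (for example from a clustering step) and uses label purity as the reliability indicator. Both choices, and all names below, are assumptions made for illustration. The sketch omits the relational edges and label propagation that the abstract describes, so it is not the authors' model.

```python
import math
from collections import Counter, defaultdict

def build_prototypes(points, labels, assignments):
    """Summarize each region as (centroid, majority label, reliability = label purity)."""
    groups = defaultdict(list)
    for point, label, region in zip(points, labels, assignments):
        groups[region].append((point, label))
    prototypes = []
    for members in groups.values():
        dim = len(members[0][0])
        centroid = [sum(p[i] for p, _ in members) / len(members) for i in range(dim)]
        label, count = Counter(y for _, y in members).most_common(1)[0]
        prototypes.append((centroid, label, count / len(members)))
    return prototypes

def predict(x, prototypes):
    """Vote over prototypes, weighting each by reliability and inverse distance."""
    scores = defaultdict(float)
    for centroid, label, reliability in prototypes:
        scores[label] += reliability / (1e-9 + math.dist(x, centroid))
    return max(scores, key=scores.get)

# Toy data (hypothetical): two well-separated regions.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
labels = ["a", "a", "b", "b"]
prototypes = build_prototypes(points, labels, assignments=[0, 0, 1, 1])
prediction = predict((0.2, 0.4), prototypes)
```

Inference here touches only the prototypes rather than every training instance, which is the source of the sample savings such a summary can offer.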

[310] IonCast: A Deep Learning Framework for Forecasting Ionospheric Dynamics

Halil S. Kelebek, Linnea M. Wolniewicz, Michael D. Vergalla, Simone Mestici, Giacomo Acciarini, Bala Poduval, Olga Verkhoglyadova, Madhulika Guhathakurta, Thomas E. Berger, Frank Soboczenski, Atılım Güneş Baydin

Main category: cs.LG

TL;DR: IonCast is a deep learning suite using GraphCast-inspired models to forecast global ionospheric Total Electron Content (TEC) with improved accuracy over persistence models, integrating multiple physical drivers and observational data.

DetailsMotivation: Accurate ionospheric forecasting is crucial for GNSS accuracy, high-frequency communications, and aviation operations, but current capabilities have gaps that need addressing.

Method: Uses GraphCast-inspired deep learning models with spatiotemporal learning, integrating diverse physical drivers and observational datasets through scalable graph-based approaches.

Result: Validation shows improved forecasting skill compared to persistence models during both storm-time and quiet ionospheric conditions.

Conclusion: IonCast demonstrates how machine learning can enhance physical understanding of ionospheric variability and improve operational space weather resilience by unifying heterogeneous data with scalable spatiotemporal learning.

Abstract: The ionosphere is a critical component of near-Earth space, shaping GNSS accuracy, high-frequency communications, and aviation operations. For these reasons, accurate forecasting and modeling of ionospheric variability has become increasingly relevant. To address this gap, we present IonCast, a suite of deep learning models that include a GraphCast-inspired model tailored for ionospheric dynamics. IonCast leverages spatiotemporal learning to forecast global Total Electron Content (TEC), integrating diverse physical drivers and observational datasets. Validating on held-out storm-time and quiet conditions highlights improved skill compared to persistence. By unifying heterogeneous data with scalable graph-based spatiotemporal learning, IonCast demonstrates how machine learning can augment physical understanding of ionospheric variability and advance operational space weather resilience.
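For readers unfamiliar with the persistence baseline mentioned above, the following sketch shows what it means: the forecast simply repeats the most recent observed map. The RMSE-based skill score and the toy values are assumptions for illustration, since the abstract does not state which metric was used.

```python
import math

def rmse(pred, truth):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth))

def persistence_forecast(history):
    """Persistence: predict that the next map equals the latest observed map."""
    return list(history[-1])

def skill_vs_persistence(model_pred, history, truth):
    """1 - RMSE_model / RMSE_persistence; positive values mean the model beats persistence."""
    baseline = rmse(persistence_forecast(history), truth)
    return 1.0 - rmse(model_pred, truth) / baseline

# Toy values in hypothetical TEC units, not data from the paper.
history = [[10.0, 12.0, 9.0], [11.0, 13.0, 10.0]]
truth = [13.0, 15.0, 12.0]
model_pred = [12.0, 14.0, 11.0]
skill = skill_vs_persistence(model_pred, history, truth)
```

A skill of 0 means the model only matches persistence, and a negative value means it is worse.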

[311] Simulated Human Learning in a Dynamic, Partially-Observed, Time-Series Environment

Jeffrey Jiang, Kevin Hong, Emily Kuczynski, Gregory Pottie

Main category: cs.LG

TL;DR: Reinforcement learning intelligent tutoring systems that combine individual student state estimation with population information through probing interventions, showing similar performance to heuristic approaches but struggling with harder classes.

DetailsMotivation: Intelligent tutoring systems need to personalize instruction for unique students in partially observable learning environments, requiring a balance between gathering information through interventions and avoiding disruption.

Method: Developed a dynamic time-series simulation environment for classrooms with various interventions (tutoring, lectures, exams), and implemented RL ITSs that use probing interventions to estimate student states while leveraging population information.

Result: RL algorithms and greedy heuristic approaches provide different solutions with similar performance. Probing interventions boost performance, especially in quiz and midterm course structures, but RL policies struggle with harder classes.

Conclusion: Probing interventions effectively reduce student estimation difficulty, and both heuristic and RL approaches show flexibility across different student populations, though additional information from frequent assessments provides significant benefits over finals-only structures.

Abstract: While intelligent tutoring systems (ITSs) can use information from past students to personalize instruction, each new student is unique. Moreover, the education problem is inherently difficult because the learning process is only partially observable. We therefore develop a dynamic, time-series environment to simulate a classroom setting, with student-teacher interventions, including tutoring sessions, lectures, and exams. In particular, we design the simulated environment to allow for varying levels of probing interventions that can gather more information. Then, we develop reinforcement learning ITSs that combine learning the individual state of students while pulling from population information through the use of probing interventions. These interventions can reduce the difficulty of student estimation, but also introduce a cost-benefit decision to find a balance between probing enough to get accurate estimates and probing so often that it becomes disruptive to the student. We compare the efficacy of standard RL algorithms with several greedy rules-based heuristic approaches to find that they provide different solutions, but with similar results. We also highlight the difficulty of the problem with increasing levels of hidden information, and the boost that we get if we allow for probing interventions. We show the flexibility of both heuristic and RL policies with regard to changing student population distributions, finding that both are flexible, but RL policies struggle to help harder classes. Finally, we test different course structures with non-probing policies and we find that our policies are able to boost the performance of quiz and midterm structures more than we can in a finals-only structure, highlighting the benefit of having additional information.

[312] FLARE: Adaptive Multi-Dimensional Reputation for Robust Client Reliability in Federated Learning

Abolfazl Younesi, Leon Kiss, Zahra Najafabadi Samani, Juan Aznar Poveda, Thomas Fahringer

Main category: cs.LG

TL;DR: FLARE is an adaptive reputation-based framework for federated learning that provides continuous multi-dimensional trust evaluation of clients, enabling robust defense against Byzantine attacks while maintaining model performance and convergence.

DetailsMotivation: Existing federated learning defenses use static thresholds and binary classification, making them vulnerable to evolving adversarial behaviors and failing to adapt to real-world deployment conditions.

Method: FLARE integrates multi-dimensional reputation scoring, self-calibrating adaptive thresholds, reputation-weighted aggregation with soft exclusion, and local differential privacy for secure client evaluation.

Result: FLARE maintains high model accuracy, converges faster than state-of-the-art methods, improves robustness by up to 16%, and preserves convergence within 30% of non-attacked baselines across MNIST, CIFAR-10, and SVHN datasets.

Conclusion: FLARE provides an effective adaptive defense framework for federated learning that balances security and performance while being resilient to sophisticated attacks like Statistical Mimicry.

Abstract: Federated learning (FL) enables collaborative model training while preserving data privacy. However, it remains vulnerable to malicious clients who compromise model integrity through Byzantine attacks, data poisoning, or adaptive adversarial behaviors. Existing defense mechanisms rely on static thresholds and binary classification, failing to adapt to evolving client behaviors in real-world deployments. We propose FLARE, an adaptive reputation-based framework that transforms client reliability assessment from binary decisions to a continuous, multi-dimensional trust evaluation. FLARE integrates: (i) a multi-dimensional reputation score capturing performance consistency, statistical anomaly indicators, and temporal behavior, (ii) a self-calibrating adaptive threshold mechanism that adjusts security strictness based on model convergence and recent attack intensity, (iii) reputation-weighted aggregation with soft exclusion to proportionally limit suspicious contributions rather than eliminating clients outright, and (iv) a Local Differential Privacy (LDP) mechanism enabling reputation scoring on privatized client updates. We further introduce a highly evasive Statistical Mimicry (SM) attack, a benchmark adversary that blends honest gradients with synthetic perturbations and persistent drift to remain undetected by traditional filters. Extensive experiments with 100 clients on MNIST, CIFAR-10, and SVHN demonstrate that FLARE maintains high model accuracy and converges faster than state-of-the-art Byzantine-robust methods under diverse attack types, including label flipping, gradient scaling, adaptive attacks, ALIE, and SM. FLARE improves robustness by up to 16% and preserves model convergence within 30% of the non-attacked baseline, while achieving strong malicious-client detection performance with minimal computational overhead. https://github.com/Anonymous0-0paper/FLARE
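The "soft exclusion" idea in item (iii) can be sketched as follows. The quadratic down-weighting rule, the threshold value, and all names are hypothetical, since the abstract does not give FLARE's actual weighting formula. The sketch only illustrates limiting a suspicious client's contribution proportionally instead of dropping it.

```python
def soft_weights(reputations, threshold):
    """Normalized weights; clients below the threshold are shrunk, not removed."""
    raw = []
    for r in reputations:
        if r >= threshold:
            raw.append(r)
        else:
            # Assumed rule: scale by r / threshold, so weight falls quadratically.
            raw.append(r * (r / threshold))
    total = sum(raw)
    if total == 0:
        raise ValueError("all client weights are zero")
    return [w / total for w in raw]

def aggregate(updates, weights):
    """Weighted average of client update vectors."""
    dim = len(updates[0])
    return [sum(w * u[i] for w, u in zip(weights, updates)) for i in range(dim)]

# Toy round (hypothetical): two trusted clients and one suspicious client.
reputations = [0.8, 0.8, 0.2]
updates = [[1.0, 1.0], [1.0, 1.0], [-5.0, -5.0]]
weights = soft_weights(reputations, threshold=0.4)
global_update = aggregate(updates, weights)
```

With plain reputation weighting the suspicious client would hold about 11% of the total weight. Under this assumed rule it holds about 6%, so its update still participates but moves the aggregate less.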

[313] Oversampling techniques for predicting COVID-19 patient length of stay

Zachariah Farahany, Jiawei Wu, K M Sajjadul Islam, Praveen Madiraju

Main category: cs.LG

TL;DR: Using EHR data and LOS as severity measure, this paper addresses COVID-19 severity prediction as an imbalanced classification problem through oversampling and ANN with Bayesian optimization.

DetailsMotivation: COVID-19 severity varies widely, with some patients experiencing lengthy hospital stays or death. Predicting severity using EHR data can help in resource allocation and treatment planning.

Method: Synthetic oversampling of training data to handle class imbalance, followed by Artificial Neural Network with hyperparameter tuning using Bayesian optimization.

Result: Model selection based on best F1 score, with evaluation and discussion of the selected model’s performance.

Conclusion: The approach effectively addresses the imbalanced classification problem in COVID-19 severity prediction using EHR data and advanced machine learning techniques.

Abstract: COVID-19 is a respiratory disease that caused a global pandemic in 2019. It is highly infectious and has the following symptoms: fever or chills, cough, shortness of breath, fatigue, muscle or body aches, headache, the new loss of taste or smell, sore throat, congestion or runny nose, nausea or vomiting, and diarrhea. These symptoms vary in severity; some people with many risk factors have been known to have lengthy hospital stays or die from the disease. In this paper, we analyze patients’ electronic health records (EHR) to predict the severity of their COVID-19 infection using the length of stay (LOS) as our measurement of severity. This is an imbalanced classification problem, as many people have a shorter LOS rather than a longer one. To combat this problem, we synthetically create alternate oversampled training data sets. Once we have this oversampled data, we run it through an Artificial Neural Network (ANN), which during training has its hyperparameters tuned using Bayesian optimization. We select the model with the best F1 score and then evaluate it and discuss it.
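As an illustration of oversampling for class imbalance, the sketch below duplicates minority-class rows at random until every class matches the largest one. This is the simplest variant and is shown only as an example. The abstract does not say which synthetic oversampling techniques were used, and the function name and toy data are hypothetical.

```python
import random
from collections import Counter

def random_oversample(features, labels, seed=0):
    """Duplicate minority-class rows at random until all classes are the same size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(features), list(labels)
    for cls, n in counts.items():
        pool = [x for x, y in zip(features, labels) if y == cls]
        for _ in range(target - n):
            out_x.append(rng.choice(pool))
            out_y.append(cls)
    return out_x, out_y

# Toy training split (hypothetical): four short stays, one long stay.
train_x = [[1.0], [2.0], [3.0], [4.0], [9.0]]
train_y = ["short", "short", "short", "short", "long"]
balanced_x, balanced_y = random_oversample(train_x, train_y)
```

Oversampling should be applied to the training split only, so that validation and test metrics such as F1 reflect the original class distribution.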

[314] Interpretable temporal fusion network of multi- and multi-class arrhythmia classification

Yun Kwan Kim

Main category: cs.LG

TL;DR: Proposed a framework for arrhythmia classification that combines local and global feature extraction with attention-based fusion to handle varying arrhythmia lengths, achieving superior performance on MIT-BIH databases.

DetailsMotivation: Existing clinical decision support systems struggle with varying arrhythmia lengths and onset times, which previous methods haven't adequately addressed.

Method: Framework with local and global feature extraction, followed by local-global information fusion using attention mechanisms to enable arrhythmia detection within constrained input lengths.

Result: Achieved F1-scores of 96.45% (duration), 82.05% (episode), 96.31% (Dice) on MITDB and 97.57%, 98.31%, 97.45% on AFDB, outperforming benchmark models with statistical significance.

Conclusion: The method effectively captures local and global information without significant loss, enabling accurate arrhythmia detection and precise timing determination for improved clinical treatment planning.

Abstract: Clinical decision support systems (CDSSs) have been widely utilized to support the decisions made by cardiologists when detecting and classifying arrhythmia from electrocardiograms. However, forming a CDSS for the arrhythmia classification task is challenging due to the varying lengths of arrhythmias. Although the onset time of arrhythmia varies, previously developed methods have not considered such conditions. Thus, we propose a framework that consists of (i) local and global extraction and (ii) local-global information fusion with attention to enable arrhythmia detection and classification within a constrained input length. The framework’s performance was evaluated in terms of 10-class and 4-class arrhythmia detection, focusing on identifying the onset and ending point of arrhythmia episodes and their duration using the MIT-BIH arrhythmia database (MITDB) and the MIT-BIH atrial fibrillation database (AFDB). Duration, episode, and Dice score performances resulted in overall F1-scores of 96.45%, 82.05%, and 96.31% on the MITDB and 97.57%, 98.31%, and 97.45% on the AFDB, respectively. The results demonstrated statistically superior performance compared to those of the benchmark models. To assess the generalization capability of the proposed method, an MITDB-trained model and an MIT-BIH malignant ventricular arrhythmia database-trained model were tested on the AFDB and MITDB, respectively. Superior performance was attained compared with that of a state-of-the-art model. The proposed method effectively captures both local and global information and dynamics without significant information loss. Consequently, arrhythmias can be detected with greater accuracy, and their occurrence times can be precisely determined, enabling the clinical field to develop more accurate treatment plans based on the proposed method.
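The Dice score reported above measures the overlap between predicted and annotated arrhythmia segments. A minimal sketch on per-sample binary masks follows. The mask representation and the toy values are assumptions, and the paper's exact evaluation protocol may differ.

```python
def dice_score(pred_mask, true_mask):
    """Dice = 2 * |A ∩ B| / (|A| + |B|) for two equal-length 0/1 masks."""
    overlap = sum(1 for p, t in zip(pred_mask, true_mask) if p == 1 and t == 1)
    total = sum(pred_mask) + sum(true_mask)
    # Two empty masks agree perfectly, so define Dice as 1.0 in that case.
    return 1.0 if total == 0 else 2.0 * overlap / total

# Toy masks (hypothetical): the predicted episode starts one sample late.
true_mask = [0, 1, 1, 1, 0, 0]
pred_mask = [0, 0, 1, 1, 1, 0]
score = dice_score(pred_mask, true_mask)
```

Here two of the three annotated samples overlap with the prediction, giving a Dice score of 2/3. This shows how onset and offset errors lower the score even when the episode itself is detected.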

[315] Deep Pathomic Learning Defines Prognostic Subtypes and Molecular Drivers in Colorectal Cancer

Zisong Wang, Xuanyu Wang, Hang Chen, Haizhou Wang, Yuxin Chen, Yihang Xu, Yunhe Yuan, Lihuan Luo, Xitong Ling, Xiaoping Liu

Main category: cs.LG

TL;DR: Developed TDAM-CRC, a multiple instance learning model using histopathological images for colorectal cancer prognosis, validated across cohorts and integrated with multi-omics data to identify MRPL37 as a key prognostic biomarker.

DetailsMotivation: Conventional TNM staging is inadequate for personalized CRC medicine due to tumor heterogeneity, requiring more precise prognostic stratification tools.

Method: Multiple instance learning model trained on TCGA cohort (n=581), validated externally (n=1031), integrated multi-omics data for interpretability, and identified biomarkers through interaction network analysis.

Result: TDAM-CRC achieved robust risk stratification, outperformed clinical staging and state-of-the-art models, identified MRPL37 as key hub gene linked to favorable prognosis via promoter hypomethylation, and created clinical nomogram.

Conclusion: AI-driven TDAM-CRC provides improved CRC risk stratification, reveals molecular targets, and facilitates personalized clinical decision-making.

Abstract: Precise prognostic stratification of colorectal cancer (CRC) remains a major clinical challenge due to its high heterogeneity. The conventional TNM staging system is inadequate for personalized medicine. We aimed to develop and validate a novel multiple instance learning model TDAM-CRC using histopathological whole-slide images for accurate prognostic prediction and to uncover its underlying molecular mechanisms. We trained the model on the TCGA discovery cohort (n=581), validated it in an independent external cohort (n=1031), and further integrated multi-omics data to improve model interpretability and identify novel prognostic biomarkers. The results demonstrated that the TDAM-CRC achieved robust risk stratification in both cohorts. Its predictive performance significantly outperformed the conventional clinical staging system and multiple state-of-the-art models. The TDAM-CRC risk score was confirmed as an independent prognostic factor in multivariable analysis. Multi-omics analysis revealed that the high-risk subtype is closely associated with metabolic reprogramming and an immunosuppressive tumor microenvironment. Through interaction network analysis, we identified and validated Mitochondrial Ribosomal Protein L37 (MRPL37) as a key hub gene linking deep pathomic features to clinical prognosis. We found that high expression of MRPL37, driven by promoter hypomethylation, serves as an independent biomarker of favorable prognosis. Finally, we constructed a nomogram incorporating the TDAM-CRC risk score and clinical factors to provide a precise and interpretable clinical decision-making tool for CRC patients. Our AI-driven pathological model TDAM-CRC provides a robust tool for improved CRC risk stratification, reveals new molecular targets, and facilitates personalized clinical decision-making.

[316] Fourier-KAN-Mamba: A Novel State-Space Equation Approach for Time-Series Anomaly Detection

Xiancheng Wang, Lin Wang, Rui Wang, Zhibo Zhang, Minghang Zhao

Main category: cs.LG

TL;DR: Fourier-KAN-Mamba: A hybrid architecture combining Fourier layers, KAN networks, and Mamba state-space models for improved time-series anomaly detection.

DetailsMotivation: Mamba-based models show efficiency in long-sequence modeling but struggle with complex temporal patterns and nonlinear dynamics in anomaly detection tasks.

Method: Integrates Fourier layer for multi-scale frequency features, KAN for nonlinear representation, and temporal gating control mechanism to distinguish normal/anomalous patterns.

Result: Extensive experiments on MSL, SMAP, and SWaT datasets show significant performance improvements over state-of-the-art approaches.

Conclusion: The proposed hybrid architecture effectively enhances time-series anomaly detection by combining frequency analysis, nonlinear modeling, and selective state-space processing.

Abstract: Time-series anomaly detection plays a critical role in numerous real-world applications, including industrial monitoring and fault diagnosis. Recently, Mamba-based state-space models have shown remarkable efficiency in long-sequence modeling. However, directly applying Mamba to anomaly detection tasks still faces challenges in capturing complex temporal patterns and nonlinear dynamics. In this paper, we propose Fourier-KAN-Mamba, a novel hybrid architecture that integrates a Fourier layer, Kolmogorov-Arnold Networks (KAN), and the Mamba selective state-space model. The Fourier layer extracts multi-scale frequency features, KAN enhances nonlinear representation capability, and a temporal gating control mechanism further improves the model’s ability to distinguish normal and anomalous patterns. Extensive experiments on MSL, SMAP, and SWaT datasets demonstrate that our method significantly outperforms existing state-of-the-art approaches.

Keywords: time-series anomaly detection, state-space model, Mamba, Fourier transform, Kolmogorov-Arnold Network
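To illustrate what "frequency features" of a time-series window look like, the sketch below computes a plain discrete Fourier transform magnitude spectrum. This is a generic DFT and not the paper's Fourier layer, whose design is not described in the abstract. The function name and the toy signal are hypothetical.

```python
import cmath
import math

def magnitude_spectrum(window):
    """Magnitudes of the non-negative-frequency DFT bins of a real-valued window."""
    n = len(window)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(window))
        spectrum.append(abs(coeff))
    return spectrum

# Toy window (hypothetical): a sine completing two cycles over eight samples.
window = [math.sin(2 * math.pi * 2 * t / 8) for t in range(8)]
spectrum = magnitude_spectrum(window)
dominant_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
```

The energy concentrates in bin 2, matching the two cycles in the window. A window whose spectrum departs from the usual pattern is one simple signal an anomaly detector could use.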

[317] Semiconductor Industry Trend Prediction with Event Intervention Based on LSTM Model in Sentiment-Enhanced Time Series Data

Wei-hsiang Yen, Lyn Chao-ling Chen

Main category: cs.LG

TL;DR: Integration of deep learning and sentiment analysis for TSMC’s semiconductor industry trend prediction using LSTM model with sentiment-enhanced time series data.

DetailsMotivation: Traditional data analysis methods perform poorly with high variety and time series data in rapidly changing semiconductor markets. Need to incorporate both internal company events and external global events for better industry trend forecasting.

Method: Collected textual and time series data from TSMC seasonal reports. Applied sentiment analysis considering internal and external event interventions. Used LSTM model with sentiment-enhanced time series data for industry trend prediction.

Result: Prediction results revealed significant wafer technology development at TSMC and potential global market threats, matching actual product release news and international events.

Conclusion: The approach successfully predicts semiconductor industry trends by considering both internal and external events, providing valuable insights for research and business applications in the semiconductor sector.

Abstract: The innovation of this study is the integration of deep learning and sentiment analysis into traditional business model analysis and forecasting, with TSMC as the research subject for trend prediction in Taiwan’s semiconductor industry. Given the rapid market changes and the development of wafer technologies in the semiconductor industry, traditional data analysis methods do not perform well on high-variety and time series data. Textual data and time series data, including financial information, were collected from TSMC’s seasonal reports. The textual data were processed through sentiment analysis that considers event interventions from both internal company events and external global events. Using the sentiment-enhanced time series data, an LSTM model was adopted to predict the industry trend of TSMC. The prediction results reveal significant development of TSMC’s wafer technology and potential threats in the global market, and they match TSMC’s product release news and international news. The work contributes accurate industry trend prediction for the semiconductor industry by considering both internal and external event interventions, and the prediction results provide valuable information on the semiconductor industry for both research and business.

[318] Efficient RF Passive Components Modeling with Bayesian Online Learning and Uncertainty Aware Sampling

Huifan Zhang, Pingqiang Zhou

Main category: cs.LG

TL;DR: Bayesian online learning framework for efficient RF passive components modeling with 35x speedup using only 2.86% EM simulation time.

DetailsMotivation: Traditional machine learning modeling of RF passive components requires extensive EM simulations, creating computational bottlenecks.

Method: Bayesian neural network with reconfigurable heads for joint geometric-frequency domain modeling with uncertainty quantification, plus adaptive sampling strategy using uncertainty guidance.

Result: Achieved accurate modeling with only 2.86% EM simulation time compared to traditional ML-based flow, providing 35 times speedup.

Conclusion: The uncertainty-aware Bayesian online learning framework enables efficient parametric modeling of RF passive components with significant computational savings.

Abstract: Conventional radio frequency (RF) passive components modeling based on machine learning requires extensive electromagnetic (EM) simulations to cover geometric and frequency design spaces, creating computational bottlenecks. In this paper, we introduce an uncertainty-aware Bayesian online learning framework for efficient parametric modeling of RF passive components, which includes: 1) a Bayesian neural network with reconfigurable heads for joint geometric-frequency domain modeling while quantifying uncertainty; 2) an adaptive sampling strategy that simultaneously optimizes training data sampling across geometric parameters and frequency domain using uncertainty guidance. Validated on three RF passive components, the framework achieves accurate modeling while using only 2.86% EM simulation time compared to traditional ML-based flow, achieving a 35 times speedup.
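Two points in the abstract can be checked with a short sketch. First, using 2.86% of the simulation time corresponds to roughly a 35 times speedup. Second, uncertainty-guided sampling can be illustrated by choosing the candidate design whose repeated stochastic predictions disagree most. The variance criterion, the candidate names, and the toy numbers are assumptions for illustration and are not the paper's acquisition rule.

```python
from statistics import pvariance

def pick_next_sample(candidates, stochastic_predictions):
    """Return the candidate with the largest predictive variance, and that variance."""
    variances = [pvariance(preds) for preds in stochastic_predictions]
    best = max(range(len(candidates)), key=variances.__getitem__)
    return candidates[best], variances[best]

# Speedup implied by spending only 2.86% of the baseline EM simulation time.
speedup = 1 / 0.0286

# Toy candidates (hypothetical): each row holds several sampled model outputs,
# for example draws from a Bayesian neural network's posterior.
candidates = ["geom_a", "geom_b", "geom_c"]
stochastic_predictions = [
    [1.0, 1.1, 0.9],
    [2.0, 3.0, 1.0],
    [0.5, 0.5, 0.5],
]
chosen, variance = pick_next_sample(candidates, stochastic_predictions)
```

The next EM simulation is then run only for the chosen candidate, which is how uncertainty guidance can reduce the total number of simulations.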

[319] Novel sparse matrix algorithm expands the feasible size of a self-organizing map of the knowledge indexed by a database of peer-reviewed medical literature

Andrew Amos, Joanne Lee, Tarun Sen Gupta, Bunmi S. Malau-Aduli

Main category: cs.LG

TL;DR: Developed a novel sparse matrix multiplication algorithm to create a self-organizing map of the entire Medline database, overcoming previous memory and processing limitations.

DetailsMotivation: Previous attempts to map Medline were limited to small subsets due to exponentially increasing computational demands of existing algorithms.

Method: Designed a novel algorithm for sparse matrix multiplication that enables application of self-organizing maps to the entire Medline dataset.

Result: Successfully created a more complete map of existing medical knowledge and made it feasible to refine the map over time as the dataset changes.

Conclusion: The new algorithm overcomes previous computational limitations and enables comprehensive mapping and continuous refinement of medical knowledge from the complete Medline database.

Abstract: Past efforts to map the Medline database have been limited to small subsets of the available data because of the exponentially increasing memory and processing demands of existing algorithms. We designed a novel algorithm for sparse matrix multiplication that allowed us to apply a self-organizing map to the entire Medline dataset, allowing for a more complete map of existing medical knowledge. The algorithm also increases the feasibility of refining the self-organizing map to account for changes in the dataset over time.
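The abstract does not describe the new algorithm, so the following is only a sketch of a well-known way sparsity helps when training a self-organizing map. The squared distance between a sparse document vector x and a dense unit weight vector w expands to ||w||² − 2 x·w + ||x||², and the dot product only needs the document's nonzero entries. The data layout and names are hypothetical.

```python
def best_matching_unit(sparse_doc, unit_weights):
    """Find the closest SOM unit to a sparse vector given as {index: value}."""
    x_norm = sum(v * v for v in sparse_doc.values())
    best_index, best_dist = None, float("inf")
    for index, w in enumerate(unit_weights):
        # In practice ||w||^2 would be cached and updated incrementally.
        w_norm = sum(v * v for v in w)
        # The dot product touches only the nonzero entries of the document.
        dot = sum(v * w[j] for j, v in sparse_doc.items())
        dist = w_norm - 2.0 * dot + x_norm
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index, best_dist

# Toy example (hypothetical): one document with two nonzero term weights.
doc = {0: 1.0, 3: 2.0}
units = [
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 2.0],
    [0.0, 1.0, 0.0, 0.0],
]
bmu, distance = best_matching_unit(doc, units)
```

With cached unit norms, the per-unit cost scales with the number of nonzero terms in the document rather than with the vocabulary size.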

[320] From Solving to Verifying: A Unified Objective for Robust Reasoning in LLMs

Xiaoxuan Wang, Bo Liu, Song Jiang, Jingzhou Liu, Jingyuan Qi, Xia Chen, Baosheng He

Main category: cs.LG

TL;DR: GRPO-Verif is a reinforcement learning algorithm that jointly optimizes solution generation and self-verification in LLMs through a unified loss function with adjustable verification weight.

DetailsMotivation: LLMs struggle to consistently verify their own reasoning traces despite improved reasoning capabilities through RL, raising the question of whether enhancing self-verification can further improve reasoning performance.

Method: Proposed GRPO-Verif algorithm with unified loss function that jointly optimizes solution generation and self-verification, featuring an adjustable hyperparameter to control verification signal weight.

Result: Experimental results show enhanced self-verification capability while maintaining comparable reasoning performance.

Conclusion: Joint optimization of solution generation and self-verification through GRPO-Verif effectively enhances LLMs’ self-verification ability without compromising reasoning performance.

Abstract: The reasoning capabilities of large language models (LLMs) have been significantly improved through reinforcement learning (RL). Nevertheless, LLMs still struggle to consistently verify their own reasoning traces. This raises the research question of how to enhance the self-verification ability of LLMs and whether such an ability can further improve reasoning performance. In this work, we propose GRPO-Verif, an algorithm that jointly optimizes solution generation and self-verification within a unified loss function, with an adjustable hyperparameter controlling the weight of the verification signal. Experimental results demonstrate that our method enhances self-verification capability while maintaining comparable performance in reasoning.
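A rough sketch of the two ingredients named above follows. It assumes the unified objective is an additive combination with weight `alpha`, and it uses the group-relative advantage normalization familiar from GRPO. The additive form, the use of the population standard deviation, and all names are assumptions, since the abstract does not give the exact loss.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its sampled group: (r - mean) / (std + eps)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def unified_loss(solution_loss, verification_loss, alpha):
    """Assumed form: solution loss plus alpha times the verification loss."""
    return solution_loss + alpha * verification_loss

# Toy group (hypothetical): four sampled solutions, two judged correct.
solution_adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# Toy verification rewards: did the model judge each of its solutions correctly?
verification_adv = group_relative_advantages([1.0, 1.0, 0.0, 1.0])
loss = unified_loss(solution_loss=1.0, verification_loss=0.5, alpha=0.2)
```

Setting `alpha` to 0 recovers a solution-only objective, and larger values put more weight on the verification signal.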

[321] Cross-Modal Consistency-Guided Active Learning for Affective BCI Systems

Hyo-Jeong Jang, Hye-Bin Shin, Kang Yin

Main category: cs.LG

TL;DR: An uncertainty-aware active learning framework for EEG-based emotion recognition that uses cross-modal consistency to distinguish between cognitive ambiguity and sensor noise, improving robustness to label noise.

DetailsMotivation: EEG signals are often corrupted by artifacts and individual variability, while emotional labels are subjective and inconsistent, making robust affective decoding challenging. Traditional methods struggle with label noise and data quality issues.

Method: Proposes a framework that leverages model uncertainty and cross-modal consistency between EEG and face features. A representation alignment module embeds both modalities into a shared latent space, and residual discrepancies are used to identify noise-induced inconsistencies for selective oracle feedback during active learning.

Result: Experiments on the ASCERTAIN dataset demonstrate the method’s efficiency and robustness, showing improved performance in EEG-based affective decoding while being data-efficient and noise-tolerant.

Conclusion: The framework provides a promising approach for EEG-based emotion recognition in brain-computer interface systems by effectively handling label noise and improving data efficiency through uncertainty-aware active learning.

Abstract: Deep learning models perform best with abundant, high-quality labels, yet such conditions are rarely achievable in EEG-based emotion recognition. Electroencephalogram (EEG) signals are easily corrupted by artifacts and individual variability, while emotional labels often stem from subjective and inconsistent reports, making robust affective decoding particularly difficult. We propose an uncertainty-aware active learning framework that enhances robustness to label noise by jointly leveraging model uncertainty and cross-modal consistency. Instead of relying solely on EEG-based uncertainty estimates, the method evaluates cross-modal alignment to determine whether uncertainty originates from cognitive ambiguity or sensor noise. A representation alignment module embeds EEG and face features into a shared latent space, enforcing semantic coherence between modalities. Residual discrepancies are treated as noise-induced inconsistencies, and these samples are selectively queried for oracle feedback during active learning. This feedback-driven process guides the network toward reliable, informative samples and reduces the impact of noisy labels. Experiments on the ASCERTAIN dataset examine the efficiency and robustness of our method, highlighting its potential as a data-efficient and noise-tolerant approach for EEG-based affective decoding in brain-computer interface systems.
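The query rule described above can be sketched as follows: a sample is sent to the oracle when the model is uncertain and its EEG and face embeddings disagree in the shared space. The use of entropy, cosine discrepancy, and the two thresholds are assumptions for illustration, since the abstract does not specify the scoring. Embeddings are assumed to be nonzero vectors.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def cosine_discrepancy(a, b):
    """1 - cosine similarity between two nonzero embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def select_queries(samples, min_entropy, min_discrepancy):
    """Indices of samples that are both uncertain and cross-modally inconsistent."""
    picked = []
    for index, (probs, eeg_z, face_z) in enumerate(samples):
        if (entropy(probs) >= min_entropy
                and cosine_discrepancy(eeg_z, face_z) >= min_discrepancy):
            picked.append(index)
    return picked

# Toy pool (hypothetical): (class probabilities, EEG embedding, face embedding).
pool = [
    ([0.5, 0.5], [1.0, 0.0], [0.0, 1.0]),    # uncertain, modalities disagree
    ([0.5, 0.5], [1.0, 0.0], [1.0, 0.0]),    # uncertain, modalities agree
    ([0.99, 0.01], [1.0, 0.0], [0.0, 1.0]),  # confident, modalities disagree
]
queries = select_queries(pool, min_entropy=0.5, min_discrepancy=0.5)
```

Only the first sample is queried. The second is uncertain but consistent across modalities, which this sketch reads as genuine ambiguity rather than sensor noise.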

[322] Complex variational autoencoders admit Kähler structure

Andrew Gracyk

Main category: cs.LG

TL;DR: Complex VAEs exhibit Kähler geometric structure, with Fisher information metric derived for complex latent spaces. A computationally efficient Kähler potential is proposed that approximates the Fisher metric while preserving geometric structure.

DetailsMotivation: To extend Riemannian structure analysis from real-valued VAEs to complex VAEs, revealing Kähler geometry in complex latent spaces and developing efficient computational methods.

Method: Derived Fisher information metric for complex VAEs with complex Gaussian regularization, proposed Kähler potential derivative for complex Gaussian mixtures, and developed efficient computation via plurisubharmonic functions.

Result: Achieved Kähler potential relation under relative entropy, enabled efficient metric computation without large-scale automatic differentiation, and demonstrated smoother latent representations with fewer semantic outliers.

Conclusion: Complex VAEs naturally admit Kähler geometry, and the proposed methods provide efficient computational frameworks for regularizing latent spaces while preserving geometric structure.

Abstract: It has been discovered that latent-Euclidean variational autoencoders (VAEs) admit, in various capacities, Riemannian structure. We adapt these arguments to complex VAEs with a complex latent stage. We show that complex VAEs reveal, to some level, Kähler geometric structure. Our methods are tailored to decoder geometry. We derive the Fisher information metric in the complex case under a latent complex Gaussian regularization with trivial relation matrix. It is well known from statistical information theory that the Fisher information coincides with the Hessian of the Kullback-Leibler (KL) divergence. Thus, the metric Kähler potential relation is exactly achieved under relative entropy. We propose a Kähler potential derivative of complex Gaussian mixtures that has rough equivalence to the Fisher information metric while still being faithful to the underlying Kähler geometry. Computation of the metric via this potential is efficient, and through our potential, valid as a plurisubharmonic (PSH) function, the large-scale computational burden of automatic differentiation is displaced to small scale. We show that we can regularize the latent space with decoder geometry, and that we can sample in accordance with a weighted complex volume element. We demonstrate that these strategies, in exchange for sample variation, yield consistently smoother representations and fewer semantic outliers.
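
The two identities the abstract leans on can be written out as follows. This is a sketch in assumed notation (the symbols \theta, z, and K are not taken from the paper): the first line is the standard statement that the Fisher metric is the Hessian of the KL divergence at coinciding parameters, and the second is the usual definition of a Kähler metric generated by a real potential K. A plurisubharmonic K is one whose complex Hessian is positive semidefinite, which is what lets it serve as a potential.

```latex
% Fisher information metric as the Hessian of the KL divergence
g_{ij}(\theta) = \left.\frac{\partial^2}{\partial\theta'_i\,\partial\theta'_j}
  D_{\mathrm{KL}}\!\left(p_\theta \,\|\, p_{\theta'}\right)\right|_{\theta'=\theta}

% Kähler metric generated by a real potential K(z, \bar{z})
g_{i\bar{j}}(z) = \frac{\partial^2 K}{\partial z^i\,\partial \bar{z}^j}
```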

[323] FaultDiffusion: Few-Shot Fault Time Series Generation with Diffusion Model

Yi Xu, Zhigang Chen, Rui Wang, Yangfan Li, Fengxiao Tang, Ming Zhao, Jiaqi Liu

Main category: cs.LG

TL;DR: Proposes a diffusion-based few-shot fault time-series generation framework with positive-negative difference adapter and diversity loss to address data scarcity in industrial fault diagnosis.

DetailsMotivation: Fault diagnosis is crucial for industrial equipment reliability but limited by scarce fault data due to rare fault events and high annotation costs. Existing models struggle with few-shot fault generation due to domain gaps and high intra-class variability.

Method: Uses diffusion models with positive-negative difference adapter to leverage pre-trained normal data distributions for modeling fault domain discrepancies, and introduces diversity loss to prevent mode collapse through inter-sample difference regularization.

Result: Significantly outperforms traditional methods in authenticity and diversity, achieving state-of-the-art performance on key benchmarks.

Conclusion: The proposed framework effectively addresses few-shot fault time-series generation challenges and enables more reliable fault diagnosis in industrial monitoring systems.

Abstract: In industrial equipment monitoring, fault diagnosis is critical for ensuring system reliability and enabling predictive maintenance. However, the scarcity of fault data, due to the rarity of fault events and the high cost of data annotation, significantly hinders data-driven approaches. Existing time-series generation models, optimized for abundant normal data, struggle to capture fault distributions in few-shot scenarios, producing samples that lack authenticity and diversity due to the large domain gap and high intra-class variability of faults. To address this, we propose a novel few-shot fault time-series generation framework based on diffusion models. Our approach employs a positive-negative difference adapter, leveraging pre-trained normal data distributions to model the discrepancies between normal and fault domains for accurate fault synthesis. Additionally, a diversity loss is introduced to prevent mode collapse, encouraging the generation of diverse fault samples through inter-sample difference regularization. Experimental results demonstrate that our model significantly outperforms traditional methods in authenticity and diversity, achieving state-of-the-art performance on key benchmarks.

[324] Vehicle Routing Problems via Quantum Graph Attention Network Deep Reinforcement Learning

Le Tung Giang, Vu Hoang Viet, Nguyen Xuan Tung, Trinh Van Chien, Won-Joo Hwang

Main category: cs.LG

TL;DR: Proposes a Quantum Graph Attention Network (Q-GAT) that uses parameterized quantum circuits to replace MLPs in DRL for vehicle routing, reducing parameters by 50% while improving performance.

DetailsMotivation: Classical DRL models for VRP rely on large MLPs that are parameter-heavy and memory-bound, limiting scalability for large-scale routing problems.

Method: Hybrid quantum-classical approach using parameterized quantum circuits (PQCs) to replace MLPs at critical readout stages within a Graph Attention Network framework, trained with proximal policy optimization (PPO) and greedy/stochastic decoding.

Result: Reduces trainable parameters by more than 50%, achieves faster convergence, and reduces routing cost by about 5% compared to classical GAT baselines on VRP benchmarks.

Conclusion: PQC-enhanced GNNs show potential as compact and effective solvers for large-scale routing and logistics optimization problems.

Abstract: The vehicle routing problem (VRP) is a fundamental NP-hard task in intelligent transportation systems with broad applications in logistics and distribution. Deep reinforcement learning (DRL) with Graph Neural Networks (GNNs) has shown promise, yet classical models rely on large multi-layer perceptrons (MLPs) that are parameter-heavy and memory-bound. We propose a Quantum Graph Attention Network (Q-GAT) within a DRL framework, where parameterized quantum circuits (PQCs) replace conventional MLPs at critical readout stages. The hybrid model maintains the expressive capacity of graph attention encoders while reducing trainable parameters by more than 50%. Using proximal policy optimization (PPO) with greedy and stochastic decoding, experiments on VRP benchmarks show that Q-GAT achieves faster convergence and reduces routing cost by about 5% compared with classical GAT baselines. These results demonstrate the potential of PQC-enhanced GNNs as compact and effective solvers for large-scale routing and logistics optimization.

[325] Masked Auto-Regressive Variational Acceleration: Fast Inference Makes Practical Reinforcement Learning

Yuxuan Gu, Weimin Bai, Yifei Wang, Weijian Luo, He Sun

Main category: cs.LG

TL;DR: MARVAL accelerates masked auto-regressive diffusion models by distilling the diffusion chain into a single generation step, enabling 30x speedup while maintaining quality and enabling practical RL post-training.

DetailsMotivation: Vanilla masked auto-regressive diffusion models suffer from slow inference due to hierarchical structure (outer AR loop + inner diffusion chain), which limits efficiency and makes RL post-training impractical.

Method: Proposes MARVAL framework with: (1) score-based variational objective for distilling diffusion models into single-step generation, and (2) MARVAL-RL for efficient reinforcement learning of masked auto-regressive models.

Result: On ImageNet 256×256: MARVAL-Huge achieves FID 2.00 with 30x speedup vs MAR-diffusion; MARVAL-RL improves CLIP and image-reward scores on datasets with entity names.

Conclusion: MARVAL provides the first practical path for distillation and RL of masked auto-regressive diffusion models, enabling fast sampling and better preference alignment.

Abstract: Masked auto-regressive diffusion models (MAR) benefit from the expressive modeling ability of diffusion models and the flexibility of masked auto-regressive ordering. However, vanilla MAR suffers from slow inference due to its hierarchical inference mechanism: an outer AR unmasking loop and an inner diffusion denoising chain. Such a decoupled structure not only harms generation efficiency but also hinders the practical use of MAR for reinforcement learning (RL), an increasingly critical paradigm for generative model post-training. To address this fundamental issue, we introduce MARVAL (Masked Auto-regressive Variational Acceleration), a distillation-based framework that compresses the diffusion chain into a single AR generation step while preserving the flexible auto-regressive unmasking order. Such a distillation with MARVAL not only yields substantial inference acceleration but, crucially, makes RL post-training with verifiable rewards practical, resulting in scalable yet human-preferred fast generative models. Our contributions are twofold: (1) a novel score-based variational objective for distilling masked auto-regressive diffusion models into a single generation step without sacrificing sample quality; and (2) an efficient RL framework for masked auto-regressive models via MARVAL-RL. On ImageNet 256×256, MARVAL-Huge achieves an FID of 2.00 with a more than 30-fold speedup compared with MAR-diffusion, and MARVAL-RL yields consistent improvements in CLIP and image-reward scores on ImageNet datasets with entity names. In conclusion, MARVAL demonstrates the first practical path to distillation and RL of masked auto-regressive diffusion models, enabling fast sampling and better preference alignment.

[326] Reasoning in Diffusion Large Language Models is Concentrated in Dynamic Confusion Zones

Ranfei Chen, Ming Chen, Kaifei Wang

Main category: cs.LG

TL;DR: ATPO is a lightweight RL method that dynamically focuses gradient updates on high-leverage “zones of confusion” in diffusion LLM trajectories, improving reasoning accuracy without changing compute budget.

DetailsMotivation: Existing RL methods treat all denoising steps equally, but analysis reveals structured uncertainty patterns where certain steps are more critical for final success.

Method: Proposes Adaptive Trajectory Policy Optimization (ATPO) using step-level metrics (entropy, Confidence-Margin uncertainty, Rate of Entropy Change) to identify and prioritize high-leverage steps for gradient updates.

Result: ATPO delivers substantial gains in reasoning accuracy and training stability across benchmarks while using the same compute budget as uniform methods.

Conclusion: Exploiting trajectory dynamics through step-selection strategies is key to advancing diffusion LLM reinforcement learning.

Abstract: Diffusion Large Language Models (dLLMs) are rapidly emerging alongside autoregressive models as a powerful paradigm for complex reasoning, with reinforcement learning increasingly used for downstream alignment. Existing trajectory-based RL methods uniformly allocate policy gradients across denoising steps, implicitly treating all steps as equally important. We challenge this assumption by analyzing trajectories with several step-level metrics: entropy-based uncertainty, Confidence-Margin (CM) uncertainty, and Rate of Entropy Change (RoEC). These reveal structured “zones of confusion”: transient spikes in uncertainty and instability that strongly predict final success or failure, while most steps remain stable. We propose Adaptive Trajectory Policy Optimization (ATPO), a lightweight step-selection strategy that dynamically reallocates gradient updates to these high-leverage steps without changing the RL objective, rewards, or compute budget. Using a hybrid RoEC+CM rule, ATPO delivers substantial gains in reasoning accuracy and training stability across benchmarks, showing that exploiting trajectory dynamics is key to advancing dLLM RL.
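
The step-selection idea can be illustrated with a toy sketch. Everything here is an assumption made for illustration, not the paper's definitions: each denoising step is reduced to a single probability distribution, the confidence margin is taken as the gap between the two largest probabilities, the rate of entropy change is the absolute difference between consecutive step entropies, and the hybrid score simply adds the two signals. The function names are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def confidence_margin(probs):
    """Gap between the two largest probabilities; a small gap means uncertainty."""
    top = sorted(probs, reverse=True)
    return top[0] - top[1]


def select_steps(step_probs, k):
    """Return the indices of the k steps with the highest hybrid score.

    step_probs holds one distribution per denoising step (a simplification:
    a real dLLM has one distribution per token per step). The score
    |entropy change| + (1 - margin) is an illustrative choice only.
    """
    ents = [entropy(p) for p in step_probs]
    scores = []
    for t, p in enumerate(step_probs):
        roec = abs(ents[t] - ents[t - 1]) if t > 0 else 0.0
        scores.append(roec + (1.0 - confidence_margin(p)))
    ranked = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)
    return sorted(ranked[:k])


# Toy trajectory: confident, then a spike of uncertainty, then stable again.
trajectory = [
    [0.97, 0.02, 0.01],
    [0.40, 0.35, 0.25],
    [0.96, 0.03, 0.01],
    [0.98, 0.01, 0.01],
]
chosen = select_steps(trajectory, k=2)
```

In this toy trajectory the selected steps are the uncertainty spike and the step right after it, where entropy changes fastest. A trainer following the abstract's description would then apply policy-gradient updates only at the chosen steps, leaving the objective and rewards unchanged.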

[327] D2D Power Allocation via Quantum Graph Neural Network

Tung Giang Le, Xuan Tung Nguyen, Won-Joo Hwang

Main category: cs.LG

TL;DR: A quantum Graph Neural Network (QGNN) using Parameterized Quantum Circuits achieves comparable performance to classical GNNs for wireless resource management with fewer parameters and inherent parallelism.

DetailsMotivation: Increasing complexity of wireless networks requires scalable resource management solutions, while classical GNNs face high computational costs in large-scale settings.

Method: Developed a fully quantum GNN with Quantum Graph Convolutional Layers (QGCLs) that encode features into quantum states, process graphs using NISQ-compatible unitaries, and retrieve embeddings through measurement.

Result: Applied to D2D power control for SINR maximization, the QGNN matches classical performance while using fewer parameters and leveraging inherent quantum parallelism.

Conclusion: This end-to-end PQC-based GNN represents progress toward quantum-accelerated wireless optimization.

Abstract: Increasing wireless network complexity demands scalable resource management. Classical GNNs excel at graph learning but incur high computational costs in large-scale settings. We present a fully quantum Graph Neural Network (QGNN) that implements message passing via Parameterized Quantum Circuits (PQCs). Our Quantum Graph Convolutional Layers (QGCLs) encode features into quantum states, process graphs with NISQ-compatible unitaries, and retrieve embeddings through measurement. Applied to D2D power control for SINR maximization, our QGNN matches classical performance with fewer parameters and inherent parallelism. This end-to-end PQC-based GNN marks a step toward quantum-accelerated wireless optimization.

[328] EntroPIC: Towards Stable Long-Term Training of LLMs via Entropy Stabilization with Proportional-Integral Control

Kai Yang, Xin Xu, Yangkun Chen, Weijie Liu, Jiafei Lyu, Zichuan Lin, Deheng Ye, Saiyong Yang

Main category: cs.LG

TL;DR: Proposes EntroPIC, a method using proportional-integral control to stabilize entropy in LLM training by dynamically adjusting loss coefficients for positive/negative samples, enabling stable exploration and preventing premature convergence.

DetailsMotivation: Existing RL methods struggle to maintain appropriate entropy levels during LLM training due to mixed positive/negative samples affecting entropy differently across steps, leading to sub-optimal convergence.

Method: Entropy stabilization via Proportional-Integral Control (EntroPIC) that adaptively tunes loss coefficients for positive and negative samples to maintain stable entropy throughout training.

Result: Theoretical analysis shows EntroPIC effectively controls entropy in both on-policy and off-policy settings. Experiments demonstrate successful maintenance of desired entropy levels for stable LLM training.

Conclusion: EntroPIC enables stable and optimal RL training for large language models by ensuring efficient exploration through entropy stabilization.

Abstract: Long-term training of large language models (LLMs) requires maintaining stable exploration to prevent the model from collapsing into sub-optimal behaviors. Entropy is crucial in this context, as it controls exploration and helps avoid premature convergence to sub-optimal solutions. However, existing reinforcement learning methods struggle to maintain an appropriate level of entropy, as the training process involves a mix of positive and negative samples, each affecting entropy in different ways across steps. To address this, we propose Entropy stabilization via Proportional-Integral Control (EntroPIC), a novel method that adaptively adjusts the influence of positive and negative samples by dynamically tuning their loss coefficients. This approach stabilizes entropy throughout training, ensuring efficient exploration and steady progress. We provide a comprehensive theoretical analysis for both on-policy and off-policy learning settings, demonstrating that EntroPIC is effective at controlling entropy in large-scale LLM training. Experimental results show that our method successfully maintains desired entropy levels, enabling stable and optimal RL training for LLMs.
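
A minimal sketch of proportional-integral control applied to an entropy target is shown below. The controller itself is the textbook PI rule. The mapping from the control signal to loss coefficients is an assumption made here for illustration: it supposes that up-weighting negative samples raises entropy, and the abstract does not give the actual rule. The gains, the clipping limit, and all names are hypothetical.

```python
class PIController:
    """Textbook proportional-integral controller tracking a target value."""

    def __init__(self, target, kp, ki):
        self.target = target
        self.kp = kp
        self.ki = ki
        self.integral = 0.0

    def update(self, measured):
        """Accumulate the error and return the control signal."""
        error = self.target - measured
        self.integral += error
        return self.kp * error + self.ki * self.integral


def sample_weights(control, limit=0.5):
    """Map the control signal to (positive, negative) loss coefficients.

    Assumed convention: when entropy is below target (control > 0), shift
    weight from positive to negative samples. The signal is clipped so the
    coefficients stay in [1 - limit, 1 + limit].
    """
    u = max(-limit, min(limit, control))
    return 1.0 - u, 1.0 + u


controller = PIController(target=1.0, kp=0.5, ki=0.1)
for measured_entropy in [0.8, 0.8, 0.8]:  # entropy stays below target
    control = controller.update(measured_entropy)
    w_pos, w_neg = sample_weights(control)
```

Because the error persists across the three steps, the integral term keeps growing and the correction strengthens each step, which is the property that distinguishes PI control from a purely proportional adjustment.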

[329] Optimized scheduling of electricity-heat cooperative system considering wind energy consumption and peak shaving and valley filling

Jin Ye, Lingmei Wang, Shujian Zhang, Haihang Wu

Main category: cs.LG

TL;DR: Proposes PVTD3 algorithm for combined power-heat system scheduling, reducing costs by 6.93-13.59% and grid power fluctuations by 12.8% across renewable penetration scenarios.

DetailsMotivation: Address scheduling optimization challenges in combined power-heat systems under renewable energy integration and multiple uncertainties during global energy transition.

Method: Improved Dual-Delay Deep Deterministic Policy Gradient (PVTD3) algorithm with penalty term for grid power purchase variations.

Result: PVTD3 reduces system comprehensive cost by 6.93%, 12.68%, 13.59% at 10%, 20%, 30% renewable penetration; reduces grid power fluctuation by 12.8%; improves energy storage management with reduced low-temperature tank end-states and safe high-temperature tank operation.

Conclusion: PVTD3 algorithm demonstrates superior economic efficiency, grid stability, and sustainable scheduling capabilities in energy storage management for combined power-heat systems.

Abstract: With the global energy transition and rapid development of renewable energy, the scheduling optimization challenge for combined power-heat systems under new energy integration and multiple uncertainties has become increasingly prominent. Addressing this challenge, this study proposes an intelligent scheduling method based on the improved Dual-Delay Deep Deterministic Policy Gradient (PVTD3) algorithm. System optimization is achieved by introducing a penalty term for grid power purchase variations. Simulation results demonstrate that under three typical scenarios (10%, 20%, and 30% renewable penetration), the PVTD3 algorithm reduces the system’s comprehensive cost by 6.93%, 12.68%, and 13.59% respectively compared to the traditional TD3 algorithm. Concurrently, it reduces the average fluctuation amplitude of grid power purchases by 12.8%. Regarding energy storage management, the PVTD3 algorithm reduces the end-time state values of low-temperature thermal storage tanks by 7.67-17.67 units while maintaining high-temperature tanks within the 3.59-4.25 safety operating range. Multi-scenario comparative validation demonstrates that the proposed algorithm not only excels in economic efficiency and grid stability but also exhibits superior sustainable scheduling capabilities in energy storage device management.
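
One way a penalty on grid power purchase variations could enter a reinforcement learning reward is sketched below. The functional form (an absolute-difference penalty added to the energy cost), the weight, and the prices are all assumptions for illustration; the abstract does not specify how the penalty is defined.

```python
def episode_return(prices, purchases, penalty_weight):
    """Negative total cost plus a penalty on step-to-step purchase changes.

    prices[t] is the electricity price and purchases[t] the grid power
    bought at step t. Both the units and the penalty form are hypothetical.
    """
    energy_cost = sum(price * power for price, power in zip(prices, purchases))
    variation = sum(
        abs(curr - prev) for prev, curr in zip(purchases, purchases[1:])
    )
    return -(energy_cost + penalty_weight * variation)


prices = [0.5, 0.5, 0.5, 0.5]
flat_schedule = [10.0, 10.0, 10.0, 10.0]   # same total energy, no swings
spiky_schedule = [4.0, 16.0, 4.0, 16.0]    # same total energy, large swings

flat_return = episode_return(prices, flat_schedule, penalty_weight=0.2)
spiky_return = episode_return(prices, spiky_schedule, penalty_weight=0.2)
```

Both schedules buy the same total energy at the same price, so only the variation penalty separates them. An agent trained on such a return would be pushed toward the flatter purchase profile, in line with the peak-shaving and valley-filling goal in the title.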

[330] PLATONT: Learning a Platonic Representation for Unified Network Tomography

Chengze Du, Heng Xu, Zhiwei Yu, Bo Liu, Jialong Li

Main category: cs.LG

TL;DR: PLATONT is a unified framework that models different network indicators as projections of a shared latent network state, enabling improved cross-task generalization in network tomography.

DetailsMotivation: Existing network tomography methods solve problems separately using limited task-specific signals, which limits generalization and interpretability.

Method: PLATONT learns a shared latent network state through multimodal alignment and contrastive learning, training multiple tomography tasks within a shared latent space.

Result: Experiments show PLATONT consistently outperforms existing methods in link estimation, topology inference, and traffic prediction with higher accuracy and stronger robustness.

Conclusion: The framework successfully builds compact and structured representations that improve cross-task generalization in network tomography.

Abstract: Network tomography aims to infer hidden network states, such as link performance, traffic load, and topology, from external observations. Most existing methods solve these problems separately and depend on limited task-specific signals, which limits generalization and interpretability. We present PLATONT, a unified framework that models different network indicators (e.g., delay, loss, bandwidth) as projections of a shared latent network state. Guided by the Platonic Representation Hypothesis, PLATONT learns this latent state through multimodal alignment and contrastive learning. By training multiple tomography tasks within a shared latent space, it builds compact and structured representations that improve cross-task generalization. Experiments on synthetic and real-world datasets show that PLATONT consistently outperforms existing methods in link estimation, topology inference, and traffic prediction, achieving higher accuracy and stronger robustness under varying network conditions.

[331] GRPO-RM: Fine-Tuning Representation Models via GRPO-Driven Reinforcement Learning

Yanchen Xu, Ziheng Jiao, Hongyuan Zhang, Xuelong Li

Main category: cs.LG

TL;DR: GRPO-RM extends Group Relative Policy Optimization from LLMs to representation learning models, using predefined output sets and specialized rewards to improve representation model performance.

DetailsMotivation: To investigate whether the successful GRPO method used for fine-tuning LLMs can be generalized to representation learning models, bridging the gap between policy optimization techniques across different model types.

Method: Proposes GRPO-RM which establishes predefined output sets to replace token sampling, generates output groups for probability-driven optimization, and designs specialized reward functions tailored for representation models.

Result: Extensive experiments on various real-world datasets validate the effectiveness of the proposed GRPO-RM method for representation models.

Conclusion: GRPO-RM successfully generalizes the GRPO framework to representation learning models, demonstrating that GRPO-like policies can be effectively applied in post-training representation models.

Abstract: Group Relative Policy Optimization (GRPO), a reinforcement learning method used to fine-tune large language models (LLMs), has proven effective in practical applications such as DeepSeek-R1. This raises the question of whether GRPO can be generalized to representation learning models. In this paper, we propose Group Relative Policy Optimization for Representation Model (GRPO-RM), and investigate the performance of GRPO-like policy in post-training representation models. Specifically, our method establishes a predefined output set to functionally replace token sequence sampling in LLMs, thereby generating an output group, which is essential for the probability-driven optimization of GRPO. In addition, a specialized reward function is designed to accommodate the properties of representation models. Extensive experiments are conducted on various real-world datasets to validate the effectiveness of our proposed method.
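
The group-relative normalization at the core of GRPO can be sketched in a few lines. In this illustration the output group is a fixed set of class labels standing in for the predefined output set, and the reward is a plain correct/incorrect indicator. The reward and the label set are assumptions: the paper's specialized reward function is not described in the abstract.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical predefined output set for one input whose true label is "dog".
output_set = ["cat", "dog", "bird", "fish"]
true_label = "dog"
rewards = [1.0 if output == true_label else 0.0 for output in output_set]
advantages = group_relative_advantages(rewards)
```

Only the correct output receives a positive advantage, and the advantages sum to roughly zero, so the update raises the probability of that output relative to the rest of the group without needing a separate value network.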

[332] SNAP: Low-Latency Test-Time Adaptation with Sparse Updates

Hyeongheon Cha, Dong Min Kim, Hye Won Chung, Taesik Gong, Sung-Ju Lee

Main category: cs.LG

TL;DR: SNAP is a sparse Test-Time Adaptation framework that reduces adaptation frequency and data usage while maintaining accuracy, making TTA practical for resource-constrained edge devices.

DetailsMotivation: Existing TTA methods require frequent adaptation and high computational costs, making them unsuitable for edge environments with limited resources.

Method: SNAP introduces two components: Class and Domain Representative Memory (CnDRM) to store representative samples for efficient adaptation, and Inference-only Batch-aware Memory Normalization (IoBMN) to dynamically adjust normalization statistics using these samples.

Result: SNAP reduces latency by up to 93.12% while keeping accuracy drop below 3.3%, even when adapting with only 1% of incoming data across adaptation rates from 1% to 50%.

Conclusion: SNAP demonstrates strong potential for practical deployment on edge devices serving latency-sensitive applications by enabling efficient TTA with minimal data and computational overhead.

Abstract: Test-Time Adaptation (TTA) adjusts models using unlabeled test data to handle dynamic distribution shifts. However, existing methods rely on frequent adaptation and high computational cost, making them unsuitable for resource-constrained edge environments. To address this, we propose SNAP, a sparse TTA framework that reduces adaptation frequency and data usage while preserving accuracy. SNAP maintains competitive accuracy even when adapting based on only 1% of the incoming data stream, demonstrating its robustness under infrequent updates. Our method introduces two key components: (i) Class and Domain Representative Memory (CnDRM), which identifies and stores a small set of samples that are representative of both class and domain characteristics to support efficient adaptation with limited data; and (ii) Inference-only Batch-aware Memory Normalization (IoBMN), which dynamically adjusts normalization statistics at inference time by leveraging these representative samples, enabling efficient alignment to shifting target domains. Integrated with five state-of-the-art TTA algorithms, SNAP reduces latency by up to 93.12%, while keeping the accuracy drop below 3.3%, even across adaptation rates ranging from 1% to 50%. This demonstrates its strong potential for practical use on edge devices serving latency-sensitive applications. The source code is available at https://github.com/chahh9808/SNAP.

[333] Quant-Trim in Practice: Improved Cross-Platform Low-Bit Deployment on Edge NPUs

Rayen Dhahri, Steffen Urban

Main category: cs.LG

TL;DR: Quant-Trim is a training-phase method that creates hardware-neutral checkpoints robust to different quantization backends and precisions, eliminating the need for per-backend retraining or vendor-specific optimizations.

DetailsMotivation: Specialized edge accelerators use low-bit quantization, but vendor compilers differ in implementation details, causing inconsistent accuracy across backends from the same FP checkpoint, forcing practitioners to tweak flags or refactor models.

Method: Combines progressive fake quantization to align training with deployed integer grid and reverse pruning to tame outlier-driven scale inflation while preserving learnability. Agnostic to quantization schemes and requires no vendor-specific graph changes.

Result: Narrows the FP/low-bit gap, reduces dependence on compiler heuristics/calibration, avoids per-backend retraining, and maintains accuracy across models and tasks with improved edge metrics (latency, throughput, energy/inference, cost).

Conclusion: Quant-Trim provides a robust solution for quantization inconsistencies across edge accelerators, enabling hardware-neutral deployment without sacrificing accuracy or requiring backend-specific modifications.

Abstract: Specialized edge accelerators rely on low-bit quantization, but vendor compilers differ in scaling, clipping, and kernel support, often as black boxes. The same floating-point (FP) checkpoint can therefore yield inconsistent accuracy across backends, forcing practitioners to tweak flags or refactor models to vendor-friendly operator subsets. We introduce Quant-Trim, a training-phase method that produces a hardware-neutral checkpoint robust to backend and precision choices. It combines progressive fake quantization to align training with the deployed integer grid and reverse pruning to tame outlier-driven scale inflation while preserving learnability. Quant-Trim is agnostic to quantization schemes (symmetric/asymmetric, per-tensor/per-channel, INT8/INT4) and requires no vendor-specific graph changes. Across models and tasks, it narrows the FP/low-bit gap, reduces dependence on compiler heuristics/calibration, and avoids per-backend retraining. We report accuracy and edge metrics (latency, throughput, energy/inference, and cost) under static/dynamic activation scaling and varying operator coverage.
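
The terms "fake quantization" and "outlier-driven scale inflation" can be made concrete with a small sketch. It uses symmetric per-tensor quantization with a max-abs scale, a common textbook scheme chosen here for illustration rather than the paper's training procedure. The progressive schedule and the reverse pruning step are not shown.

```python
def fake_quantize(values, bits=8):
    """Round values onto a symmetric integer grid, then map back to floats."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return list(values)
    scale = max_abs / qmax  # one scale for the whole tensor
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def mean_abs_error(original, quantized):
    """Average absolute quantization error."""
    return sum(abs(a - b) for a, b in zip(original, quantized)) / len(original)


weights = [0.01, -0.02, 0.03, -0.04]

# Without an outlier the INT4 grid spans the weights' own range.
clean_error = mean_abs_error(weights, fake_quantize(weights, bits=4))

# A single large outlier inflates the scale, so small weights collapse to 0.
with_outlier = fake_quantize(weights + [8.0], bits=4)
outlier_error = mean_abs_error(weights, with_outlier[:4])
```

The outlier forces a grid step much larger than the ordinary weights, so they all round to zero and the error on them grows sharply. Limiting such outliers during training is the kind of effect the abstract attributes to reverse pruning.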

[334] On the Internal Semantics of Time-Series Foundation Models

Atharva Pandey, Abhilash Neog, Gautam Jajoo

Main category: cs.LG

TL;DR: Systematic investigation of concept interpretability in Time-series Foundation Models (TSFMs), revealing layer-wise encoding patterns, linear recoverability limitations, and compositional concept interference.

DetailsMotivation: Despite empirical success of TSFMs, their internal mechanisms for representing fundamental time-series concepts remain poorly understood, necessitating systematic analysis of concept interpretability.

Method: Layer-wise analyses, linear recoverability tests, and representation similarity measures to examine concept encoding across model layers and compositional processing.

Result: Early layers capture local time-domain patterns (AR(1), level shifts, trends), deeper layers encode dispersion and change-time signals, spectral/warping factors are hardest to recover linearly, and compositional settings show concept interference.

Conclusion: Atomic concepts are reliably localized but composition remains challenging, highlighting a key limitation in current TSFMs’ ability to represent interacting temporal phenomena.

Abstract: Time-series Foundation Models (TSFMs) have recently emerged as a universal paradigm for learning across diverse temporal domains. However, despite their empirical success, the internal mechanisms by which these models represent fundamental time-series concepts remain poorly understood. In this work, we undertake a systematic investigation of concept interpretability in TSFMs. Specifically, we examine: (i) which layers encode which concepts, (ii) whether concept parameters are linearly recoverable, (iii) how representations evolve in terms of concept disentanglement and abstraction across model depth, and (iv) how models process compositions of concepts. We systematically probe these questions using layer-wise analyses, linear recoverability tests, and representation similarity measures, providing a structured account of TSFM semantics. The resulting insights show that early layers mainly capture local, time-domain patterns (e.g., AR(1), level shifts, trends), while deeper layers encode dispersion and change-time signals, with spectral and warping factors remaining the hardest to recover linearly. In compositional settings, however, probe performance degrades, revealing interference between concepts. This highlights that while atomic concepts are reliably localized, composition remains a challenge, underscoring a key limitation in current TSFMs’ ability to represent interacting temporal phenomena.

[335] KrawtchoukNet: A Unified GNN Solution for Heterophily and Over-smoothing with Adaptive Bounded Polynomials

Huseyin Goksu

Main category: cs.LG

TL;DR: KrawtchoukNet is a spectral GNN using discrete Krawtchouk polynomials that solves two key limitations: heterophilic graph performance collapse and over-smoothing at high polynomial degrees.

DetailsMotivation: Standard spectral GNNs like ChebyNet fail on heterophilic graphs and suffer from over-smoothing at high polynomial degrees due to their static, low-pass filter nature.

Method: Proposes KrawtchoukNet with two key designs: 1) fixing polynomial domain N to a small constant for inherently bounded recurrence coefficients, 2) making the filter’s shape parameter p learnable for adaptive spectral response.

Result: Achieves SOTA robustness to over-smoothing at K=10 and SOTA performance on heterophilic benchmarks (Texas, Cornell), outperforming GAT and APPNP.

Conclusion: KrawtchoukNet provides a unified solution to both heterophilic graph performance and over-smoothing problems through bounded coefficients and adaptive filtering.

Abstract: Spectral Graph Neural Networks (GNNs) based on polynomial filters, such as ChebyNet, suffer from two critical limitations: 1) performance collapse on “heterophilic” graphs and 2) performance collapse at high polynomial degrees (K), known as over-smoothing. Both issues stem from the static, low-pass nature of standard filters. In this work, we propose KrawtchoukNet, a GNN filter based on the discrete Krawtchouk polynomials. We demonstrate that KrawtchoukNet provides a unified solution to both problems through two key design choices. First, by fixing the polynomial’s domain N to a small constant (e.g., N=20), we create the first GNN filter whose recurrence coefficients are inherently bounded, making it exceptionally robust to over-smoothing (achieving SOTA results at K=10). Second, by making the filter’s shape parameter p learnable, the filter adapts its spectral response to the graph data. We show this adaptive nature allows KrawtchoukNet to achieve SOTA performance on challenging heterophilic benchmarks (Texas, Cornell), decisively outperforming standard GNNs like GAT and APPNP.
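
Illustrative sketch (not the paper's filter): the textbook three-term recurrence for the Krawtchouk polynomials K_n(x; p, N), in the normalization with K_0 = 1 and K_1 = 1 - x/(pN). How the paper maps graph eigenvalues onto the domain and makes p learnable is not specified here, so the function name and the scalar evaluation are assumptions. Under this normalization, the recurrence coefficients p(N-n) and n(1-p) stay below N when N is fixed.

```python
import numpy as np
from math import comb

def krawtchouk(K, x, p, N):
    """Return [K_0(x), ..., K_K(x)] for Krawtchouk polynomials K_n(x; p, N), with K <= N."""
    x = np.asarray(x, dtype=float)
    polys = [np.ones_like(x)]
    if K >= 1:
        polys.append(1.0 - x / (p * N))
    for n in range(1, K):
        a = p * (N - n)        # coefficient multiplying K_{n+1}
        c = n * (1.0 - p)      # coefficient multiplying K_{n-1}
        polys.append(((a + c - x) * polys[n] - c * polys[n - 1]) / a)
    return np.stack(polys)

N, p = 20, 0.3
x = np.arange(N + 1)
P = krawtchouk(5, x, p, N)

# Binomial weights under which the family is orthogonal on {0, ..., N}.
w = np.array([comb(N, int(k)) * p ** int(k) * (1 - p) ** (N - int(k)) for k in x])
gram = (P * w) @ P.T
```

The Gram matrix is diagonal up to rounding, which is a quick sanity check that the recurrence is implemented correctly.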

[336] LaguerreNet: Advancing a Unified Solution for Heterophily and Over-smoothing with Adaptive Continuous Polynomials

Huseyin Goksu

Main category: cs.LG

TL;DR: LaguerreNet introduces continuous Laguerre polynomials as GNN filters with trainable alpha parameters to address heterophilic graph performance and over-smoothing issues, achieving SOTA results with improved stability.

DetailsMotivation: Spectral GNNs suffer from poor performance on heterophilic graphs and over-smoothing at high polynomial degrees due to static, low-pass filters. Adaptive polynomial filters need extension to continuous domains and stability solutions.

Method: Proposes LaguerreNet using continuous Laguerre polynomials with trainable alpha parameter for adaptive filtering. Solves numerical instability with LayerNorm-based stabilization technique.

Result: Achieves state-of-the-art results on heterophilic benchmarks and exceptional robustness to over-smoothing, with performance peaking at K=10, an order of magnitude beyond the degree at which ChebyNet collapses.

Conclusion: LaguerreNet effectively addresses key limitations of spectral GNNs through adaptive continuous polynomial filters with trainable parameters and numerical stabilization.

Abstract: Spectral Graph Neural Networks (GNNs) suffer from two critical limitations: poor performance on “heterophilic” graphs and performance collapse at high polynomial degrees (K), known as over-smoothing. Both issues stem from the static, low-pass nature of standard filters (e.g., ChebyNet). While adaptive polynomial filters, such as the discrete MeixnerNet, have emerged as a potential unified solution, their extension to the continuous domain and stability with unbounded coefficients remain open questions. In this work, we propose LaguerreNet, a novel GNN filter based on continuous Laguerre polynomials. LaguerreNet learns the filter’s spectral shape by making its core alpha parameter trainable, thereby advancing the adaptive polynomial approach. We solve the severe O(k^2) numerical instability of these unbounded polynomials using a LayerNorm-based stabilization technique. We demonstrate experimentally that this approach is highly effective: 1) LaguerreNet achieves state-of-the-art results on challenging heterophilic benchmarks. 2) It is exceptionally robust to over-smoothing, with performance peaking at K=10, an order of magnitude beyond where ChebyNet collapses.
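
Illustrative sketch (not the paper's implementation): generalized Laguerre polynomials applied to a graph Laplacian in the usual ChebyNet style, using the standard recurrence (n+1) L_{n+1} = (2n+1+alpha-x) L_n - (n+alpha) L_{n-1}. The function name, the toy path graph, and the fixed alpha are assumptions, and the LayerNorm-based stabilization mentioned in the abstract is deliberately left out.

```python
import numpy as np

def laguerre_basis(lap, feats, K, alpha):
    """Return [L_0(lap) @ feats, ..., L_K(lap) @ feats] for generalized Laguerre L_n^alpha."""
    terms = [feats]
    if K >= 1:
        terms.append((1.0 + alpha) * feats - lap @ feats)
    for n in range(1, K):
        nxt = ((2 * n + 1 + alpha) * terms[n] - lap @ terms[n]
               - (n + alpha) * terms[n - 1]) / (n + 1)
        terms.append(nxt)
    return terms

# Toy 3-node path graph (hypothetical data).
lap = np.array([[1.0, -1.0, 0.0],
                [-1.0, 2.0, -1.0],
                [0.0, -1.0, 1.0]])
feats = np.array([[1.0, 0.0],
                  [0.0, 2.0],
                  [3.0, 1.0]])
alpha = 0.5
basis = laguerre_basis(lap, feats, K=3, alpha=alpha)

# Closed form of L_2^alpha(x) = x^2/2 - (alpha+2) x + (alpha+1)(alpha+2)/2, applied to lap.
closed_l2 = (0.5 * lap @ lap @ feats - (alpha + 2) * lap @ feats
             + 0.5 * (alpha + 1) * (alpha + 2) * feats)
```

In a trainable filter, the output would be a learned combination of the basis terms and alpha would be a parameter. Both are omitted here.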

[337] STREAM-VAE: Dual-Path Routing for Slow and Fast Dynamics in Vehicle Telemetry Anomaly Detection

Kadir-Kaan Özer, René Ebeling, Markus Enzweiler

Main category: cs.LG

TL;DR: STREAM-VAE is a dual-path VAE that separates slow drift and fast spike dynamics in automotive telemetry data for more robust anomaly detection.

DetailsMotivation: Standard reconstruction-based methods mix heterogeneous time scales in automotive telemetry data, smoothing out spikes or inflating variances, which weakens anomaly detection performance.

Method: Uses a dual-path encoder to separate slow drift and fast spike signal dynamics, with a decoder that represents transient deviations separately from normal operating patterns.

Result: Outperforms strong forecasting, attention, graph, and VAE baselines on automotive telemetry dataset and SMD benchmark, showing improved robustness.

Conclusion: Explicitly separating drift and spike dynamics in time-series data improves anomaly detection robustness for automotive telemetry applications.

Abstract: Automotive telemetry data exhibits slow drifts and fast spikes, often within the same sequence, making reliable anomaly detection challenging. Standard reconstruction-based methods, including sequence variational autoencoders (VAEs), use a single latent process and therefore mix heterogeneous time scales, which can smooth out spikes or inflate variances and weaken anomaly separation. In this paper, we present STREAM-VAE, a variational autoencoder for anomaly detection in automotive telemetry time-series data. Our model uses a dual-path encoder to separate slow drift and fast spike signal dynamics, and a decoder that represents transient deviations separately from the normal operating pattern. STREAM-VAE is designed for deployment, producing stable anomaly scores across operating modes for both in-vehicle monitors and backend fleet analytics. Experiments on an automotive telemetry dataset and the public SMD benchmark show that explicitly separating drift and spike dynamics improves robustness compared to strong forecasting, attention, graph, and VAE baselines.

[338] Multi-layer Stack Ensembles for Time Series Forecasting

Nathanael Bosch, Oleksandr Shchur, Nick Erickson, Michael Bohlke-Schneider, Caner Türkmen

Main category: cs.LG

TL;DR: Systematic exploration of 33 ensemble models for time series forecasting shows stacking consistently improves accuracy, with multi-layer stacking framework providing superior performance across diverse scenarios.

DetailsMotivation: Ensemble methods are underutilized in time series forecasting despite their effectiveness in tabular tasks, with simple linear combinations still considered state-of-the-art.

Method: Evaluated 33 ensemble models across 50 real-world datasets and proposed a multi-layer stacking framework that combines strengths of different stacker models.

Result: Stacking consistently improves accuracy, though no single stacker performs best across all tasks. Multi-layer stacking provides superior accuracy across diverse forecasting scenarios.

Conclusion: Stacking-based methods have significant potential to improve AutoML systems for time series forecasting.

Abstract: Ensembling is a powerful technique for improving the accuracy of machine learning models, with methods like stacking achieving strong results in tabular tasks. In time series forecasting, however, ensemble methods remain underutilized, with simple linear combinations still considered state-of-the-art. In this paper, we systematically explore ensembling strategies for time series forecasting. We evaluate 33 ensemble models – both existing and novel – across 50 real-world datasets. Our results show that stacking consistently improves accuracy, though no single stacker performs best across all tasks. To address this, we propose a multi-layer stacking framework for time series forecasting, an approach that combines the strengths of different stacker models. We demonstrate that this method consistently provides superior accuracy across diverse forecasting scenarios. Our findings highlight the potential of stacking-based methods to improve AutoML systems for time series forecasting.
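
Illustrative sketch (not the paper's framework): a single-layer linear stacker that learns combination weights for base forecasts on a validation window and applies them to a later test window. The synthetic series, the three base models, and the helper names are assumptions. A multi-layer variant would feed several stackers' outputs into a further stacker, which is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 200
truth = np.sin(np.arange(T) / 8.0)

# Hypothetical base forecasts: the truth distorted by bias, shrinkage, or noise.
base = np.stack([
    truth + 0.3 + 0.1 * rng.normal(size=T),
    0.7 * truth + 0.1 * rng.normal(size=T),
    truth + 0.4 * rng.normal(size=T),
], axis=1)                                   # shape (T, n_models)

val, test = slice(0, 120), slice(120, T)     # validation precedes test in time

def fit_linear_stacker(preds, y):
    """Least-squares weights (plus intercept) mapping base forecasts to targets."""
    design = np.hstack([preds, np.ones((len(preds), 1))])
    w, *_ = np.linalg.lstsq(design, y, rcond=None)
    return w

def apply_stacker(w, preds):
    return np.hstack([preds, np.ones((len(preds), 1))]) @ w

def mse(a, b):
    return float(np.mean((a - b) ** 2))

w = fit_linear_stacker(base[val], truth[val])
val_mse_stack = mse(apply_stacker(w, base[val]), truth[val])
val_mse_models = [mse(base[val, j], truth[val]) for j in range(base.shape[1])]
test_mse_stack = mse(apply_stacker(w, base[test], ), truth[test])
```

On the window used for fitting, the stacker can never be worse than any single base model, because picking one model is a special case of the weights. Performance on the later window is what matters in practice.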

[339] Cost-Aware Prediction (CAP): An LLM-Enhanced Machine Learning Pipeline and Decision Support System for Heart Failure Mortality Prediction

Yinan Yu, Falk Dippel, Christina E. Lundberg, Martin Lindgren, Annika Rosengren, Martin Adiels, Helen Sjöland

Main category: cs.LG

TL;DR: A cost-aware prediction framework combining ML with LLM agents for clinical decision support in heart failure patients, focusing on cost-benefit analysis and interpretability.

DetailsMotivation: Traditional ML models lack consideration of downstream value trade-offs and clinical interpretability, limiting their practical utility in healthcare decision-making.

Method: Developed ML model for 1-year mortality prediction in heart failure patients, introduced clinical impact projection curves for cost visualization, and used four LLM agents for patient-specific descriptions.

Result: XGB model achieved AUROC of 0.804, AUPRC of 0.529, and Brier score of 0.135. System was well-received by clinicians but needs improved technical accuracy for speculative tasks.

Conclusion: CAP framework successfully integrates ML predictions with LLM-generated cost-benefit analysis for more transparent and interpretable clinical decision support.

Abstract: Objective: Machine learning (ML) predictive models are often developed without considering downstream value trade-offs and clinical interpretability. This paper introduces a cost-aware prediction (CAP) framework that combines cost-benefit analysis assisted by large language model (LLM) agents to communicate the trade-offs involved in applying ML predictions. Materials and Methods: We developed an ML model predicting 1-year mortality in patients with heart failure (N = 30,021, 22% mortality) to identify those eligible for home care. We then introduced clinical impact projection (CIP) curves to visualize important cost dimensions (quality of life and healthcare provider expenses, further divided into treatment and error costs) to assess the clinical consequences of predictions. Finally, we used four LLM agents to generate patient-specific descriptions. The system was evaluated by clinicians for its decision support value. Results: The eXtreme gradient boosting (XGB) model achieved the best performance, with an area under the receiver operating characteristic curve (AUROC) of 0.804 (95% confidence interval (CI) 0.792-0.816), area under the precision-recall curve (AUPRC) of 0.529 (95% CI 0.502-0.558) and a Brier score of 0.135 (95% CI 0.130-0.140). Discussion: The CIP cost curves provided a population-level overview of cost composition across decision thresholds, whereas the LLM agents generated cost-benefit analyses at the individual patient level. The system was well received according to the evaluation by clinicians. However, feedback emphasizes the need to strengthen the technical accuracy for speculative tasks. Conclusion: CAP utilizes LLM agents to integrate ML classifier outcomes and cost-benefit analysis for more transparent and interpretable decision support.
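
Illustrative sketch (not the paper's CIP curves): the Brier score reported above is the mean squared difference between predicted probability and outcome, and a threshold-dependent cost can be tabulated from false positives and false negatives. The unit costs, the toy predictions, and the function names are assumptions. The paper's curves further split costs into quality-of-life and provider components, which this simplification ignores.

```python
import numpy as np

def brier_score(y_true, p):
    """Mean squared error between predicted probabilities and binary outcomes."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.asarray(p, dtype=float)
    return float(np.mean((p - y_true) ** 2))

def expected_cost_curve(y_true, p, thresholds, cost_fp, cost_fn):
    """Average per-patient error cost at each decision threshold."""
    y = np.asarray(y_true).astype(bool)
    p = np.asarray(p, dtype=float)
    costs = []
    for t in thresholds:
        pred = p >= t
        fp = np.sum(pred & ~y)      # flagged but outcome did not occur
        fn = np.sum(~pred & y)      # missed although outcome occurred
        costs.append((cost_fp * fp + cost_fn * fn) / len(y))
    return np.array(costs)

# Toy data and hypothetical unit costs.
y = [1, 0, 0, 1]
p = [0.9, 0.2, 0.6, 0.4]
brier = brier_score(y, p)
costs = expected_cost_curve(y, p, thresholds=[0.3, 0.5], cost_fp=1.0, cost_fn=5.0)
```

Sweeping the threshold over a fine grid gives a population-level view of how the error cost changes with the decision rule.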

[340] CID: Measuring Feature Importance Through Counterfactual Distributions

Eddie Conti, Álvaro Parafita, Axel Brando

Main category: cs.LG

TL;DR: Introduces Counterfactual Importance Distribution (CID), a novel post-hoc local feature importance method that uses counterfactuals and distributional dissimilarity to rank features, improving faithfulness metrics.

DetailsMotivation: Existing feature importance methods lack a definitive ground truth for comparison, which calls for alternative, well-founded measures.

Method: Generate positive and negative counterfactuals, model their distributions using Kernel Density Estimation, and rank features based on distributional dissimilarity measure that satisfies metric properties.

Result: CID outperforms established local feature importance explainers, improves faithfulness metrics (comprehensiveness and sufficiency), and provides more faithful explanations.

Conclusion: CID serves as a valuable tool for model analysis by offering complementary perspectives and rigorous mathematical framework for feature importance assessment.

Abstract: Assessing the importance of individual features in Machine Learning is critical to understand the model’s decision-making process. While numerous methods exist, the lack of a definitive ground truth for comparison highlights the need for alternative, well-founded measures. This paper introduces a novel post-hoc local feature importance method called Counterfactual Importance Distribution (CID). We generate two sets of positive and negative counterfactuals, model their distributions using Kernel Density Estimation, and rank features based on a distributional dissimilarity measure. This measure, grounded in a rigorous mathematical framework, satisfies key properties required to function as a valid metric. We showcase the effectiveness of our method by comparing with well-established local feature importance explainers. Our method not only offers complementary perspectives to existing approaches, but also improves performance on faithfulness metrics (both for comprehensiveness and sufficiency), resulting in more faithful explanations of the system. These results highlight its potential as a valuable tool for model analysis.
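
Illustrative sketch (not the paper's measure): per-feature kernel density estimates of positive and negative counterfactuals, compared with a distributional distance and ranked. The Hellinger distance is used here only as a stand-in that happens to be a valid metric. The abstract does not say which dissimilarity CID uses, and the bandwidth rule, grid, and synthetic counterfactuals are likewise assumptions.

```python
import numpy as np

def gaussian_kde_1d(samples, grid):
    """Gaussian KDE evaluated on grid, with a normal-reference bandwidth."""
    samples = np.asarray(samples, dtype=float)
    n = len(samples)
    h = 1.06 * samples.std(ddof=1) * n ** (-1 / 5)
    z = (grid[:, None] - samples[None, :]) / h
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (n * h * np.sqrt(2 * np.pi))

def hellinger(p, q, dx):
    """Hellinger distance between two densities sampled on a uniform grid."""
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2) * dx))

def rank_features(pos_cf, neg_cf, grid_size=512):
    """Rank features by the distance between their counterfactual distributions."""
    scores = []
    for j in range(pos_cf.shape[1]):
        both = np.concatenate([pos_cf[:, j], neg_cf[:, j]])
        pad = 3 * both.std()
        grid = np.linspace(both.min() - pad, both.max() + pad, grid_size)
        dx = grid[1] - grid[0]
        scores.append(hellinger(gaussian_kde_1d(pos_cf[:, j], grid),
                                gaussian_kde_1d(neg_cf[:, j], grid), dx))
    scores = np.array(scores)
    return np.argsort(-scores), scores

# Synthetic counterfactuals: feature 0 separates the two sets, feature 1 does not.
rng = np.random.default_rng(0)
pos_cf = np.column_stack([rng.normal(2.0, 1.0, 200), rng.normal(0.0, 1.0, 200)])
neg_cf = np.column_stack([rng.normal(-2.0, 1.0, 200), rng.normal(0.0, 1.0, 200)])
order, scores = rank_features(pos_cf, neg_cf)
```

A feature whose positive and negative counterfactuals occupy different regions receives a high score. A feature with overlapping distributions scores near zero.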

[341] Parameter Importance-Driven Continual Learning for Foundation Models

Lingxiang Wang, Hainan Zhang, Zhiming Zheng

Main category: cs.LG

TL;DR: PIECE is a parameter-efficient continual learning method that selectively updates only 0.1% of core parameters using importance estimation to prevent catastrophic forgetting while learning new domain knowledge.

DetailsMotivation: Domain-specific post-training causes catastrophic forgetting in foundation models, making them lose general reasoning ability. Traditional continual learning methods have limitations like poor downstream performance, reliance on historical data, or parameter overhead.

Method: PIECE uses parameter importance estimation to selectively update only 0.1% of core parameters relevant to new tasks. It employs two importance estimators: PIECE-F based on Fisher Information, and PIECE-S based on second-order normalization combining gradient and curvature information.

Result: Experiments across three language models and two multimodal models show PIECE maintains general capabilities and achieves state-of-the-art continual learning performance across diverse downstream tasks.

Conclusion: PIECE provides a practical path to scalable, domain-adaptive foundation models without catastrophic forgetting, enabling preservation of general abilities while efficiently learning domain knowledge.

Abstract: Domain-specific post-training often causes catastrophic forgetting, making foundation models lose their general reasoning ability and limiting their adaptability to dynamic real-world environments. Preserving general capabilities while acquiring downstream domain knowledge is a central challenge for large language and multimodal models. Traditional continual learning methods, such as regularization, replay and architectural isolation, suffer from poor downstream performance, reliance on inaccessible historical data, or additional parameter overhead. While recent parameter-efficient tuning (PET) methods can alleviate forgetting, their effectiveness strongly depends on the choice of parameters and update strategies. In this paper, we introduce PIECE, a Parameter Importance Estimation-based Continual Enhancement method that preserves general ability while efficiently learning domain knowledge without accessing prior training data or increasing model parameters. PIECE selectively updates only 0.1% of core parameters most relevant to new tasks, guided by two importance estimators: PIECE-F based on Fisher Information, and PIECE-S based on a second-order normalization that combines gradient and curvature information. Experiments across three language models and two multimodal models show that PIECE maintains general capabilities and achieves state-of-the-art continual learning performance across diverse downstream tasks. Our results highlight a practical path to scalable, domain-adaptive foundation models without catastrophic forgetting.
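
Illustrative sketch (not the PIECE implementation): estimating a diagonal empirical Fisher importance from per-sample gradients, keeping the top 0.1% of parameters, and applying a masked update. The function names, the synthetic gradients, and the plain SGD step are assumptions. The PIECE-S estimator is not attempted.

```python
import numpy as np

def fisher_importance(per_sample_grads):
    """Diagonal empirical Fisher: mean of squared per-sample gradients."""
    return np.mean(np.square(per_sample_grads), axis=0)

def top_fraction_mask(importance, fraction):
    """Boolean mask selecting the given fraction of most important parameters."""
    k = max(1, int(round(fraction * importance.size)))
    idx = np.argpartition(importance, -k)[-k:]
    mask = np.zeros(importance.size, dtype=bool)
    mask[idx] = True
    return mask

def masked_sgd_step(params, grad, mask, lr):
    """Update only the selected parameters; all others stay frozen."""
    return params - lr * np.where(mask, grad, 0.0)

# Synthetic gradients in which 10 of 10,000 parameters dominate (hypothetical).
rng = np.random.default_rng(0)
n_params = 10_000
scale = np.ones(n_params)
scale[:10] = 50.0
grads = rng.normal(size=(32, n_params)) * scale

importance = fisher_importance(grads)
mask = top_fraction_mask(importance, 0.001)          # 0.1% of parameters
params = np.zeros(n_params)
new_params = masked_sgd_step(params, grads.mean(axis=0), mask, lr=0.1)
```

Because the mask is fixed before training, the frozen parameters are bit-for-bit unchanged, which is one simple way to limit forgetting without storing old data or adding parameters.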

[342] EVA-Net: Interpretable Brain Age Prediction via Continuous Aging Prototypes from EEG

Kunyu Zhang, Mingxuan Wang, Xiangjie Shi, Haoxing Xu, Chao Zhang

Main category: cs.LG

TL;DR: EVA-Net is an interpretable framework that treats brain age estimation as an anomaly detection problem, using transformers and variational information bottleneck to handle imperfect EEG data and learn healthy aging patterns.

DetailsMotivation: Existing EEG-based brain age models struggle with imperfect medical data and lack interpretability, making it difficult to identify disease through anomaly detection from healthy baselines.

Method: Uses sparsified-attention Transformer for long EEG sequences, Variational Information Bottleneck for robust representations, and continuous prototype network to learn normative healthy aging manifold.

Result: Achieved state-of-the-art accuracy when trained on 1297 healthy subjects; an unseen cohort of 27 MCI and AD patients showed significantly higher brain-age gaps and a higher Prototype Alignment Error, a metric introduced in the paper.

Conclusion: EVA-Net provides an interpretable framework for healthcare intelligence using imperfect medical data by explicitly modeling healthy aging patterns and detecting deviations.

Abstract: The brain age is a key indicator of brain health. While electroencephalography (EEG) is a practical tool for this task, existing models struggle with the common challenge of imperfect medical data, such as learning a “normal” baseline from weakly supervised, healthy-only cohorts. This is a critical anomaly detection task for identifying disease, but standard models are often black boxes lacking an interpretable structure. We propose EVA-Net, a novel framework that recasts brain age as an interpretable anomaly detection problem. EVA-Net uses an efficient, sparsified-attention Transformer to model long EEG sequences. To handle noise and variability in imperfect data, it employs a Variational Information Bottleneck to learn a robust, compressed representation. For interpretability, this representation is aligned to a continuous prototype network that explicitly learns the normative healthy aging manifold. Trained on 1297 healthy subjects, EVA-Net achieves state-of-the-art accuracy. We validated its anomaly detection capabilities on an unseen cohort of 27 MCI and AD patients. This pathological group showed significantly higher brain-age gaps and a novel Prototype Alignment Error, confirming their deviation from the healthy manifold. EVA-Net provides an interpretable framework for healthcare intelligence using imperfect medical data.

[343] Proximal Approximate Inference in State-Space Models

Hany Abdulsamad, Ángel F. García-Fernández, Simo Särkkä

Main category: cs.LG

TL;DR: A variational Lagrangian approach for nonlinear, non-Gaussian state estimation using entropic trust-region updates with dynamic constraints.

DetailsMotivation: To develop efficient state estimation algorithms for nonlinear, non-Gaussian state-space models that overcome limitations of existing methods.

Method: Variational Lagrangian formulation with entropic trust-region updates, forward-backward algorithms based on posterior factorization, Gauss-Markov approximations, generalized statistical linear regression, and Fourier-Hermite moment matching.

Result: Derived recursive schemes with favorable computational complexity for nonlinear, non-Gaussian state estimation problems.

Conclusion: The proposed framework provides a principled approach to Bayesian inference in challenging nonlinear, non-Gaussian state-space models through variational methods and moment matching techniques.

Abstract: We present a class of algorithms for state estimation in nonlinear, non-Gaussian state-space models. Our approach is based on a variational Lagrangian formulation that casts Bayesian inference as a sequence of entropic trust-region updates subject to dynamic constraints. This framework gives rise to a family of forward-backward algorithms, whose structure is determined by the chosen factorization of the variational posterior. By focusing on Gauss–Markov approximations, we derive recursive schemes with favorable computational complexity. For general nonlinear, non-Gaussian models we close the recursions using generalized statistical linear regression and Fourier–Hermite moment matching.

[344] Towards Understanding Layer Contributions in Tabular In-Context Learning Models

Amir Rezaei Balef, Mykhailo Koshil, Katharina Eggensperger

Main category: cs.LG

TL;DR: Analyzes layer contributions in tabular ICL models, identifies redundant layers, and compares with LLMs using the “layers as painters” perspective.

DetailsMotivation: Despite architectural similarities between tabular ICL models and LLMs, little is known about how individual layers contribute to tabular prediction.

Method: Analyze TabPFN and TabICL through the “layers as painters” perspective to study latent space evolution and layer redundancy.

Result: Only subsets of layers share a common representational language, suggesting structural redundancy in tabular ICL models.

Conclusion: Findings offer opportunities for model compression and improved interpretability in tabular ICL models.

Abstract: Despite the architectural similarities between tabular in-context learning (ICL) models and large language models (LLMs), little is known about how individual layers contribute to tabular prediction. In this paper, we investigate how the latent spaces evolve across layers in tabular ICL models, identify potential redundant layers, and compare these dynamics with those observed in LLMs. We analyze TabPFN and TabICL through the “layers as painters” perspective, finding that only subsets of layers share a common representational language, suggesting structural redundancy and offering opportunities for model compression and improved interpretability.

[345] TSFM in-context learning for time-series classification of bearing-health status

Michel Tokic, Slobodan Djukanović, Anja von Beuningen, Cheng Feng

Main category: cs.LG

TL;DR: Classification method using in-context learning in time-series foundation models without fine-tuning, applied to bearing health assessment in servo-press motors.

DetailsMotivation: To enable classification of new data not in training corpus without model fine-tuning, moving beyond custom narrow AI solutions towards broader AI-driven maintenance systems.

Method: Transform frequency domain signals into pseudo time-series patterns, generate aligned covariate and target signals, use TSFM to predict classification probabilities through in-context learning with examples in prompts.

Result: Method demonstrates efficacy across varied operational conditions by leveraging pre-trained model scalability.

Conclusion: Significant progress beyond custom narrow AI solutions towards broader, AI-driven maintenance systems using in-context learning with time-series foundation models.

Abstract: This paper introduces a classification method using in-context learning in time-series foundation models (TSFM). We show how data that was not part of the TSFM training corpus can be classified without fine-tuning the model. Examples are represented in the form of targets (class id) and covariates (data matrix) within the prompt of the model, which makes it possible to classify an unknown covariate data pattern alongside the forecast axis through in-context learning. We apply this method to vibration data for assessing the health state of a bearing within a servo-press motor. The method transforms frequency domain reference signals into pseudo time-series patterns, generates aligned covariate and target signals, and uses the TSFM to predict the probabilities with which the classified data correspond to predefined labels. Leveraging the scalability of pre-trained models, this method demonstrates efficacy across varied operational conditions. This marks significant progress beyond custom narrow AI solutions towards broader, AI-driven maintenance systems.

[346] FairEnergy: Contribution-Based Fairness meets Energy Efficiency in Federated Learning

Ouiame Marnissi, Hajar EL Hammouti, El Houcine Bergou

Main category: cs.LG

TL;DR: FairEnergy is a fairness-aware energy minimization framework for federated learning that optimizes device selection, bandwidth allocation, and compression to achieve high accuracy while reducing energy consumption by up to 79%.

DetailsMotivation: Federated learning faces challenges in balancing energy efficiency, fair participation, and model accuracy in wireless edge systems due to heterogeneous resources, unequal client contributions, and limited communication capacity.

Method: Proposes FairEnergy framework that integrates contribution scores (considering update magnitude and compression ratio) into joint optimization. Solves mixed-integer non-convex problem via binary variable relaxation, Lagrangian decomposition, and per-device subproblem optimization.

Result: Experiments on non-IID data show FairEnergy achieves higher accuracy while reducing energy consumption by up to 79% compared to baseline strategies.

Conclusion: FairEnergy effectively addresses the trade-off between energy efficiency, fairness, and model accuracy in federated learning systems.

Abstract: Federated learning (FL) enables collaborative model training across distributed devices while preserving data privacy. However, balancing energy efficiency and fair participation while ensuring high model accuracy remains challenging in wireless edge systems due to heterogeneous resources, unequal client contributions, and limited communication capacity. To address these challenges, we propose FairEnergy, a fairness-aware energy minimization framework that integrates a contribution score capturing both the magnitude of updates and their compression ratio into the joint optimization of device selection, bandwidth allocation, and compression level. The resulting mixed-integer non-convex problem is solved by relaxing binary selection variables and applying Lagrangian decomposition to handle global bandwidth coupling, followed by per-device subproblem optimization. Experiments on non-IID data show that FairEnergy achieves higher accuracy while reducing energy consumption by up to 79% compared to baseline strategies.

[347] NTK-Guided Implicit Neural Teaching

Chen Zhang, Wei Zuo, Bingyang Cheng, Yikun Wang, Wei-Bin Kou, Yik Chung WU, Ngai Wong

Main category: cs.LG

TL;DR: NINT accelerates implicit neural representation training by dynamically selecting coordinates using neural tangent kernel guidance to maximize global functional updates, reducing training time by nearly half while maintaining quality.

DetailsMotivation: Implicit Neural Representations require optimizing over millions of coordinates for high-resolution signals, which incurs prohibitive computational costs that need to be addressed.

Method: Proposes NTK-Guided Implicit Neural Teaching (NINT) that dynamically selects coordinates using Neural Tangent Kernel to score examples by the norm of their NTK-augmented loss gradients, capturing both fitting errors and heterogeneous leverage.

Result: NINT significantly reduces training time by nearly half while maintaining or improving representation quality, establishing state-of-the-art acceleration among recent sampling-based strategies.

Conclusion: NINT provides an effective method for accelerating implicit neural representation training through intelligent coordinate selection guided by neural tangent kernel analysis.

Abstract: Implicit Neural Representations (INRs) parameterize continuous signals via multilayer perceptrons (MLPs), enabling compact, resolution-independent modeling for tasks like image, audio, and 3D reconstruction. However, fitting high-resolution signals demands optimizing over millions of coordinates, incurring prohibitive computational costs. To address this, we propose NTK-Guided Implicit Neural Teaching (NINT), which accelerates training by dynamically selecting coordinates that maximize global functional updates. Leveraging the Neural Tangent Kernel (NTK), NINT scores examples by the norm of their NTK-augmented loss gradients, capturing both fitting errors and heterogeneous leverage (self-influence and cross-coordinate coupling). This dual consideration enables faster convergence compared to existing methods. Through extensive experiments, we demonstrate that NINT reduces training time by nearly half while maintaining or improving representation quality, establishing state-of-the-art acceleration among recent sampling-based strategies.
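
Illustrative sketch (one possible reading, not the paper's exact scoring rule): with an empirical NTK K = J J^T, training on coordinate i changes the fitted function in proportion to K[:, i] times its residual. Scoring by the norm of that column therefore combines fitting error with self-influence and cross-coordinate coupling. The random-feature model, the score definition, and the batch size are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 256, 64
coords = np.linspace(0.0, 1.0, n)
target = np.sin(2 * np.pi * 3 * coords)

# Hypothetical stand-in model: random Fourier features, linear in its parameters,
# so the Jacobian of the outputs with respect to the parameters is the feature matrix.
freqs = rng.normal(scale=10.0, size=m)
phases = rng.uniform(0.0, 2 * np.pi, size=m)
J = np.cos(np.outer(coords, freqs) + phases) / np.sqrt(m)    # shape (n, m)
theta = np.zeros(m)

def ntk_scores(jac, residual):
    """Score each coordinate by |residual_i| * ||K[:, i]|| with K = jac @ jac.T."""
    K = jac @ jac.T                                          # empirical NTK, (n, n)
    return np.abs(residual) * np.linalg.norm(K, axis=0)

residual = J @ theta - target
scores = ntk_scores(J, residual)
selected = np.argsort(-scores)[:32]      # coordinates to train on in this step
```

Forming the full n-by-n kernel is only practical for small n. A real implementation over millions of coordinates would need an approximation, which this sketch does not address.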

[348] Sample-Adaptivity Tradeoff in On-Demand Sampling

Nika Haghtalab, Omar Montasser, Mingda Qiao

Main category: cs.LG

TL;DR: The paper studies the tradeoff between sample complexity and round complexity in on-demand sampling for Multi-Distribution Learning, showing optimal sample complexity scaling and introducing the OODS framework.

DetailsMotivation: To understand the fundamental tradeoff between sample complexity and round complexity in adaptive sampling from multiple distributions, particularly in the context of Multi-Distribution Learning.

Method: Introduces the Optimization via On-Demand Sampling (OODS) framework that abstracts sample-adaptivity tradeoffs, analyzes round complexity bounds, and develops algorithms for both realizable and agnostic MDL settings.

Result: For realizable MDL, optimal sample complexity scales as dk^{Θ(1/r)}/ε; for agnostic MDL, achieves near-optimal sample complexity of Õ((d+k)/ε²) within Õ(√k) rounds; establishes tight bounds on round complexity in OODS framework.

Conclusion: The OODS framework captures most existing MDL algorithms, and its lower bounds imply that achieving sub-polynomial round complexity would require fundamentally new techniques that bypass the inherent hardness of OODS.

Abstract: We study the tradeoff between sample complexity and round complexity in on-demand sampling, where the learning algorithm adaptively samples from $k$ distributions over a limited number of rounds. In the realizable setting of Multi-Distribution Learning (MDL), we show that the optimal sample complexity of an $r$-round algorithm scales approximately as $dk^{\Theta(1/r)} / \varepsilon$. For the general agnostic case, we present an algorithm that achieves near-optimal sample complexity of $\widetilde O((d + k) / \varepsilon^2)$ within $\widetilde O(\sqrt{k})$ rounds. Of independent interest, we introduce a new framework, Optimization via On-Demand Sampling (OODS), which abstracts the sample-adaptivity tradeoff and captures most existing MDL algorithms. We establish nearly tight bounds on the round complexity in the OODS setting. The upper bounds directly yield the $\widetilde O(\sqrt{k})$-round algorithm for agnostic MDL, while the lower bounds imply that achieving sub-polynomial round complexity would require fundamentally new techniques that bypass the inherent hardness of OODS.

[349] PCARNN-DCBF: Minimal-Intervention Geofence Enforcement for Ground Vehicles

Yinan Yu, Samuel Scheidegger

Main category: cs.LG

TL;DR: PCARNN-DCBF is a novel pipeline combining a physics-encoded control-affine neural network with preview-based discrete control barrier functions for runtime geofencing of ground vehicles, outperforming existing methods.

DetailsMotivation: Existing solutions struggle to reconcile high-fidelity learning with structural requirements for verifiable control in runtime geofencing for ground vehicles.

Method: Integrates Physics-encoded Control-Affine Residual Neural Network (PCARNN) with preview-based Discrete Control Barrier Function (DCBF), preserving control-affine structure of vehicle dynamics and enabling real-time Quadratic Program optimization.

Result: Experiments in CARLA across electric and combustion platforms show significant performance improvements over analytical and unstructured neural baselines.

Conclusion: The structure-preserving approach successfully enables reliable optimization and enforcement of polygonal keep-in constraints while handling high relative degree and mitigating actuator saturation.

Abstract: Runtime geofencing for ground vehicles is rapidly emerging as a critical technology for enforcing Operational Design Domains (ODDs). However, existing solutions struggle to reconcile high-fidelity learning with the structural requirements of verifiable control. We address this by introducing PCARNN-DCBF, a novel pipeline integrating a Physics-encoded Control-Affine Residual Neural Network with a preview-based Discrete Control Barrier Function. Unlike generic learned models, PCARNN explicitly preserves the control-affine structure of vehicle dynamics, ensuring the linearity required for reliable optimization. This enables the DCBF to enforce polygonal keep-in constraints via a real-time Quadratic Program (QP) that handles high relative degree and mitigates actuator saturation. Experiments in CARLA across electric and combustion platforms demonstrate that this structure-preserving approach significantly outperforms analytical and unstructured neural baselines.
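Illustrative sketch (not the authors' implementation): the abstract mentions polygonal keep-in constraints enforced through a discrete control barrier function. A common textbook form of the discrete-time CBF condition is $h(x_{k+1}) \ge (1 - \gamma)\, h(x_k)$ with $0 < \gamma \le 1$, where each polygon edge gives one barrier $h_i(x) = b_i - a_i^\top x$. The function name `dcbf_satisfied`, the unit-square polygon, and the value of `gamma` are assumptions for illustration; the paper's preview mechanism, high-relative-degree handling, and QP are not modeled here.

```python
import numpy as np

def dcbf_satisfied(A, b, x_now, x_next, gamma):
    """Check the textbook discrete CBF condition for a convex polygon.

    The polygon is {x : A @ x <= b}; each row gives a barrier h_i(x) = b_i - a_i @ x.
    The condition h_i(x_next) >= (1 - gamma) * h_i(x_now) must hold for every edge.
    """
    h_now = b - A @ x_now
    h_next = b - A @ x_next
    return bool(np.all(h_next >= (1.0 - gamma) * h_now))

# Hypothetical keep-in region: the unit square 0 <= x <= 1, 0 <= y <= 1.
A = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
b = np.array([1.0, 0.0, 1.0, 0.0])

x_now = np.array([0.5, 0.5])
gentle_step = dcbf_satisfied(A, b, x_now, np.array([0.7, 0.5]), gamma=0.5)
aggressive_step = dcbf_satisfied(A, b, x_now, np.array([0.9, 0.5]), gamma=0.5)
print(gentle_step, aggressive_step)
```

A smaller `gamma` forces the state to approach the boundary more slowly, which is the usual way such a filter trades intervention against conservatism.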

[350] CODE: A global approach to ODE dynamics learning

Nils Wildt, Daniel M. Tartakovsky, Sergey Oladyshkin, Wolfgang Nowak

Main category: cs.LG

TL;DR: CODE (ChaosODE) uses Polynomial Chaos Expansion to learn ODE dynamics from sparse data, outperforming NeuralODE and KernelODE in extrapolation, especially with noise and scarce data.

DetailsMotivation: Traditional ODE modeling requires dense data, but real-world measurements are often sparse. Existing data-driven methods like NeuralODE and KernelODE struggle with extrapolation under noise and limited data.

Method: Proposes ChaosODE (CODE) using arbitrary Polynomial Chaos Expansion (aPCE) to create global orthonormal polynomial representations of ODE right-hand sides from sparse measurements.

Result: CODE shows superior extrapolation capabilities compared to NeuralODE and KernelODE, maintaining performance even with novel initial conditions, noise, and sparse data. Provides practical optimization guidelines.

Conclusion: Polynomial Chaos Expansion offers robust dynamics learning from sparse, noisy data, with better extrapolation than flexible neural/kernel methods that overfit under data scarcity.

Abstract: Ordinary differential equations (ODEs) are a conventional way to describe the observed dynamics of physical systems. Scientists typically hypothesize about dynamical behavior, propose a mathematical model, and compare its predictions to data. However, modern computing and algorithmic advances now enable purely data-driven learning of governing dynamics directly from observations. In data-driven settings, one learns the ODE’s right-hand side (RHS). Dense measurements are often assumed, yet high temporal resolution is typically both cumbersome and expensive. Consequently, one usually has only sparsely sampled data. In this work we introduce ChaosODE (CODE), a Polynomial Chaos ODE Expansion in which we use an arbitrary Polynomial Chaos Expansion (aPCE) for the ODE’s right-hand side, resulting in a global orthonormal polynomial representation of dynamics. We evaluate the performance of CODE in several experiments on the Lotka-Volterra system, across varying noise levels, initial conditions, and predictions far into the future, even on previously unseen initial conditions. CODE exhibits remarkable extrapolation capabilities even when evaluated under novel initial conditions and shows advantages compared to well-examined methods using neural networks (NeuralODE) or kernel approximators (KernelODE) as the RHS representer. We observe that the high flexibility of NeuralODE and KernelODE degrades extrapolation capabilities under scarce data and measurement noise. Finally, we provide practical guidelines for robust optimization of dynamics-learning problems and illustrate them in the accompanying code.
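Illustrative sketch (a simplification, not ChaosODE itself): the abstract describes representing the ODE right-hand side with a global polynomial expansion and testing on Lotka-Volterra. The snippet below swaps in a plain monomial basis and a direct least-squares fit on noise-free RHS evaluations, rather than the data-driven orthonormal aPCE basis and trajectory-based training the paper describes. The parameter values, sample ranges, and function names are assumptions.

```python
import numpy as np

def monomial_features(states):
    """Degree-2 monomial basis in (x, y): [1, x, y, x^2, x*y, y^2]."""
    x, y = states[:, 0], states[:, 1]
    return np.column_stack([np.ones_like(x), x, y, x * x, x * y, y * y])

def lotka_volterra_rhs(states, alpha=1.0, beta=0.5, delta=0.2, gamma=0.8):
    """True right-hand side: dx/dt = alpha*x - beta*x*y, dy/dt = delta*x*y - gamma*y."""
    x, y = states[:, 0], states[:, 1]
    return np.column_stack([alpha * x - beta * x * y, delta * x * y - gamma * y])

rng = np.random.default_rng(0)
states = rng.uniform(0.5, 3.0, size=(20, 2))  # sparse, hypothetical state samples

# One least-squares solve recovers a polynomial surrogate for both RHS components.
coeffs, *_ = np.linalg.lstsq(
    monomial_features(states), lotka_volterra_rhs(states), rcond=None
)
print(np.round(coeffs, 3))
```

Because the Lotka-Volterra RHS is itself a degree-2 polynomial, the fit is exact here; with noisy or trajectory-only data the choice of basis and optimizer matters far more, which is the setting the paper targets.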

[351] Continual Reinforcement Learning for Cyber-Physical Systems: Lessons Learned and Open Challenges

Kim N. Nolle, Ivana Dusparic, Rhodri Cusack, Vinny Cahill

Main category: cs.LG

TL;DR: This paper identifies key challenges in continual reinforcement learning (CRL) through autonomous driving experiments, highlighting issues like catastrophic forgetting, hyperparameter sensitivity, and questions about neural network suitability for CL.

DetailsMotivation: To address the open problem of successfully applying continual learning to reinforcement learning, particularly in non-stationary environments like autonomous driving where agents need to adapt previously learned abilities to new tasks.

Method: Conducted experiments using Proximal Policy Optimisation (PPO) in an autonomous driving environment where agents learned to park in four different angled scenarios sequentially, representing a continual learning setup.

Result: Identified four major challenges: difficulty in finding suitable environment abstractions, oversensitivity to hyperparameters, catastrophic forgetting of previously learned tasks, and inefficient use of neural network capacity.

Conclusion: The challenges question the suitability of neural networks for continual learning and highlight the need for interdisciplinary research between computer science and neuroscience to create robust CRL systems.

Abstract: Continual learning (CL) is a branch of machine learning that aims to enable agents to adapt and generalise previously learned abilities so that these can be reapplied to new tasks or environments. This is particularly useful in multi-task settings or in non-stationary environments, where the dynamics can change over time, and it is especially relevant in cyber-physical systems such as autonomous driving. However, despite recent advances in CL, successfully applying it to reinforcement learning (RL) is still an open problem. This paper highlights open challenges in continual RL (CRL) based on experiments in an autonomous driving environment. In this environment, the agent must learn to successfully park in four different scenarios corresponding to parking spaces oriented at varying angles. The agent is trained in these four scenarios one after another, representing a CL environment, using Proximal Policy Optimisation (PPO). These experiments exposed a number of open challenges in CRL: finding suitable abstractions of the environment, oversensitivity to hyperparameters, catastrophic forgetting, and efficient use of neural network capacity. Based on these identified challenges, we present open research questions that need to be addressed to create robust CRL systems. In addition, the identified challenges call into question the suitability of neural networks for CL. We also identify the need for interdisciplinary research, in particular between computer science and neuroscience.

[352] DeepThinkVLA: Enhancing Reasoning Capability of Vision-Language-Action Models

Cheng Yin, Yankai Lin, Wang Xu, Sikyuen Tam, Xiangrui Zeng, Zhiyuan Liu, Zhouping Yin

Main category: cs.LG

TL;DR: DeepThinkVLA introduces a hybrid-attention decoder architecture that separates Chain-of-Thought reasoning from action generation, using causal attention for reasoning and bidirectional attention for parallel action decoding, achieving 97.0% success rate on LIBERO benchmark.

DetailsMotivation: Existing VLA models struggle with the architectural conflict between sequential CoT reasoning and parallelizable robot actions, which degrades motor control and weakens the causal link between thought and action.

Method: Hybrid-attention decoder with causal attention for CoT reasoning and bidirectional attention for action vectors, plus two-stage training: SFT for foundational reasoning followed by RL with task-success rewards for causal alignment.

Result: Achieves state-of-the-art 97.0% success rate on LIBERO benchmark, with hybrid architecture alone outperforming standard decoders by 15.5% and RL providing additional 2% performance boost.

Conclusion: The integrated architecture and training strategy successfully resolves the fundamental conflict in VLA models, enabling effective “thinking before acting” through proper separation of reasoning and action generation.

Abstract: Enabling Vision-Language-Action (VLA) models to “think before acting” via Chain-of-Thought (CoT) is a promising path to overcoming the data-hungry nature of end-to-end robot policies. However, progress is stalled by a fundamental conflict: existing models use a single autoregressive decoder for both sequential CoT reasoning and high-dimensional, parallelizable robot actions. This architectural mismatch degrades motor control and fails to forge a strong causal link between thought and action. We introduce DeepThinkVLA, which resolves this conflict through a tightly integrated architecture and training strategy. Architecturally, our hybrid-attention decoder generates sequential CoT with causal attention and then switches to bidirectional attention for fast, parallel decoding of action vectors. This design is complemented by a two-stage training pipeline: we first use Supervised Fine-Tuning (SFT) to teach the model foundational reasoning, then apply Reinforcement Learning (RL) with task-success rewards to causally align the full reasoning-action sequence with desired outcomes. This synergy leads to state-of-the-art performance, achieving a 97.0% success rate on the LIBERO benchmark. Our ablations confirm the design’s effectiveness: the hybrid architecture alone outperforms standard decoders by 15.5%, and the final RL stage provides a crucial 2% boost to secure top performance.
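Illustrative sketch (not the authors' code): the hybrid-attention idea in the abstract, causal attention over reasoning tokens followed by bidirectional attention over action tokens, can be pictured as a single attention mask. The layout below (reasoning tokens first, action tokens last) and the function name are assumptions; the paper's actual token layout and any prefix tokens may differ.

```python
import numpy as np

def hybrid_attention_mask(num_reasoning: int, num_action: int) -> np.ndarray:
    """Boolean mask; mask[i, j] is True when query position i may attend to key position j."""
    total = num_reasoning + num_action
    # Start from a causal (lower-triangular) mask over the whole sequence.
    mask = np.tril(np.ones((total, total), dtype=bool))
    # Action positions additionally see every other action position (bidirectional block),
    # which is what allows them to be decoded in parallel.
    mask[num_reasoning:, num_reasoning:] = True
    return mask

mask = hybrid_attention_mask(num_reasoning=3, num_action=2)
print(mask.astype(int))
```

Reasoning positions never see action positions, so the chain of thought is still generated left to right, while the action block conditions on the full reasoning trace.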

[353] Walrus: A Cross-Domain Foundation Model for Continuum Dynamics

Michael McCabe, Payel Mukhopadhyay, Tanya Marwah, Bruno Regaldo-Saint Blancard, Francois Rozet, Cristiana Diaconu, Lucas Meyer, Kaze W. K. Wong, Hadi Sotoudeh, Alberto Bietti, Irina Espejo, Rio Fear, Siavash Golkar, Tom Hehir, Keiya Hirashima, Geraud Krawezik, Francois Lanusse, Rudy Morel, Ruben Ohana, Liam Parker, Mariel Pettee, Jeff Shen, Kyunghyun Cho, Miles Cranmer, Shirley Ho

Main category: cs.LG

TL;DR: Walrus is a transformer-based foundation model for fluid-like continuum dynamics that addresses challenges in physical simulation through stabilization methods, distributed training strategies, and adaptive tokenization, achieving superior performance across diverse domains.

DetailsMotivation: Foundation models have revolutionized language and vision domains, but face challenges in physical simulation due to data heterogeneity, unstable long-term dynamics, and varying resolutions/dimensionalities that hinder efficient training.

Method: Developed Walrus using harmonic-analysis-based stabilization, load-balanced distributed 2D/3D training strategies, and compute-adaptive tokenization. Pretrained on 19 diverse scenarios across astrophysics, geoscience, rheology, plasma physics, acoustics, and classical fluids.

Result: Walrus outperforms prior foundation models on both short and long-term prediction horizons across downstream tasks and pretraining data. Ablation studies confirm improvements in forecast stability, training throughput, and transfer performance.

Conclusion: The proposed approaches successfully mitigate obstacles in physical simulation foundation models, with Walrus demonstrating superior performance and the code/weights being released for community use.

Abstract: Foundation models have transformed machine learning for language and vision, but achieving comparable impact in physical simulation remains a challenge. Data heterogeneity and unstable long-term dynamics inhibit learning from sufficiently diverse dynamics, while varying resolutions and dimensionalities challenge efficient training on modern hardware. Through empirical and theoretical analysis, we incorporate new approaches to mitigate these obstacles, including a harmonic-analysis-based stabilization method, load-balanced distributed 2D and 3D training strategies, and compute-adaptive tokenization. Using these tools, we develop Walrus, a transformer-based foundation model developed primarily for fluid-like continuum dynamics. Walrus is pretrained on nineteen diverse scenarios spanning astrophysics, geoscience, rheology, plasma physics, acoustics, and classical fluids. Experiments show that Walrus outperforms prior foundation models on both short and long term prediction horizons on downstream tasks and across the breadth of pretraining data, while ablation studies confirm the value of our contributions to forecast stability, training throughput, and transfer performance over conventional approaches. Code and weights are released for community use.

[354] The Impact of Quantization on Large Reasoning Model Reinforcement Learning

Medha Kumar, Zifei Xu, Xin Wang, Tristan Webb

Main category: cs.LG

TL;DR: Quantization-aware RL training hurts reasoning performance in large models, while post-training quantization and QLoRA work better.

DetailsMotivation: To understand how quantization impacts reinforcement learning in large reasoning models, since existing quantization methods are well-studied for fine-tuning but not for RL.

Method: Conducted systematic experiments comparing post-RL quantization vs quantization-aware RL training on mathematical benchmarks.

Result: Found significant performance gap: quantization-aware RL training negatively impacted learning, while PTQ and QLoRA achieved better results.

Conclusion: Quantization-aware RL training is detrimental to reasoning capabilities, whereas post-training methods like PTQ and QLoRA are more effective for quantized large reasoning models.

Abstract: Strong reasoning capabilities can now be achieved by large-scale reinforcement learning (RL) without any supervised fine-tuning. Although post-training quantization (PTQ) and quantization-aware training (QAT) are well studied in the context of fine-tuning, how quantization impacts RL in large reasoning models (LRMs) remains an open question. To answer this question, we conducted systematic experiments and discovered a significant gap in reasoning performance on mathematical benchmarks between post-RL quantized models and their quantization-aware RL optimized counterparts. Our findings suggest that quantization-aware RL training negatively impacted the learning process, whereas PTQ and QLoRA led to greater performance.

[355] Planning-Aware Code Infilling via Horizon-Length Prediction

Yifeng Ding, Hantian Ding, Shiqi Wang, Qing Sun, Varun Kumar, Zijian Wang

Main category: cs.LG

TL;DR: Proposes Horizon-Length Prediction (HLP) to improve Fill-in-the-Middle (FIM) training by teaching models to predict remaining middle tokens, enhancing planning and alignment with context.

DetailsMotivation: Current FIM training with next-token prediction struggles to generate content that aligns well with surrounding context, especially for planning with distant right context.

Method: Introduces Horizon-Length Prediction (HLP) objective that predicts number of remaining middle tokens at each step, enabling lookahead planning without dataset-specific post-processing.

Result: HLP improves FIM performance by up to 24% across diverse benchmarks and model sizes, enhances code reasoning, with negligible training overhead and no inference cost.

Conclusion: HLP effectively addresses FIM limitations by teaching models infilling boundaries and planning, making it practical for real-world code generation scenarios.

Abstract: Fill-in-the-Middle (FIM), or infilling, has become integral to code language models, enabling generation of missing code given both left and right contexts. However, the current FIM training paradigm, which performs next-token prediction (NTP) over a reordered sequence, often leads to models that struggle to generate content that aligns well with the surrounding context. We hypothesize that NTP alone is insufficient for models to learn effective planning conditioned on the distant right context, a critical factor for successful code infilling. To overcome this, we propose Horizon-Length Prediction (HLP), a novel training objective that teaches models to predict the number of remaining middle tokens at each step. HLP advances FIM with lookahead planning, enabling models to inherently learn infilling boundaries for arbitrary left and right contexts without relying on dataset-specific post-processing. Our evaluation across different model families and sizes shows that HLP significantly improves FIM performance, by up to 24% in relative terms, on diverse benchmarks at both file level and repository level. Furthermore, the enhanced planning capability gained through HLP boosts model performance on code reasoning. Importantly, HLP incurs negligible training overhead and no additional inference cost, ensuring its practicality for real-world scenarios.
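Illustrative sketch (assumptions labeled): the HLP objective asks the model to predict how many middle tokens remain at each generation step. The helper below builds such per-step targets for a middle span of known length. Whether the count includes the current token, and whether it is normalized by the span length, are assumptions made here for illustration; the function name is hypothetical.

```python
def horizon_length_targets(num_middle_tokens: int, normalize: bool = True) -> list[float]:
    """Per-step regression targets for a middle span of known length.

    At step t (0-indexed) the target is the number of middle tokens still to be
    generated, counting the current one (an assumed convention). With
    normalize=True the count is divided by the span length so targets lie in (0, 1].
    """
    if num_middle_tokens <= 0:
        raise ValueError("num_middle_tokens must be positive")
    targets = []
    for t in range(num_middle_tokens):
        remaining = num_middle_tokens - t
        targets.append(remaining / num_middle_tokens if normalize else float(remaining))
    return targets

print(horizon_length_targets(4))                   # [1.0, 0.75, 0.5, 0.25]
print(horizon_length_targets(4, normalize=False))  # [4.0, 3.0, 2.0, 1.0]
```

Such targets would typically feed an auxiliary regression head trained alongside next-token prediction and dropped at inference, which is consistent with the abstract's claim of no additional inference cost.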

[356] Privacy Preserving In-Context-Learning Framework for Large Language Models

Bishnu Bhusal, Manoj Acharya, Ramneet Kaur, Colin Samplawski, Anirban Roy, Adam D. Cobb, Rohit Chadha, Susmit Jha

Main category: cs.LG

TL;DR: A novel private prediction framework for generating high-quality synthetic text with strong privacy guarantees using Differential Privacy, without requiring model fine-tuning.

DetailsMotivation: Address privacy concerns in LLMs where adversaries can extract sensitive information from prompts, ensuring protection against information leakage.

Method: Leverages Differential Privacy framework, performs inference on private records, aggregates per-token output distributions, and uses blending operation to combine private and public inference.

Result: Outperforms previous state-of-the-art methods on in-context-learning tasks, generates longer and coherent synthetic text while maintaining privacy guarantees.

Conclusion: Promising direction for privacy-preserving text generation that maintains high utility without compromising privacy.

Abstract: Large language models (LLMs) have significantly transformed natural language understanding and generation, but they raise privacy concerns due to potential exposure of sensitive information. Studies have highlighted the risk of information leakage, where adversaries can extract sensitive information embedded in the prompts. In this work, we introduce a novel private prediction framework for generating high-quality synthetic text with strong privacy guarantees. Our approach leverages the Differential Privacy (DP) framework to ensure worst-case theoretical bounds on information leakage without requiring any fine-tuning of the underlying models. The proposed method performs inference on private records and aggregates the resulting per-token output distributions. This enables the generation of longer and coherent synthetic text while maintaining privacy guarantees. Additionally, we propose a simple blending operation that combines private and public inference to further enhance utility. Empirical evaluations demonstrate that our approach outperforms previous state-of-the-art methods on in-context-learning (ICL) tasks, making it a promising direction for privacy-preserving text generation while maintaining high utility. Our code is available at https://github.com/bhusalb/privacy-preserving-icl.
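Illustrative sketch (not the released code, and not differentially private on its own): the abstract describes aggregating per-token output distributions obtained from private records and blending them with a public inference. The snippet shows only that arithmetic for a single token position. The simple mean, the linear blend, the `blend_weight` parameter, and the function name are assumptions; the clipping, noise, or sampling mechanism that would actually provide a DP guarantee is deliberately omitted.

```python
import numpy as np

def aggregate_and_blend(private_dists, public_dist, blend_weight):
    """Average per-record next-token distributions, then mix with a public distribution.

    private_dists: array of shape (num_records, vocab_size), one row per private record.
    public_dist:   array of shape (vocab_size,), from inference without private data.
    blend_weight:  weight in [0, 1] placed on the private aggregate (hypothetical knob).
    """
    private_dists = np.asarray(private_dists, dtype=float)
    public_dist = np.asarray(public_dist, dtype=float)
    mean_private = private_dists.mean(axis=0)
    blended = blend_weight * mean_private + (1.0 - blend_weight) * public_dist
    return blended / blended.sum()  # renormalize to guard against rounding drift

private = [[0.7, 0.2, 0.1], [0.5, 0.4, 0.1]]
public = [0.2, 0.2, 0.6]
print(aggregate_and_blend(private, public, blend_weight=0.5))
```

Averaging limits how much any single private record can move the output distribution, which is the property a DP mechanism would then bound formally.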

[357] Metis-SPECS: Decoupling Multimodal Learning via Self-distilled Preference-based Cold Start

Kun Chen, Peng Shi, Haibo Qiu, Zhixiong Zeng, Siqi Yang, Wenji Mao, Lin Ma

Main category: cs.LG

TL;DR: SPECS is a self-distilled preference-based cold start framework that replaces SFT with preference learning to improve generalization before RL training, achieving significant performance gains on multimodal benchmarks.

DetailsMotivation: SFT-based cold start in RL for vision language models causes instruction-style overfitting and weakens out-of-distribution generalization, negatively impacting downstream RL performance.

Method: Proposes SPECS framework with three steps: (1) self-distilled preference data generation, (2) preference-based training focusing on surface-form criteria, and (3) handoff to RL with verifiable rewards for deep reasoning.

Result: Improves MEGA-Bench by 4.1% and MathVista by 12.2%, reduces in-distribution “stuckness,” improves exploration, stabilizes training, and raises performance ceiling.

Conclusion: Preference-based cold start methods generalize better than SFT-based approaches, and the decoupled learning framework in SPECS provides consistent performance gains across multimodal benchmarks.

Abstract: Reinforcement learning (RL) with verifiable rewards has recently catalyzed a wave of “MLLM-r1” approaches that bring RL to vision language models. Most representative paradigms begin with a cold start, typically employing supervised fine-tuning (SFT), to initialize the policy before RL. However, SFT-based cold start adopts a reasoning paradigm intertwined with task solution and output format, which may induce instruction-style overfitting, weaken out-of-distribution generalization, and ultimately affect downstream RL. We revisit the cold start along two views, its training method and data construction, and introduce the Generalization Factor (GF) coefficient to quantify the generalization capability under different methods. Our empirical study finds that preference-based training methods (e.g., DPO) generalize better than SFT-based methods in cold start. Motivated by this, we propose SPECS, a Self-distilled, Preference-based Cold Start framework that decouples multimodal learning: (1) it generates introspective preference data pairs via self-distillation, avoiding reliance on larger teachers or manual annotation; (2) it performs preference-based training that focuses on shallow, transferable surface-form criteria (format, structure, style) rather than memorizing content; and (3) it hands off to RL with verifiable rewards for deep reasoning results. Experimental results across multiple multimodal benchmarks show that our decoupled learning framework yields consistent performance gains over strong baselines, improving MEGA-Bench by 4.1% and MathVista by 12.2%. Additional experiments indicate that SPECS contributes to reducing in-distribution “stuckness,” improving exploration, stabilizing training, and raising the performance ceiling.

[358] Better LLM Reasoning via Dual-Play

Zhengxin Zhang, Chengyu Huang, Aochong Oliver Li, Claire Cardie

Main category: cs.LG

TL;DR: PasoDoble is a novel dual-play framework for LLMs that adversarially trains two models - a Proposer generating challenging questions and a Solver answering them - without external supervision, improving reasoning performance while addressing reward hacking and training instability.

DetailsMotivation: Current LLMs rely heavily on external supervision through RLVR, while adversarial learning via self-play offers an alternative to reduce this dependency. Dual-play enables sustained competition between specialized models but adapting it to LLMs faces challenges like reward hacking and training instability.

Method: PasoDoble trains two models from the same base: a Proposer generates questions with ground-truth answers enriched by pre-training data, and a Solver attempts to solve them. The Proposer is rewarded for valid challenging questions, the Solver for correct answers, with joint updates. An optional offline paradigm decouples updates for stability.

Result: Experimental results demonstrate that PasoDoble can improve the reasoning performance of LLMs, operating successfully without supervision during training.

Conclusion: PasoDoble presents an effective dual-play framework that enables LLMs to improve through adversarial self-training, addressing key challenges like reward hacking and instability while reducing reliance on external supervision.

Abstract: Large Language Models (LLMs) have achieved remarkable progress through Reinforcement Learning with Verifiable Rewards (RLVR), yet still rely heavily on external supervision (e.g., curated labels). Adversarial learning, particularly through self-play, offers a promising alternative that enables models to iteratively learn from themselves - thus reducing reliance on external supervision. Dual-play extends adversarial learning by assigning specialized roles to two models and training them against each other, fostering sustained competition and mutual evolution. Despite its promise, adapting dual-play training to LLMs remains limited, largely due to their susceptibility to reward hacking and training instability. In this paper, we introduce PasoDoble, a novel LLM dual-play framework. PasoDoble adversarially trains two models initialized from the same base model: a Proposer, which generates challenging questions with ground-truth answers, and a Solver, which attempts to solve them. We enrich the Proposer with knowledge from a pre-training dataset to ensure the questions’ quality and diversity. To avoid reward hacking, the Proposer is rewarded for producing only valid questions that push the Solver’s limit, while the Solver is rewarded for solving them correctly, and both are updated jointly. To further enhance training stability, we introduce an optional offline paradigm that decouples Proposer and Solver updates, alternately updating each for several steps while holding the other fixed. Notably, PasoDoble operates without supervision during training. Experimental results show that PasoDoble can improve the reasoning performance of LLMs. Our project page is available at https://hcy123902.github.io/PasoDoble.

[359] Resource-Constrained Decentralized Federated Learning via Personalized Event-Triggering

Shahryar Zehtabi, Seyyedali Hosseinalipour, Christopher G. Brinton

Main category: cs.LG

TL;DR: Proposes asynchronous, event-triggered decentralized federated learning with personalized communication triggering to handle resource heterogeneity and statistical diversity in D2D networks.

DetailsMotivation: To enable federated learning in settings without central server access by using device-to-device communications, while addressing challenges of resource heterogeneity and statistical diversity.

Method: Asynchronous event-triggered communications with personalized triggering conditions that weigh local model parameter changes against available network resources, using cooperative consensus formation over D2D networks.

Result: Achieves an $O(\ln{k} / \sqrt{k})$ convergence rate to the globally optimal model, with relaxed graph connectivity and data heterogeneity assumptions compared to existing literature, and demonstrates a $B$-connected information flow guarantee.

Conclusion: The methodology provides substantial improvements in convergence speed and communication savings compared to existing decentralized FL baselines.

Abstract: Federated learning (FL) is a popular technique for distributing machine learning (ML) across a set of edge devices. In this paper, we study fully decentralized FL, where in addition to devices conducting training locally, they carry out model aggregations via cooperative consensus formation over device-to-device (D2D) networks. We introduce asynchronous, event-triggered communications among the devices to handle settings where access to a central server is not feasible. To account for the inherent resource heterogeneity and statistical diversity challenges in FL, we define personalized communication triggering conditions at each device that weigh the change in local model parameters against the available local network resources. We theoretically recover the $O(\ln{k} / \sqrt{k})$ convergence rate to the globally optimal model of decentralized gradient descent (DGD) methods in the setup of our methodology. We provide our convergence guarantees for the last iterates of models, under relaxed graph connectivity and data heterogeneity assumptions compared with the existing literature. To do so, we demonstrate a $B$-connected information flow guarantee in the presence of sporadic communications over the time-varying D2D graph. Our subsequent numerical evaluations demonstrate that our methodology obtains substantial improvements in convergence speed and/or communication savings compared to existing decentralized FL baselines.
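Illustrative sketch (a hypothetical rule, not the paper's exact condition): the abstract describes per-device triggers that weigh the change in local model parameters against available local network resources. One simple way to picture this is a threshold on parameter drift that rises as resources become scarce. The threshold form `base_threshold / resource_level`, the parameter names, and the function name are all assumptions.

```python
import numpy as np

def should_broadcast(current_params, last_sent_params, resource_level, base_threshold=1.0):
    """Decide whether a device should send its model to neighbors.

    resource_level: value in (0, 1]; higher means more bandwidth is available,
                    which lowers the threshold and makes communication more likely.
    """
    if not 0.0 < resource_level <= 1.0:
        raise ValueError("resource_level must be in (0, 1]")
    drift = np.linalg.norm(np.asarray(current_params) - np.asarray(last_sent_params))
    threshold = base_threshold / resource_level  # scarce resources raise the bar
    return bool(drift > threshold)

last_sent = np.zeros(2)
current = np.array([1.5, 0.0])
print(should_broadcast(current, last_sent, resource_level=1.0))  # drift 1.5 vs threshold 1.0
print(should_broadcast(current, last_sent, resource_level=0.5))  # drift 1.5 vs threshold 2.0
```

Because each device evaluates its own drift and its own resource level, communication becomes asynchronous and sporadic, which is why the analysis needs a connectivity guarantee over time rather than at every step.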

[360] Optimal Control of Nonlinear Systems with Unknown Dynamics

Wenjian Hao, Paulo C. Heredia, Shaoshuai Mou

Main category: cs.LG

TL;DR: A data-driven method for finding optimal controllers for systems with unknown dynamics using Koopman operator and actor-critic framework.

DetailsMotivation: To develop optimal controllers for systems with unknown dynamics, overcoming the limitation of classical methods that require known system models.

Method: Combines Koopman operator with actor-critic framework to estimate gradients of cost function with respect to policy parameters, enabling gradient descent optimization.

Result: The method achieves optimal control performance comparable to classical methods but without requiring known system dynamics, as validated through simulations.

Conclusion: The proposed framework successfully enables data-driven optimal control for unknown systems and provides convergence guarantees.

Abstract: This paper presents a data-driven method to find a closed-loop optimal controller, which minimizes a specified infinite-horizon cost function for systems with unknown dynamics. Suppose the closed-loop optimal controller can be parameterized by a given class of functions, hereafter referred to as the policy. The proposed method introduces a novel gradient estimation framework, which approximates the gradient of the cost function with respect to the policy parameters by integrating the Koopman operator with the classical concept of actor-critic. This enables the policy parameters to be tuned iteratively using gradient descent to achieve an optimal controller, leveraging the linearity of the Koopman operator. A convergence analysis of the proposed framework is provided. The control performance of the proposed method is evaluated in simulations against classical optimal control methods, which usually assume the dynamics are known.

[361] Detecting Out-of-Distribution Objects through Class-Conditioned Inpainting

Quang-Huy Nguyen, Jin Peng Zhou, Zhenzhen Liu, Khanh-Huyen Bui, Kilian Q. Weinberger, Wei-Lun Chao, Dung D. Le

Main category: cs.LG

TL;DR: Using text-to-image generative models to detect out-of-distribution objects by measuring reconstruction errors after class-conditioned inpainting.

DetailsMotivation: Real-world object detectors face challenges with novel objects not seen during training, and existing detectors are overconfident, making OOD detection unreliable using predictions alone.

Method: Leverage off-the-shelf text-to-image generative models (e.g., Stable Diffusion) to perform class-conditioned inpainting on detected object regions and measure reconstruction errors between original and inpainted images.

Result: The approach consistently outperforms existing zero-shot and non-zero-shot OOD detection methods across extensive experiments.

Conclusion: The method establishes a robust framework for enhancing object detection systems in dynamic environments by effectively identifying out-of-distribution objects through model inconsistency measurement.

Abstract: Recent object detectors have achieved impressive accuracy in identifying objects seen during training. However, real-world deployment often introduces novel and unexpected objects, referred to as out-of-distribution (OOD) objects, posing significant challenges to model trustworthiness. Modern object detectors are typically overconfident, making it unreliable to use their predictions alone for OOD detection. To address this, we propose leveraging an auxiliary model as a complementary solution. Specifically, we utilize an off-the-shelf text-to-image generative model, such as Stable Diffusion, which is trained with objective functions distinct from those of discriminative object detectors. We hypothesize that this fundamental difference enables the detection of OOD objects by measuring inconsistencies between the models. Concretely, for a given detected object bounding box and its predicted in-distribution class label, we perform class-conditioned inpainting on the image with the object removed. If the object is OOD, the inpainted image is likely to deviate significantly from the original, making the reconstruction error a robust indicator of OOD status. Extensive experiments demonstrate that our approach consistently surpasses existing zero-shot and non-zero-shot OOD detection methods, establishing a robust framework for enhancing object detection systems in dynamic environments.
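Illustrative sketch (the inpainting model is a stand-in): the abstract's procedure is to remove the detected object, inpaint the region conditioned on the predicted class label, and use the reconstruction error as an OOD indicator. The snippet wires up that scoring loop with a placeholder `inpaint_fn` callable; a real pipeline would call a text-to-image inpainting model there. The box convention, the mean-squared-error metric, and all function names are assumptions; the paper may use a different error measure.

```python
import numpy as np

def ood_score(image, box, class_label, inpaint_fn):
    """Reconstruction error inside the box after class-conditioned inpainting.

    box is (x0, y0, x1, y1) in pixel coordinates (assumed convention).
    inpaint_fn(masked_image, box, class_label) must return an image of the same shape.
    """
    x0, y0, x1, y1 = box
    masked = image.copy()
    masked[y0:y1, x0:x1] = 0  # remove the detected object
    inpainted = inpaint_fn(masked, box, class_label)
    diff = image[y0:y1, x0:x1].astype(float) - inpainted[y0:y1, x0:x1].astype(float)
    return float(np.mean(diff ** 2))  # higher error suggests an OOD object

def make_constant_inpainter(value):
    """Toy stand-in for a generative model: fills the box with a constant."""
    def inpaint(masked_image, box, class_label):
        x0, y0, x1, y1 = box
        out = masked_image.copy()
        out[y0:y1, x0:x1] = value
        return out
    return inpaint

image = np.full((4, 4), 10.0)
box = (1, 1, 3, 3)
faithful = ood_score(image, box, "car", make_constant_inpainter(10.0))
deviating = ood_score(image, box, "car", make_constant_inpainter(0.0))
print(faithful, deviating)
```

A detection would then be flagged as OOD when its score exceeds a threshold chosen on held-out in-distribution data.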

[362] Distributed Event-Based Learning via ADMM

Guner Dilsad Er, Sebastian Trimpe, Michael Muehlebach

Main category: cs.LG

TL;DR: Event-triggered distributed learning algorithm that reduces communication by 35%+ while being robust to heterogeneous data distributions and communication failures, outperforming FedAvg, FedProx, SCAFFOLD and FedADMM.

Motivation: To address the high communication costs in distributed learning and enable convergence even when local data distributions across agents are arbitrarily different (non-IID data).

Method: Event-triggered communication strategy where agents only communicate when necessary, making the approach agnostic to data distribution among agents. Analyzed for both convex and nonconvex settings.

Result: Achieved 35% or more communication savings on MNIST and CIFAR-10 datasets, demonstrated resilience to heterogeneous data distributions, and showed superior performance compared to common baselines. Derived accelerated convergence rates for convex case and robustness to communication failures.

Conclusion: The proposed event-based communication strategy effectively reduces communication overhead while maintaining convergence guarantees and robustness to data heterogeneity and communication failures, making it suitable for practical distributed learning scenarios.

Abstract: We consider a distributed learning problem, where agents minimize a global objective function by exchanging information over a network. Our approach has two distinct features: (i) It substantially reduces communication by triggering communication only when necessary, and (ii) it is agnostic to the data-distribution among the different agents. We therefore guarantee convergence even if the local data-distributions of the agents are arbitrarily distinct. We analyze the convergence rate of the algorithm both in convex and nonconvex settings and derive accelerated convergence rates for the convex case. We also characterize the effect of communication failures and demonstrate that our algorithm is robust to these. The article concludes by presenting numerical results from distributed learning tasks on the MNIST and CIFAR-10 datasets. The experiments underline communication savings of 35% or more due to the event-based communication strategy, show resilience towards heterogeneous data-distributions, and highlight that our approach outperforms common baselines such as FedAvg, FedProx, SCAFFOLD and FedADMM.
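The "communicate only when necessary" idea can be illustrated with a toy send-on-delta rule: an agent transmits its local state only when it has drifted more than a threshold from the last value it sent. This is a minimal sketch, not the paper's ADMM-based algorithm; the class name, the threshold, and the drift model are all illustrative assumptions.

```python
import numpy as np


class EventTriggeredSender:
    """Send-on-delta rule: transmit only if the state moved enough since the last send."""

    def __init__(self, threshold):
        self.threshold = threshold  # hypothetical trigger threshold
        self.last_sent = None
        self.num_sent = 0

    def maybe_send(self, state):
        """Return the state if a transmission is triggered, otherwise None."""
        state = np.asarray(state, dtype=float)
        if self.last_sent is None or np.linalg.norm(state - self.last_sent) > self.threshold:
            self.last_sent = state.copy()
            self.num_sent += 1
            return self.last_sent
        return None


def simulate(num_steps=100, threshold=1.0):
    """Count transmissions for a state that drifts by 0.25 per step (assumed toy dynamics)."""
    sender = EventTriggeredSender(threshold)
    state = np.zeros(2)
    for _ in range(num_steps):
        state = state + np.array([0.25, 0.0])
        sender.maybe_send(state)
    return sender.num_sent
```

With a threshold of 0 every step triggers a transmission, while a larger threshold trades communication for staleness of the shared value.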

[363] Explaining Time Series Classification Predictions via Causal Attributions

Juan Miguel Lopez Alcaraz, Nils Strodthoff

Main category: cs.LG

TL;DR: This paper introduces a novel model-agnostic attribution method for time series classification that assesses causal effects of concepts (predefined time series segments) on classification outcomes, comparing causal attributions with associational ones using diffusion models and state space models.

Motivation: Despite strong performance of machine learning models, understanding their decisions remains challenging. Current attribution methods from explainable AI rely on associational rather than causal relationships, limiting their interpretability and reliability.

Method: Proposed a model-agnostic attribution method using state-of-the-art diffusion models backed by state space models to estimate counterfactual outcomes and assess causal effects of concepts in time series classification.

Result: Causal and associational attributions often share similarities but differ in important details across diverse time series classification tasks, highlighting risks of drawing causal conclusions from associational data alone.

Conclusion: The approach demonstrates the limitations of associational attributions and the importance of causal analysis, with potential wide applicability across other domains beyond time series classification.

Abstract: Despite the excellent performance of machine learning models, understanding their decisions remains a long-standing goal. Although commonly used attribution methods from explainable AI attempt to address this issue, they typically rely on associational rather than causal relationships. In this study, within the context of time series classification, we introduce a novel model-agnostic attribution method to assess the causal effect of concepts, i.e., predefined segments within a time series, on classification outcomes. Our approach compares these causal attributions with closely related associational attributions, both theoretically and empirically. To estimate counterfactual outcomes, we use state-of-the-art diffusion models backed by state space models. We demonstrate the insights gained by our approach for a diverse set of qualitatively different time series classification tasks. Although causal and associational attributions might often share some similarities, in all cases they differ in important details, underscoring the risks associated with drawing causal conclusions from associational data alone. We believe that the proposed approach is also widely applicable in other domains to shed some light on the limits of associational attributions.

[364] VeriFlow: Modeling Distributions for Neural Network Verification

Faried Abu Zaid, Daniel Neider, Mustafa Yalçıner

Main category: cs.LG

TL;DR: VeriFlow architecture enables neural network verification to focus on realistic input distributions using flow-based density models with piecewise affine transformations and computable upper density level sets.

Motivation: Current verification methods check neural networks on unrealistic inputs. The goal is to restrict verification to meaningful data distributions that occur in the real world.

Method: Proposed VeriFlow architecture as a flow-based density model with piecewise affine transformations, allowing verifiers to use linear arithmetic constraint solving and compute upper density level sets via linear constraints in latent space.

Result: The architecture enables effective verification with probabilistically interpretable control over input typicality, focusing verification on relevant data distributions rather than arbitrary inputs.

Conclusion: VeriFlow provides a principled approach to neural network verification that restricts analysis to meaningful input distributions while maintaining compatibility with existing verification techniques.

Abstract: Formal verification has emerged as a promising method to ensure the safety and reliability of neural networks. However, many relevant properties, such as fairness or global robustness, pertain to the entire input space. If one applies verification techniques naively, the neural network is checked even on inputs that do not occur in the real world and have no meaning. To tackle this shortcoming, we propose the VeriFlow architecture as a flow-based density model tailored to allow any verification approach to restrict its search to some data distribution of interest. We argue that our architecture is particularly well suited for this purpose because of two major properties. First, we show that the transformation that is defined by our model is piecewise affine. Therefore, the model allows the usage of verifiers based on constraint solving with linear arithmetic. Second, upper density level sets (UDL) of the data distribution are definable via linear constraints in the latent space. As a consequence, representations of UDLs specified by a given probability are effectively computable in the latent space. This property allows for effective verification with a fine-grained, probabilistically interpretable control of how atypical the inputs subject to verification are.

[365] ExDAG: an MIQP Algorithm for Learning DAGs

Pavel Rytir, Ales Wodecki, Jakub Marecek

Main category: cs.LG

TL;DR: ExDAG: A novel mixed-integer quadratic programming algorithm for learning causal DAGs with guaranteed exact learning and real-time quality assessment, outperforming state-of-the-art methods on medium-sized graphs.

Motivation: Growing interest in causal learning and limitations of existing integer programming approaches for DAG identification due to super-exponential constraints preventing cycle formation.

Method: Mixed-integer quadratic programming formulation with branch-and-bound-and-cut algorithm using lazy constraints methodology to selectively generate only violated constraints, avoiding the super-exponential constraint problem.

Result: Empirical results show ExDAG outperforms state-of-the-art solvers in structural Hamming distance and F1 score on medium-sized graphs with Gaussian noise.

Conclusion: The proposed algorithm successfully circumvents scaling limitations of previous integer programming approaches and provides exact learning guarantees with real-time solution quality assessment.

Abstract: There has been a growing interest in causal learning in recent years. Commonly used representations of causal structures, including Bayesian networks and structural equation models (SEM), take the form of directed acyclic graphs (DAGs). We provide a novel mixed-integer quadratic programming formulation and an associated algorithm that identifies DAGs with a low structural Hamming distance between the identified DAG and the ground truth, under identifiability assumptions. Eventual exact learning is guaranteed by the global convergence of the branch-and-bound-and-cut algorithm that we use. In addition, integer programming techniques give us access to the dual bound, which allows for a real-time assessment of solution quality. Previously, integer programming techniques have been shown to scale poorly for DAG identification because of the super-exponential number of constraints that prevent the formation of cycles. The proposed algorithm circumvents this by selectively generating only the violated constraints using the so-called “lazy” constraints methodology. Our empirical results show that ExDAG outperforms state-of-the-art solvers in terms of structural Hamming distance and $F_1$ score when considering Gaussian noise on medium-sized graphs.
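The lazy-constraint idea can be sketched as a separation routine: given the edges selected in a candidate solution, look for a directed cycle and, if one exists, return the standard cycle-elimination cut "at most |C| - 1 of the cycle's edges may be selected". This is a hedged illustration in plain Python; the function names are hypothetical, the solver loop is omitted, and the abstract does not say which exact constraint family ExDAG uses.

```python
def find_cycle(num_nodes, edges):
    """Return a list of edges forming a directed cycle, or None if the graph is acyclic."""
    adj = {u: [] for u in range(num_nodes)}
    for u, v in edges:
        adj[u].append(v)
    color = [0] * num_nodes  # 0 = unvisited, 1 = on the current DFS path, 2 = finished
    path = []

    def dfs(u):
        color[u] = 1
        path.append(u)
        for v in adj[u]:
            if color[v] == 1:  # back edge closes a cycle
                cyc = path[path.index(v):] + [v]
                return list(zip(cyc[:-1], cyc[1:]))
            if color[v] == 0:
                found = dfs(v)
                if found:
                    return found
        color[u] = 2
        path.pop()
        return None

    for start in range(num_nodes):
        if color[start] == 0:
            found = dfs(start)
            if found:
                return found
    return None


def violated_cut(num_nodes, edges):
    """Return (cycle_edges, rhs) for the cut sum(x_e for e in cycle_edges) <= rhs, or None."""
    cycle = find_cycle(num_nodes, edges)
    if cycle is None:
        return None
    return cycle, len(cycle) - 1
```

In a branch-and-cut setting, a routine like this would be called on each integer candidate, so that only the cycle constraints actually violated are ever added to the model.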

[366] Deep Learning and Machine Learning, Advancing Big Data Analytics and Management: Tensorflow Pretrained Models

Keyu Chen, Ziqian Bi, Qian Niu, Junyu Liu, Benji Peng, Sen Zhang, Ming Liu, Xinyuan Song, Zekun Jiang, Tianyang Wang, Ming Li, Xuanhe Pan, Jiawei Xu, Jinlang Wang, Pohsun Feng

Main category: cs.LG

TL;DR: Practical guide to using TensorFlow pre-trained models for image classification and object detection, comparing transfer learning methods with visualizations and providing complete code examples.

Motivation: To provide practical guidance for applying TensorFlow pre-trained models in real-world deep learning tasks, making transfer learning accessible to both beginners and advanced users.

Method: Explores modern architectures (ResNet, MobileNet, EfficientNet) and compares linear probing vs. fine-tuning using transfer learning, with visualizations through PCA, t-SNE, and UMAP.

Result: Demonstrates effectiveness of transfer learning through real-world examples and experiments, showing the impact of different approaches on model performance.

Conclusion: By integrating theory with hands-on practice, the paper provides complete tools and insights to efficiently address deep learning challenges using pre-trained models.

Abstract: The application of TensorFlow pre-trained models in deep learning is explored, with an emphasis on practical guidance for tasks such as image classification and object detection. The study covers modern architectures, including ResNet, MobileNet, and EfficientNet, and demonstrates the effectiveness of transfer learning through real-world examples and experiments. A comparison of linear probing and model fine-tuning is presented, supplemented by visualizations using techniques like PCA, t-SNE, and UMAP, allowing for an intuitive understanding of the impact of these approaches. The work provides complete example code and step-by-step instructions, offering valuable insights for both beginners and advanced users. By integrating theoretical concepts with hands-on practice, the paper equips readers with the tools necessary to address deep learning challenges efficiently.

[367] Revisiting Gradient Normalization and Clipping for Nonconvex SGD under Heavy-Tailed Noise: Necessity, Sufficiency, and Acceleration

Tao Sun, Xinwang Liu, Kun Yuan

Main category: cs.LG

TL;DR: Gradient normalization alone can ensure nonconvex SGD convergence under smoothness, and combined with clipping improves rates under heavy-tailed noise; also works for variance-reduced methods.

Motivation: Revisit the belief that gradient clipping is essential for SGD convergence with heavy-tailed noise, and explore if gradient normalization can be an effective alternative or complement.

Method: Theoretical analysis of gradient normalization, clipping, and their combination under smoothness assumptions; investigation of variance-reduced algorithms; development of accelerated variant under second-order smoothness.

Result: Proved gradient normalization alone suffices for nonconvex SGD convergence; combined with clipping yields better convergence rates under challenging noise; established normalization works for variance-reduced methods; presented accelerated variant with improved convergence.

Conclusion: Provides theoretical insights and practical guidance for using normalization and clipping in nonconvex optimization with heavy-tailed noise, showing normalization can be sufficient or complementary to clipping.

Abstract: Gradient clipping has long been considered essential for ensuring the convergence of Stochastic Gradient Descent (SGD) in the presence of heavy-tailed gradient noise. In this paper, we revisit this belief and explore whether gradient normalization can serve as an effective alternative or complement. We prove that, under individual smoothness assumptions, gradient normalization alone is sufficient to guarantee convergence of the nonconvex SGD. Moreover, when combined with clipping, it yields far better rates of convergence under more challenging noise distributions. We provide a unifying theory describing normalization-only, clipping-only, and combined approaches. Moving forward, we investigate existing variance-reduced algorithms, establishing that, in such a setting, normalization alone is sufficient for convergence. Finally, we present an accelerated variant that under second-order smoothness improves convergence. Our results provide theoretical insights and practical guidance for using normalization and clipping in nonconvex optimization with heavy-tailed noise.
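For readers who want the two operations side by side, here is a minimal sketch of a normalized step and a clipped step for a single stochastic gradient. The function names, the `eps` safeguard, and the step sizes are illustrative assumptions; how the paper combines the two, and its accelerated variant, are not specified in the abstract and are not shown.

```python
import numpy as np


def normalized_step(grad, lr, eps=1e-12):
    """Normalized SGD: move a distance of about lr along the negative gradient direction."""
    return -lr * grad / (np.linalg.norm(grad) + eps)


def clipped_step(grad, lr, clip):
    """Clipped SGD: rescale the gradient only when its norm exceeds clip."""
    norm = np.linalg.norm(grad)
    scale = min(1.0, clip / norm) if norm > 0 else 1.0
    return -lr * scale * grad
```

The difference shows up for small gradients: normalization still takes a step of length `lr`, whereas clipping leaves the gradient untouched and takes a proportionally small step. For very large (heavy-tailed) gradients both bound the step length.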

[368] xLSTM-Mixer: Multivariate Time Series Forecasting by Mixing via Scalar Memories

Maurice Kraus, Felix Divo, Devendra Singh Dhami, Kristian Kersting

Main category: cs.LG

TL;DR: xLSTM-Mixer combines recurrent models with mixing architectures for superior time series forecasting, achieving state-of-the-art performance with minimal memory requirements.

Motivation: Time series data is prevalent across many fields, requiring robust forecasting models that can capture patterns within and between temporal and multivariate components for reliable predictions.

Method: xLSTM-Mixer integrates temporal sequences, joint time-variate information, and multiple perspectives through a linear forecast refined by xLSTM blocks, reconciling two distinct views for final forecasting.

Result: Extensive evaluations show superior long-term forecasting performance compared to recent state-of-the-art methods while requiring very little memory.

Conclusion: This work contributes to the resurgence of recurrent models in forecasting by combining them with mixing architectures for the first time, demonstrating robustness and effectiveness.

Abstract: Time series data is prevalent across numerous fields, necessitating the development of robust and accurate forecasting models. Capturing patterns both within and between temporal and multivariate components is crucial for reliable predictions. We introduce xLSTM-Mixer, a model designed to effectively integrate temporal sequences, joint time-variate information, and multiple perspectives for robust forecasting. Our approach begins with a linear forecast shared across variates, which is then refined by xLSTM blocks. They serve as key elements for modeling the complex dynamics of challenging time series data. xLSTM-Mixer ultimately reconciles two distinct views to produce the final forecast. Our extensive evaluations demonstrate its superior long-term forecasting performance compared to recent state-of-the-art methods while requiring very little memory. A thorough model analysis provides further insights into its key components and confirms its robustness and effectiveness. This work contributes to the resurgence of recurrent models in forecasting by combining them, for the first time, with mixing architectures.

[369] RIZE: Adaptive Regularization for Imitation Learning

Adib Karimi, Mohammad Mehdi Ebadzadeh

Main category: cs.LG

TL;DR: Novel IRL method with adaptive TD regularization and distributional RL that achieves expert-level performance on complex environments with limited demonstrations.

Motivation: To overcome the rigidity of fixed reward structures and limited flexibility of implicit reward regularization in Inverse Reinforcement Learning.

Method: Builds on Maximum Entropy IRL with squared temporal-difference regularizer and adaptive targets that evolve dynamically during training, integrated with distributional RL to capture richer return information.

Result: Achieves expert-level performance on complex MuJoCo and Adroit environments, surpassing baseline methods on Humanoid-v2 task with limited expert demonstrations.

Conclusion: The approach effectively mitigates reward rigidity and promotes robust decision-making, with extensive experiments validating its effectiveness and providing insights into reward dynamics in imitation learning.

Abstract: We propose a novel Inverse Reinforcement Learning (IRL) method that mitigates the rigidity of fixed reward structures and the limited flexibility of implicit reward regularization. Building on the Maximum Entropy IRL framework, our approach incorporates a squared temporal-difference (TD) regularizer with adaptive targets that evolve dynamically during training, thereby imposing adaptive bounds on recovered rewards and promoting robust decision-making. To capture richer return information, we integrate distributional RL into the learning process. Empirically, our method achieves expert-level performance on complex MuJoCo and Adroit environments, surpassing baseline methods on the Humanoid-v2 task with limited expert demonstrations. Extensive experiments and ablation studies further validate the effectiveness of the approach and provide insights into reward dynamics in imitation learning. Our source code is available at https://github.com/adibka/RIZE.

[370] Streaming Generation of Co-Speech Gestures via Accelerated Rolling Diffusion

Evgeniia Vu, Andrei Boiarov, Dmitry Vetrov

Main category: cs.LG

TL;DR: A novel streaming gesture generation framework using Rolling Diffusion with progressive noise scheduling for real-time co-speech gesture synthesis, achieving 4x speedup while maintaining quality.

Motivation: Real-time co-speech gesture generation requires both temporal coherence and efficient sampling, which existing methods struggle to achieve simultaneously.

Method: Extends Rolling Diffusion models with structured progressive noise scheduling and introduces Rolling Diffusion Ladder Acceleration (RDLA) for simultaneous multi-frame denoising.

Result: Outperforms state-of-the-art baselines on ZEGGS and BEAT datasets, achieves 4x speedup with high visual fidelity and temporal coherence, validated by user studies.

Conclusion: The framework provides a generalizable and efficient solution for real-time co-speech gesture synthesis that maintains realism, diversity and audio synchronization.

Abstract: Generating co-speech gestures in real time requires both temporal coherence and efficient sampling. We introduce a novel framework for streaming gesture generation that extends Rolling Diffusion models with structured progressive noise scheduling, enabling seamless long-sequence motion synthesis while preserving realism and diversity. Our framework is universally compatible with existing diffusion-based gesture generation models, transforming them into streaming methods capable of continuous generation without requiring post-processing. We evaluate our framework on ZEGGS and BEAT, strong benchmarks for real-world applicability. Applied to state-of-the-art baselines on both datasets, it consistently outperforms them, demonstrating its effectiveness as a generalizable and efficient solution for real-time co-speech gesture synthesis. We further propose Rolling Diffusion Ladder Acceleration (RDLA), a new approach that employs a ladder-based noise scheduling strategy to simultaneously denoise multiple frames. This significantly improves sampling efficiency while maintaining motion consistency, achieving up to a 4x speedup with high visual fidelity and temporal coherence in our experiments. Comprehensive user studies further validate our framework's ability to generate realistic, diverse gestures closely synchronized with the audio input.
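The streaming behaviour of a progressive noise schedule can be pictured with a bookkeeping-only sketch: frames in a sliding window carry increasing integer noise levels, each denoising step lowers every level by one, the oldest frame becomes clean and is emitted, and a fresh fully noisy frame enters the window. No diffusion model is involved here; the integer levels, the window size, and the function name are assumptions for illustration, and the ladder-based RDLA schedule is not reproduced.

```python
from collections import deque


def rolling_schedule(window, num_frames):
    """Simulate noise-level bookkeeping for a rolling window; return (emitted frames, final levels)."""
    levels = deque(range(1, window + 1))  # frame i in the window starts at noise level i + 1
    frames = deque(range(window))         # indices of the frames currently in the window
    next_frame = window
    emitted = []
    while len(emitted) < num_frames:
        levels = deque(level - 1 for level in levels)  # one denoising step on every frame
        levels.popleft()                               # the oldest frame reached level 0 (clean)
        emitted.append(frames.popleft())
        levels.append(window)                          # a new frame enters at the highest noise level
        frames.append(next_frame)
        next_frame += 1
    return emitted, list(levels)
```

Under these assumptions one clean frame leaves the window per denoising step, which is what makes continuous generation possible without post-processing between chunks.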

[371] Measuring the (Un)Faithfulness of Concept-Based Explanations

Shubham Kumar, Narendra Ahuja

Main category: cs.LG

TL;DR: The paper identifies flaws in current unsupervised concept-based explanation methods (U-CBEMs) and proposes SURF, a new faithfulness evaluation framework that uses simple linear surrogates and comprehensive metrics to properly assess explanation fidelity.

Motivation: Current U-CBEMs report improved faithfulness but this improvement is artificial due to either using overly complex surrogates that reduce interpretability or relying on deletion-based approaches that don't properly measure faithfulness.

Method: Proposes SURF framework that: (1) replaces complex surrogates with simple linear surrogates to measure faithfulness without changing interpretability, (2) introduces metrics that assess loss across all output classes, not just the predicted class, and (3) validates with a sanity check using random concepts.

Result: SURF enables the first reliable faithfulness benchmark for U-CBEMs, revealing that many visually compelling U-CBEMs are not actually faithful to the model’s internal computations.

Conclusion: The proposed SURF framework provides a more reliable way to evaluate concept-based explanations, exposing that current methods’ reported faithfulness improvements are artificial and that many visually appealing explanations lack true fidelity to model computations.

Abstract: Deep vision models perform input-output computations that are hard to interpret. Concept-based explanation methods (CBEMs) increase interpretability by re-expressing parts of the model with human-understandable semantic units, or concepts. Checking if the derived explanations are faithful – that is, they represent the model’s internal computation – requires a surrogate that combines concepts to compute the output. Simplifications made for interpretability inevitably reduce faithfulness, resulting in a tradeoff between the two. State-of-the-art unsupervised CBEMs (U-CBEMs) have reported increasingly interpretable concepts, while also being more faithful to the model. However, we observe that the reported improvement in faithfulness artificially results from either (1) using overly complex surrogates, which introduces an unmeasured cost to the explanation’s interpretability, or (2) relying on deletion-based approaches that, as we demonstrate, do not properly measure faithfulness. We propose Surrogate Faithfulness (SURF), which (1) replaces prior complex surrogates with a simple, linear surrogate that measures faithfulness without changing the explanation’s interpretability and (2) introduces well-motivated metrics that assess loss across all output classes, not just the predicted class. We validate SURF with a measure-over-measure study by proposing a simple sanity check – explanations with random concepts should be less faithful – which prior surrogates fail. SURF enables the first reliable faithfulness benchmark of U-CBEMs, revealing that many visually compelling U-CBEMs are not faithful. Code to be released.
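The combination of a simple linear surrogate and a random-concept sanity check can be sketched as follows: fit a linear map from concept activations to the model's outputs and measure the error over all output classes. This is an illustration only; the squared-error metric, fitting and scoring on the same samples, and the function name are assumptions rather than the SURF metrics themselves.

```python
import numpy as np


def linear_surrogate_error(concepts, logits):
    """Fit logits ~ concepts @ W + b by least squares; return mean squared error over all classes."""
    X = np.hstack([concepts, np.ones((len(concepts), 1))])  # append a bias column
    W, *_ = np.linalg.lstsq(X, logits, rcond=None)
    return float(np.mean((X @ W - logits) ** 2))


# Toy data (assumed): outputs that are an exact linear function of five "true" concepts.
rng = np.random.default_rng(0)
true_concepts = rng.normal(size=(200, 5))
logits = true_concepts @ rng.normal(size=(5, 3)) + 0.5
random_concepts = rng.normal(size=(200, 5))  # unrelated to the outputs

err_true = linear_surrogate_error(true_concepts, logits)
err_random = linear_surrogate_error(random_concepts, logits)
```

A faithfulness measure that passes the sanity check should report a larger error for `random_concepts` than for `true_concepts`.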

[372] Quantitative Attractor Analysis of High-Capacity Kernel Logistic Regression Hopfield Networks

Akira Tamamori

Main category: cs.LG

TL;DR: Kernel-based learning methods like KLR and KRR dramatically increase Hopfield network storage capacity, with linear scaling (P ∝ N) under optimized kernel width scaling where γ×N increases with network size.

Motivation: Kernel-based learning methods can significantly boost Hopfield network storage capacity, but their performance principles and stability remain poorly understood, limiting systematic design and application.

Method: Comprehensive quantitative analysis of attractor landscapes in KLR-trained networks through extensive, statistically validated simulations, comparing KLR and KRR performance across different network sizes and parameters.

Result: KLR and KRR show similarly high storage capacities and clean attractor landscapes, with linear scaling (P ∝ N) when kernel width γ is scaled such that γ×N increases with network size. Performance is robust to regularization parameter lambda.

Conclusion: Kernel regression methods overcome classical Hopfield network limitations through proper kernel width scaling, providing empirical principles for designing high-capacity, robust associative memories.

Abstract: Kernel-based learning methods such as Kernel Logistic Regression (KLR) can dramatically increase the storage capacity of Hopfield networks, but the principles governing their performance and stability remain largely uncharacterized. This paper presents a comprehensive quantitative analysis of the attractor landscape in KLR-trained networks to establish a solid foundation for their design and application. Through extensive, statistically validated simulations, we address critical questions of generality, scalability, and robustness. Our comparative analysis reveals that KLR and Kernel Ridge Regression (KRR) exhibit similarly high storage capacities and clean attractor landscapes, suggesting this is a general property of kernel regression methods, though KRR is computationally much faster. We uncover a non-trivial, scale-dependent scaling law for the kernel width ($γ$), demonstrating that optimal capacity requires $γ$ to be scaled such that $γ \times N$ increases with network size $N$. This implies that larger networks necessitate more localized kernels – where each pattern’s influence is more spatially confined – to manage inter-pattern interference. Under this optimized scaling, we provide definitive evidence that the storage capacity scales linearly with network size ($P \propto N$). Furthermore, our sensitivity analysis shows that performance is remarkably robust to the choice of the regularization parameter $λ$. Collectively, these findings provide a clear set of empirical principles for designing high-capacity, robust associative memories and clarify the mechanisms that enable kernel methods to overcome the classical limitations of Hopfield-type models.
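A kernel-regression associative memory of this kind can be sketched with the KRR variant, which has a closed-form solution: one ridge regressor per neuron predicts that neuron's sign from the kernel similarities between the current state and the stored patterns. This is a minimal sketch under assumptions: the Gaussian kernel, the synchronous sign update, and the values of `gamma`, `lam`, `N`, and `P` are illustrative choices, not the paper's experimental setup.

```python
import numpy as np


def rbf_kernel(A, B, gamma):
    """Gaussian kernel matrix with entries exp(-gamma * ||a - b||^2)."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))


def train_krr(patterns, gamma, lam):
    """Solve (K + lam * I) alpha = patterns, i.e. one ridge regressor per neuron."""
    K = rbf_kernel(patterns, patterns, gamma)
    return np.linalg.solve(K + lam * np.eye(len(patterns)), patterns)


def recall(state, patterns, alpha, gamma, steps=5):
    """Synchronous updates: s <- sign(k(s, patterns) @ alpha)."""
    s = np.asarray(state, dtype=float)
    for _ in range(steps):
        k = rbf_kernel(s[None, :], patterns, gamma)  # shape (1, P)
        s = np.where((k @ alpha)[0] >= 0, 1.0, -1.0)
    return s


rng = np.random.default_rng(0)
N, P = 32, 8                      # assumed toy sizes
GAMMA, LAM = 0.25, 1e-3           # assumed kernel width and ridge parameter
patterns = rng.choice([-1.0, 1.0], size=(P, N))
alpha = train_krr(patterns, GAMMA, LAM)
```

Each stored pattern should be a fixed point of `recall`; probing with a few flipped bits is a quick way to explore the basin of attraction for a given `gamma`.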

[373] OODTE: A Differential Testing Engine for the ONNX Optimizer

Nikolaos Louloudakis, Ajitha Rajan

Main category: cs.LG

TL;DR: OODTE is a tool that automatically evaluates the correctness of ONNX Optimizer by detecting accuracy issues through differential testing, revealing that 9.2% of models crash or become invalid and 30% of classification models show output discrepancies.

Motivation: The ONNX Optimizer is widely used but its ability to maintain model accuracy during optimization has not been thoroughly investigated, creating potential reliability concerns.

Method: OODTE uses differential testing methodology - it takes ONNX models, applies optimizations, executes both original and optimized versions across input sets, and when discrepancies occur, iteratively isolates the responsible optimization pass.

Result: Evaluation of 130 models revealed: 9.2% caused crashes or invalid models; 30% of classification models and 16.6% of object detection/segmentation models showed output differences; text-related models were robust; 15 issues (14 new) affecting 9 of 47 optimization passes were found.

Conclusion: OODTE provides an effective framework for validating AI model optimizers, uncovering significant accuracy issues in the widely-used ONNX Optimizer and demonstrating applicability beyond the ONNX ecosystem.

Abstract: With over 760 stars on GitHub and being part of the official ONNX repository, the ONNX Optimizer is the default tool for applying graph-based optimizations to ONNX models. Despite its widespread use, its ability to maintain model accuracy during optimization has not been thoroughly investigated. In this work, we present OODTE, a utility designed to automatically and comprehensively evaluate the correctness of the ONNX Optimizer. OODTE adopts a straightforward yet powerful differential testing and evaluation methodology, which can be readily adapted for use with other compiler optimizers. Specifically, OODTE takes a collection of ONNX models, applies optimizations, and executes both the original and optimized versions across a user-defined input set, automatically capturing any issues encountered during optimization. When discrepancies in accuracy arise, OODTE iteratively isolates the responsible optimization pass by repeating the process at a finer granularity. We applied OODTE to 130 well-known models from the official ONNX Model Hub, spanning diverse tasks including classification, object detection, semantic segmentation, text summarization, question answering, and sentiment analysis. Our evaluation revealed that 9.2% of the model instances either caused the optimizer to crash or led to the generation of invalid models using default optimization strategies. Additionally, 30% of classification models and 16.6% of object detection and segmentation models exhibited differing outputs across original and optimized versions, whereas models focused on text-related tasks were generally robust to optimization. OODTE uncovered 15 issues (14 previously unknown) affecting 9 of 47 optimization passes and the optimizer overall. All issues were reported to the ONNX Optimizer team. OODTE offers a simple but effective framework for validating AI model optimizers, applicable beyond the ONNX ecosystem.
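The differential-testing loop described above can be sketched without any ONNX dependency by treating models as callables and optimization passes as functions from model to model. Everything here is a stand-in: the names, the tolerance, and the "apply each pass on its own" isolation strategy are assumptions, and the real tool works on ONNX graphs and the actual optimizer passes.

```python
import numpy as np


def differential_test(model, passes, inputs, atol=1e-6):
    """Compare a model with its fully optimized version, then isolate the passes that change outputs.

    model:  callable mapping an input array to an output array (stand-in for an ONNX model)
    passes: dict of name -> function(model) -> model (stand-ins for optimizer passes)
    """
    def differs(candidate):
        return any(not np.allclose(model(x), candidate(x), atol=atol) for x in inputs)

    optimized = model
    for apply_pass in passes.values():
        optimized = apply_pass(optimized)
    if not differs(optimized):
        return []
    # Finer granularity: apply each pass on its own and report those that alter the outputs.
    return [name for name, apply_pass in passes.items() if differs(apply_pass(model))]


# Toy example with one harmless pass and one deliberately faulty pass (both hypothetical).
toy_model = lambda x: 2.0 * x + 1.0
toy_passes = {
    "identity_pass": lambda m: m,
    "faulty_fold": lambda m: (lambda x: m(x) + 1e-3),
}
culprits = differential_test(toy_model, toy_passes, [np.array([1.0, 2.0])])
```

Isolating passes one at a time misses bugs that only appear when passes interact; a fuller harness would also test prefixes or subsets of the pass list.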

[374] Put CASH on Bandits: A Max K-Armed Problem for Automated Machine Learning

Amir Rezaei Balef, Claire Vernade, Katharina Eggensperger

Main category: cs.LG

TL;DR: MaxUCB is a max k-armed bandit method for CASH problems that efficiently trades off exploring model classes and hyperparameter optimization, designed for light-tailed reward distributions in AutoML.

Motivation: CASH (Combined Algorithm Selection and Hyperparameter optimization) is a challenging resource allocation problem in AutoML that requires balancing exploration of different model classes with hyperparameter optimization.

Method: Proposed MaxUCB, a max k-armed bandit method specifically designed for light-tailed and bounded reward distributions in CASH problems, providing an alternative to classic methods assuming heavy-tailed distributions.

Result: The method was theoretically and empirically evaluated on four standard AutoML benchmarks, demonstrating superior performance over prior approaches.

Conclusion: MaxUCB provides an efficient solution for CASH problems in AutoML, outperforming existing methods while being specifically adapted to the reward distribution characteristics of this setting.

Abstract: The Combined Algorithm Selection and Hyperparameter optimization (CASH) is a challenging resource allocation problem in the field of AutoML. We propose MaxUCB, a max k-armed bandit method to trade off exploring different model classes and conducting hyperparameter optimization. MaxUCB is specifically designed for the light-tailed and bounded reward distributions arising in this setting and, thus, provides an efficient alternative compared to classic max k-armed bandit methods assuming heavy-tailed reward distributions. We theoretically and empirically evaluate our method on four standard AutoML benchmarks, demonstrating superior performance over prior approaches. We make our code and data available at https://github.com/amirbalef/CASH_with_Bandits

[375] ZENN: A Thermodynamics-Inspired Computational Framework for Heterogeneous Data-Driven Modeling

Shun Wang, Shun-Li Shang, Zi-Kui Liu, Wenrui Hao

Main category: cs.LG

TL;DR: ZENN extends zentropy theory to machine learning, enabling better learning from heterogeneous datasets by simultaneously modeling energy and intrinsic entropy components with a learnable temperature variable for superior generalization.

Motivation: Address challenges in integrating heterogeneous datasets with intrinsic disparities by extending zentropy theory from physics to data science, enabling more effective learning from multi-source data.

Method: Introduce zentropy-enhanced neural network (ZENN) that simultaneously learns energy and intrinsic entropy components, with redesigned architecture and learnable temperature variable to model latent multi-source heterogeneity.

Result: ZENN surpasses state-of-the-art models on CIFAR-10/100, BBCNews, and AGNews, and successfully reconstructs Helmholtz energy landscape of Fe3Pt capturing negative thermal expansion and critical points in temperature-pressure space.

Conclusion: ZENN provides a zentropy-grounded framework for data-driven machine learning, offering versatile and robust approach for scientific problems involving complex, heterogeneous datasets.

Abstract: Traditional entropy-based methods, such as cross-entropy loss in classification problems, have long been essential tools for representing the information uncertainty and physical disorder in data and for developing artificial intelligence algorithms. However, the rapid growth of data across various domains has introduced new challenges, particularly the integration of heterogeneous datasets with intrinsic disparities. To address this, we introduce a zentropy-enhanced neural network (ZENN), extending zentropy theory into the data science domain via intrinsic entropy, enabling more effective learning from heterogeneous data sources. ZENN simultaneously learns both energy and intrinsic entropy components, capturing the underlying structure of multi-source data. To support this, we redesign the neural network architecture to better reflect the intrinsic properties and variability inherent in diverse datasets. We demonstrate the effectiveness of ZENN on classification tasks and energy landscape reconstructions, showing its superior generalization capabilities and robustness, particularly in predicting high-order derivatives. ZENN demonstrates superior generalization by introducing a learnable temperature variable that models latent multi-source heterogeneity, allowing it to surpass state-of-the-art models on CIFAR-10/100, BBCNews, and AGNews. As a practical application in materials science, we employ ZENN to reconstruct the Helmholtz energy landscape of Fe$_3$Pt using data generated from density functional theory (DFT) and capture key material behaviors, including negative thermal expansion and the critical point in the temperature-pressure space. Overall, this work presents a zentropy-grounded framework for data-driven machine learning, positioning ZENN as a versatile and robust approach for scientific problems involving complex, heterogeneous datasets.

[376] TI-DeepONet: Learnable Time Integration for Stable Long-Term Extrapolation

Dibyajyoti Nayak, Somdatta Goswami

Main category: cs.LG

TL;DR: TI-DeepONet integrates neural operators with adaptive numerical time-stepping to enable accurate long-term temporal extrapolation of dynamical systems, shifting from direct state prediction to approximating instantaneous time-derivative fields.

Motivation: Conventional DeepONet approaches face limitations in temporal extrapolation due to either ignoring temporal causality (fixed-horizon rollouts) or accumulating errors (autoregressive schemes), creating a need for physics-aware operator learning that preserves Markovian structure.

Method: The framework learns instantaneous time-derivative fields instead of direct states, then integrates using standard numerical solvers. TI(L)-DeepONet extends this with learnable coefficients for intermediate slopes in multi-stage integration, adapting to solution-specific dynamics.

Result: Across four canonical PDEs, TI(L)-DeepONet slightly outperforms TI-DeepONet, with both achieving ~81% reduction in relative L2 error compared to autoregressive methods and ~70% reduction compared to fixed-horizon approaches, maintaining stable predictions over nearly twice the training interval.

Conclusion: This work establishes a physics-aware operator learning framework that bridges neural approximation with numerical analysis principles, effectively addressing long-term forecasting challenges in complex physical systems.

Abstract: Accurate temporal extrapolation remains a fundamental challenge for neural operators modeling dynamical systems, where predictions must extend far beyond the training horizon. Conventional DeepONet approaches rely on two limited paradigms: fixed-horizon rollouts, which predict full spatiotemporal solutions while ignoring temporal causality, and autoregressive schemes, which accumulate errors through sequential prediction. We introduce TI-DeepONet, a framework that integrates neural operators with adaptive numerical time-stepping to preserve the Markovian structure of dynamical systems while mitigating long-term error growth. Our method shifts the learning objective from direct state prediction to approximating instantaneous time-derivative fields, which are then integrated using standard numerical solvers. This naturally enables continuous-time prediction and allows the use of higher-order integrators at inference than those used in training, improving both efficiency and accuracy. We further propose TI(L)-DeepONet, which incorporates learnable coefficients for intermediate slopes in multi-stage integration, adapting to solution-specific dynamics and enhancing fidelity. Across four canonical PDEs featuring chaotic, dissipative, dispersive, and high-dimensional behavior, TI(L)-DeepONet slightly outperforms TI-DeepONet, and both achieve major reductions in relative L2 extrapolation error: about 81% compared to autoregressive methods and 70% compared to fixed-horizon approaches. Notably, both models maintain stable predictions over temporal domains nearly twice the training interval. This work establishes a physics-aware operator learning framework that bridges neural approximation with numerical analysis principles, addressing a key gap in long-term forecasting of complex physical systems.
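
The core idea, predicting a time derivative and handing it to a numerical integrator, can be sketched as follows. The function `decay_rhs` is a hypothetical stand-in for a trained operator network; the classical fourth-order Runge-Kutta weights are used as fixed defaults, whereas the TI(L) variant described above would learn such stage coefficients. Nothing here is taken from the authors' implementation.

```python
import math


def rk4_step(f, t, u, dt, weights=(1 / 6, 1 / 3, 1 / 3, 1 / 6)):
    """One Runge-Kutta step for du/dt = f(t, u), with u a list of floats.

    `weights` combines the four stage slopes; the defaults are classical RK4.
    """
    k1 = f(t, u)
    k2 = f(t + dt / 2, [ui + dt / 2 * k for ui, k in zip(u, k1)])
    k3 = f(t + dt / 2, [ui + dt / 2 * k for ui, k in zip(u, k2)])
    k4 = f(t + dt, [ui + dt * k for ui, k in zip(u, k3)])
    w1, w2, w3, w4 = weights
    return [ui + dt * (w1 * a + w2 * b + w3 * c + w4 * d)
            for ui, a, b, c, d in zip(u, k1, k2, k3, k4)]


def rollout(f, u0, t0, dt, steps):
    """Integrate forward in time, returning the list of visited states."""
    u, t = list(u0), t0
    trajectory = [u]
    for _ in range(steps):
        u = rk4_step(f, t, u, dt)
        t += dt
        trajectory.append(u)
    return trajectory


def decay_rhs(t, u):
    """Stand-in for the learned derivative field: du/dt = -u."""
    return [-ui for ui in u]


final_state = rollout(decay_rhs, [1.0], 0.0, 0.1, 10)[-1][0]
print(final_state, math.exp(-1.0))
```

Because the network only supplies `f`, the integrator and step size can be changed at inference time without retraining, which is the property the abstract highlights.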

[377] Turb-L1: Achieving Long-term Turbulence Tracing By Tackling Spectral Bias

Hao Wu, Yuan Gao, Chang Liu, Fan Xu, Fan Zhang, Zhihong Zhu, Yuqi Li, Xian Wu, Yuxuan Liang, Li Liu, Qingsong Wen, Kun Wang, Yu Zheng, Xiaomeng Huang

Main category: cs.LG

TL;DR: Turb-L1 overcomes spectral bias in turbulence prediction using hierarchical dynamics synthesis, achieving 80.3% MSE reduction and 9x SSIM improvement over SOTA methods.

Motivation: Existing deep learning methods fail in long-term turbulence prediction due to excessive smoothing and inability to track complex fluid dynamics, with spectral bias identified as the core obstacle.

Method: Proposes Turb-L1 with Hierarchical Dynamics Synthesis mechanism in a multi-grid architecture to explicitly overcome spectral bias and capture cross-scale interactions while preserving high-frequency dynamics.

Result: Reduces MSE by 80.3% and increases SSIM by over 9x compared to SOTA baseline; accurately reproduces full enstrophy spectrum and maintains physical realism in high-wavenumber regions.

Conclusion: Turb-L1 effectively overcomes spectral bias, enabling reliable long-term tracking of turbulence evolution while avoiding spectral distortions and spurious energy accumulation seen in other methods.

Abstract: Accurately predicting the long-term evolution of turbulence is crucial for advancing scientific understanding and optimizing engineering applications. However, existing deep learning methods face significant bottlenecks in long-term autoregressive prediction, where they exhibit excessive smoothing and fail to accurately track complex fluid dynamics. Our extensive experimental and spectral analysis of prevailing methods provides an interpretable explanation for this shortcoming, identifying Spectral Bias as the core obstacle. Concretely, spectral bias is the inherent tendency of models to favor low-frequency, smooth features while overlooking critical high-frequency details during training, thus reducing fidelity and causing physical distortions in long-term predictions. Building on this insight, we propose Turb-L1, an innovative turbulence prediction method, which utilizes a Hierarchical Dynamics Synthesis mechanism within a multi-grid architecture to explicitly overcome spectral bias. It accurately captures cross-scale interactions and preserves the fidelity of high-frequency dynamics, enabling reliable long-term tracking of turbulence evolution. Extensive experiments on the 2D turbulence benchmark show that Turb-L1 demonstrates excellent performance: (I) In long-term predictions, it reduces Mean Squared Error (MSE) by $80.3\%$ and increases Structural Similarity (SSIM) by over $9\times$ compared to the SOTA baseline, significantly improving prediction fidelity. (II) It effectively overcomes spectral bias, accurately reproducing the full enstrophy spectrum and maintaining physical realism in high-wavenumber regions, thus avoiding the spectral distortions or spurious energy accumulation seen in other methods.

[378] Energy-based generator matching: A neural sampler for general state space

Dongyeop Woo, Minsu Kim, Minkyu Kim, Kiyoung Seong, Sungsoo Ahn

Main category: cs.LG

TL;DR: EGM trains generative models from energy functions without data, supporting continuous-time Markov processes like diffusion, flow, and jump models across continuous, discrete, and mixed modalities.

Motivation: To enable training of generative models directly from energy functions when data is unavailable, extending generator matching to handle arbitrary continuous-time Markov processes across different data modalities.

Method: Uses self-normalized importance sampling with a bootstrapping trick to estimate generator matching loss, reducing variance in importance weights for training.

Result: Validated on discrete and multimodal tasks up to 100 and 20 dimensions respectively, demonstrating effectiveness across different data types and dimensions.

Conclusion: EGM provides a versatile, modality-agnostic framework for training generative models from energy functions without requiring actual data samples.

Abstract: We propose Energy-based generator matching (EGM), a modality-agnostic approach to train generative models from energy functions in the absence of data. Extending the recently proposed generator matching, EGM enables training of arbitrary continuous-time Markov processes, e.g., diffusion, flow, and jump, and can generate data from continuous, discrete, and a mixture of two modalities. To this end, we propose estimating the generator matching loss using self-normalized importance sampling with an additional bootstrapping trick to reduce variance in the importance weight. We validate EGM on both discrete and multimodal tasks up to 100 and 20 dimensions, respectively.
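
Self-normalized importance sampling, the estimator the abstract builds on, can be sketched in isolation. The example below estimates an expectation under a target known only through its energy (unnormalized log-density), using samples from a tractable proposal. It does not implement the generator matching loss or the bootstrapping trick; the target, the proposal, and all function names are assumptions chosen for illustration.

```python
import math
import random


def snis_expectation(energy, h, proposal_sample, proposal_logpdf, n, seed=0):
    """Estimate E_p[h(x)] where p(x) is proportional to exp(-energy(x))."""
    rng = random.Random(seed)
    xs = [proposal_sample(rng) for _ in range(n)]
    # Log importance weights; unknown normalizing constants cancel after
    # self-normalization.
    log_w = [-energy(x) - proposal_logpdf(x) for x in xs]
    shift = max(log_w)  # subtract the max for numerical stability
    w = [math.exp(lw - shift) for lw in log_w]
    total = sum(w)
    return sum(wi * h(x) for wi, x in zip(w, xs)) / total


def target_energy(x):
    """Energy of a unit-variance Gaussian centered at 1 (toy target)."""
    return 0.5 * (x - 1.0) ** 2


def proposal_sample(rng):
    """Draw from the proposal N(0, 2^2)."""
    return rng.gauss(0.0, 2.0)


def proposal_logpdf(x):
    """Log-density of N(0, 2^2)."""
    return -0.5 * (x / 2.0) ** 2 - math.log(2.0) - 0.5 * math.log(2.0 * math.pi)


estimate = snis_expectation(target_energy, lambda x: x,
                            proposal_sample, proposal_logpdf, n=20000)
print(estimate)  # the true mean of the toy target is 1.0
```

The estimator is biased for finite `n` but consistent, and its variance grows when a few weights dominate, which is the issue a variance-reduction step such as the one mentioned in the abstract is meant to address.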

[379] Learning in Compact Spaces with Approximately Normalized Transformer

Jörg K. H. Franke, Urs Spiegelhalter, Marianna Nezhurina, Jenia Jitsev, Frank Hutter, Michael Hefenbrock

Main category: cs.LG

TL;DR: The paper proposes approximate normalization via scalar multiplications to address training challenges in deep neural networks, eliminating the need for weight decay and learning rate warm-up while achieving faster convergence.

Motivation: To overcome challenges like overfitting, numerical instabilities, and variance in residual streams without requiring additional hyperparameter tuning for regularization and normalization techniques.

Method: Uses approximate normalization via simple scalar multiplications based on the concentration of norms of high-dimensional random vectors, and constrains parameter norms instead of applying strict normalization.

Result: Experiments with transformer architectures show up to 40% faster convergence compared to GPT models with QK normalization, with only 3% additional runtime cost. Enables training with larger batch sizes while preserving scaling characteristics.

Conclusion: The proposed holistic normalization approach effectively addresses training challenges without increasing normalization layers, providing faster convergence and better scalability while maintaining computational efficiency.

Abstract: The successful training of deep neural networks requires addressing challenges such as overfitting, numerical instabilities leading to divergence, and increasing variance in the residual stream. A common solution is to apply regularization and normalization techniques that usually require tuning additional hyperparameters. An alternative is to force all parameters and representations to lie on a hypersphere. This removes the need for regularization and increases convergence speed, but comes with additional costs. In this work, we propose a more holistic, approximate normalization via simple scalar multiplications motivated by the tight concentration of the norms of high-dimensional random vectors. Additionally, instead of applying strict normalization for the parameters, we constrain their norms. These modifications remove the need for weight decay and learning rate warm-up as well, but do not increase the total number of normalization layers. Our experiments with transformer architectures show up to 40% faster convergence compared to GPT models with QK normalization, with only 3% additional runtime cost. When deriving scaling laws, we found that our method enables training with larger batch sizes while preserving the favorable scaling characteristics of classic GPT architectures.
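
The concentration-of-norm argument can be checked numerically. For a vector with d independent unit-variance entries, the Euclidean norm is close to sqrt(d) when d is large, so multiplying by the constant 1/sqrt(d) lands near the unit sphere without computing the norm. The sketch assumes unit-variance Gaussian entries; the scaling factors actually used in the paper are not given in the abstract.

```python
import math
import random


def norm(x):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(v * v for v in x))


def approx_normalize(x):
    """Scale by a constant that depends only on the dimension, not on x."""
    scale = 1.0 / math.sqrt(len(x))
    return [scale * v for v in x]


rng = random.Random(0)
for d in (16, 256, 4096):
    x = [rng.gauss(0.0, 1.0) for _ in range(d)]
    # The norm after constant scaling approaches 1 as d grows.
    print(d, norm(approx_normalize(x)))
```

The relative spread of the norm shrinks roughly like 1/sqrt(2d), which is why a fixed scalar becomes a good substitute for exact normalization in wide layers.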

[380] A Unified Framework for Provably Efficient Algorithms to Estimate Shapley Values

Tyler Chen, Akshay Seshadri, Mattia J. Villani, Pradeep Niroula, Shouvanik Chakrabarti, Archan Ray, Pranav Deshpande, Romina Yalovetzky, Marco Pistoia, Niraj Kumar

Main category: cs.LG

TL;DR: This paper provides the first theoretical guarantees for KernelSHAP and introduces a unified framework for Shapley value estimators with improved scalability and performance.

Motivation: KernelSHAP is widely used but lacks theoretical guarantees, while other estimators have bounds but may not scale well. There's a need for a unified framework with proven theoretical foundations.

Method: Developed a unified framework encompassing KernelSHAP and related estimators built on sampling with or without replacement, proved non-asymptotic theoretical guarantees, and implemented scalability improvements for high-dimensional datasets.

Result: Achieved low mean squared error with modest sample sizes on Decision-Tree models, and demonstrated better performance than KernelSHAP library on MNIST and CIFAR10 datasets.

Conclusion: The framework provides the first theoretical guarantees for KernelSHAP, enables better understanding of estimator tradeoffs, and offers improved scalability and performance for high-dimensional datasets.

Abstract: Shapley values have emerged as a critical tool for explaining which features impact the decisions made by machine learning models. However, computing exact Shapley values is difficult, generally requiring an exponential (in the feature dimension) number of model evaluations. To address this, many model-agnostic randomized estimators have been developed, the most influential and widely used being the KernelSHAP method (Lundberg & Lee, 2017). While related estimators such as unbiased KernelSHAP (Covert & Lee, 2021) and LeverageSHAP (Musco & Witter, 2025) are known to satisfy theoretical guarantees, bounds for KernelSHAP have remained elusive. We describe a broad and unified framework that encompasses KernelSHAP and related estimators constructed using both with and without replacement sampling strategies. We then prove strong non-asymptotic theoretical guarantees that apply to all estimators from our framework. This provides, to the best of our knowledge, the first theoretical guarantees for KernelSHAP and sheds further light on tradeoffs between existing estimators. Through comprehensive benchmarking on small and medium dimensional datasets for Decision-Tree models, we validate our approach against exact Shapley values, consistently achieving low mean squared error with modest sample sizes. Furthermore, we make specific implementation improvements to enable scalability of our methods to high-dimensional datasets. Our methods, tested on datasets such as MNIST and CIFAR10, provide consistently better results compared to the KernelSHAP library.
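
The exponential cost of exact Shapley values, and the appeal of randomized estimators, can be seen in a small sketch. The exact routine enumerates every coalition; the second routine is a plain permutation-sampling estimator, swapped in for simplicity and not the KernelSHAP regression or the paper's framework. The value function `toy_value` is a hypothetical three-feature game.

```python
import random
from itertools import combinations
from math import factorial


def exact_shapley(value, n):
    """Exact Shapley values by enumerating all coalitions (exponential in n)."""
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def permutation_shapley(value, n, samples, seed=0):
    """Monte Carlo estimate: average marginal contributions over random orders."""
    rng = random.Random(seed)
    phi = [0.0] * n
    for _ in range(samples):
        order = list(range(n))
        rng.shuffle(order)
        coalition = frozenset()
        previous = value(coalition)
        for i in order:
            coalition = coalition | {i}
            current = value(coalition)
            phi[i] += (current - previous) / samples
            previous = current
    return phi


def toy_value(s):
    """Additive payoffs 1, 2, 3 plus an interaction bonus for features 0 and 1."""
    total = sum({0: 1.0, 1: 2.0, 2: 3.0}[i] for i in s)
    if 0 in s and 1 in s:
        total += 2.0
    return total


print(exact_shapley(toy_value, 3))
print(permutation_shapley(toy_value, 3, samples=2000))
```

In the toy game the interaction bonus is split evenly between features 0 and 1, and both routines return attributions that sum to the value of the full coalition.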

[381] Global Convergence of Adjoint-Optimized Neural PDEs

Konstantin Riedl, Justin Sirignano, Konstantinos Spiliopoulos

Main category: cs.LG

TL;DR: The paper studies convergence of adjoint gradient descent for training neural PDE models, proving global convergence to target data despite non-convex optimization challenges in the infinite-width limit.

Motivation: Many fields need to model PDE terms with neural networks to approximate missing physics, requiring solving inverse problems from observed data. Neural PDE models are important in scientific machine learning but lack theoretical convergence guarantees.

Method: Uses adjoint gradient descent optimization for training neural PDE models, analyzing convergence in the limit where both hidden units and training time tend to infinity. Studies nonlinear parabolic PDEs with neural networks embedded in source terms.

Result: Proves convergence of trained neural-network PDE solution to target data (global minimizer) despite non-local neural network kernel operator and nonlinear PDE system leading to non-convex optimization in the infinite-width limit.

Conclusion: Theoretical convergence is established for neural PDE models even when the optimization remains non-convex in the infinite-width limit, unlike typical neural network cases where optimization becomes convex. Numerical studies validate the theoretical results.

Abstract: Many engineering and scientific fields have recently become interested in modeling terms in partial differential equations (PDEs) with neural networks, which requires solving the inverse problem of learning neural network terms from observed data in order to approximate missing or unresolved physics in the PDE model. The resulting neural-network PDE model, being a function of the neural network parameters, can be calibrated to the available ground truth data by optimizing over the PDE using gradient descent, where the gradient is evaluated in a computationally efficient manner by solving an adjoint PDE. These neural PDE models have emerged as an important research area in scientific machine learning. In this paper, we study the convergence of the adjoint gradient descent optimization method for training neural PDE models in the limit where both the number of hidden units and the training time tend to infinity. Specifically, for a general class of nonlinear parabolic PDEs with a neural network embedded in the source term, we prove convergence of the trained neural-network PDE solution to the target data (i.e., a global minimizer). The global convergence proof poses a unique mathematical challenge that is not encountered in finite-dimensional neural network convergence analyses due to (i) the neural network training dynamics involving a non-local neural network kernel operator in the infinite-width hidden layer limit where the kernel lacks a spectral gap for its eigenvalues and (ii) the nonlinearity of the limit PDE system, which leads to a non-convex optimization problem in the neural network function even in the infinite-width hidden layer limit (unlike in typical neural network training cases where the optimization problem becomes convex in the large neuron limit). The theoretical results are illustrated and empirically validated by numerical studies.

[382] Few-shot Class-incremental Fault Diagnosis by Preserving Class-Agnostic Knowledge with Dual-Granularity Representations

Zhendong Yang, Jie Wang, Liansong Zong, Xiaorong Liu, Quan Qian, Shiqian Chen

Main category: cs.LG

TL;DR: Proposes Dual-Granularity Guidance Network (DGGN) for Few-Shot Class-Incremental Fault Diagnosis, using dual representations and cross-attention to prevent forgetting and overfitting.

Motivation: Address catastrophic forgetting and overfitting in Few-Shot Class-Incremental Fault Diagnosis, where systems must continuously learn new fault classes with limited samples without forgetting old knowledge.

Method: Uses dual-granularity representations: fine-grained stream with Multi-Order Interaction Aggregation for class-specific features, and coarse-grained stream for class-agnostic knowledge. Features fused via multi-semantic cross-attention, with Boundary-Aware Exemplar Prioritization and decoupled Balanced Random Forest classifier.

Result: Extensive experiments on TEP benchmark and real-world MFF dataset show superior diagnostic performance and stability compared to state-of-the-art FSC-FD approaches.

Conclusion: DGGN effectively addresses catastrophic forgetting and overfitting in few-shot incremental fault diagnosis through dual-granularity representations and guided feature learning.

Abstract: Few-Shot Class-Incremental Fault Diagnosis (FSC-FD), which aims to continuously learn from new fault classes with only a few samples without forgetting old ones, is critical for real-world industrial systems. However, this challenging task severely amplifies the issues of catastrophic forgetting of old knowledge and overfitting on scarce new data. To address these challenges, this paper proposes a novel framework built upon Dual-Granularity Representations, termed the Dual-Granularity Guidance Network (DGGN). Our DGGN explicitly decouples feature learning into two parallel streams: 1) a fine-grained representation stream, which utilizes a novel Multi-Order Interaction Aggregation module to capture discriminative, class-specific features from the limited new samples; and 2) a coarse-grained representation stream, designed to model and preserve general, class-agnostic knowledge shared across all fault types. These two representations are dynamically fused by a multi-semantic cross-attention mechanism, where the stable coarse-grained knowledge guides the learning of fine-grained features, preventing overfitting and alleviating feature conflicts. To further mitigate catastrophic forgetting, we design a Boundary-Aware Exemplar Prioritization strategy. Moreover, a decoupled Balanced Random Forest classifier is employed to counter the decision boundary bias caused by data imbalance. Extensive experiments on the TEP benchmark and a real-world MFF dataset demonstrate that our proposed DGGN achieves superior diagnostic performance and stability compared to state-of-the-art FSC-FD approaches. Our code is publicly available at https://github.com/MentaY/DGGN

[383] RTNinja: A generalized machine learning framework for analyzing random telegraph noise signals in nanoelectronic devices

Anirudh Varanasi, Robin Degraeve, Philippe Roussel, Clement Merckling

Main category: cs.LG

TL;DR: RTNinja is an automated machine learning framework for unsupervised analysis of random telegraph noise in nanoelectronics, using Bayesian inference and probabilistic clustering to identify hidden noise sources without prior knowledge.

Motivation: Conventional analysis techniques for random telegraph noise rely on restrictive assumptions or manual interventions, limiting their applicability to complex, noisy datasets in nanoelectronic devices where this noise critically impacts reliability and performance.

Method: RTNinja comprises two modular components: LevelsExtractor (using Bayesian inference and model selection for denoising and discretization) and SourcesMapper (using probabilistic clustering and optimization to infer source configurations). Performance was evaluated using a Monte Carlo simulator generating 7000 labeled datasets.

Result: RTNinja consistently demonstrated high-fidelity signal reconstruction and accurate extraction of source amplitudes and activity patterns across diverse signal-to-noise ratios and source complexities.

Conclusion: RTNinja offers a robust, scalable, and device-agnostic tool for random telegraph noise characterization, enabling large-scale statistical benchmarking, reliability qualification, predictive failure modeling, and device physics exploration in next-generation nanoelectronics.

Abstract: Random telegraph noise is a prevalent variability phenomenon in nanoelectronic devices, arising from stochastic carrier exchange at defect sites and critically impacting device reliability and performance. Conventional analysis techniques often rely on restrictive assumptions or manual interventions, limiting their applicability to complex, noisy datasets. Here, we introduce RTNinja, a generalized, fully automated machine learning framework for the unsupervised analysis of random telegraph noise signals. RTNinja deconvolves complex signals to identify the number and characteristics of hidden individual sources without requiring prior knowledge of the system. The framework comprises two modular components: LevelsExtractor, which uses Bayesian inference and model selection to denoise and discretize the signal, and SourcesMapper, which infers source configurations through probabilistic clustering and optimization. To evaluate performance, we developed a Monte Carlo simulator that generates labeled datasets spanning broad signal-to-noise ratios and source complexities; across 7000 such datasets, RTNinja consistently demonstrated high-fidelity signal reconstruction and accurate extraction of source amplitudes and activity patterns. Our results demonstrate that RTNinja offers a robust, scalable, and device-agnostic tool for random telegraph noise characterization, enabling large-scale statistical benchmarking, reliability-centric technology qualification, predictive failure modeling, and device physics exploration in next-generation nanoelectronics.
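
A labeled random telegraph noise signal of the kind used for benchmarking can be generated with a very small simulator. The sketch below assumes a discrete-time model in which each source is an independent two-state Markov chain with its own amplitude and per-step capture and emission probabilities, plus additive Gaussian background noise. This is a common textbook abstraction, not the authors' Monte Carlo simulator, and all names and parameter values are hypothetical.

```python
import random


def simulate_rtn(n_steps, sources, noise_std, seed=0):
    """Simulate a multi-source random telegraph noise trace.

    sources: list of (amplitude, p_capture, p_emit) tuples, one per defect.
    Returns (signal, labels) where labels[t] is the tuple of source states at t.
    """
    rng = random.Random(seed)
    states = [0] * len(sources)
    signal, labels = [], []
    for _ in range(n_steps):
        for k, (_, p_capture, p_emit) in enumerate(sources):
            # A source in state 0 may capture a carrier; in state 1 it may emit.
            p_flip = p_capture if states[k] == 0 else p_emit
            if rng.random() < p_flip:
                states[k] = 1 - states[k]
        clean = sum(amp * s for (amp, _, _), s in zip(sources, states))
        signal.append(clean + rng.gauss(0.0, noise_std))
        labels.append(tuple(states))
    return signal, labels


# Two hypothetical defects with different amplitudes and switching rates.
TOY_SOURCES = [(1.0, 0.05, 0.10), (0.4, 0.20, 0.20)]
trace, truth = simulate_rtn(1000, TOY_SOURCES, noise_std=0.1)
print(trace[:5], truth[:5])
```

With two sources the clean signal takes up to four discrete levels; recovering those levels from the noisy trace and then attributing them to individual sources corresponds to the two stages the abstract assigns to LevelsExtractor and SourcesMapper.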

[384] MAP Estimation with Denoisers: Convergence Rates and Guarantees

Scott Pesme, Giacomo Meanti, Michael Arbel, Julien Mairal

Main category: cs.LG

TL;DR: The paper provides theoretical justification for using denoisers as proximal operators in MAP optimization, showing convergence under log-concavity assumptions.

Motivation: Existing methods use pretrained denoisers as surrogates for proximal operators in MAP optimization without theoretical justification, creating a gap between practice and theory.

Method: The authors analyze a simple algorithm related to practical methods, interpreting it as gradient descent on smoothed proximal objectives under log-concavity assumptions on the prior.

Result: The algorithm provably converges to the proximal operator when the prior distribution is log-concave, providing theoretical foundation for empirical methods.

Conclusion: This work bridges the theory-practice gap by establishing theoretical guarantees for using denoisers as proximal operators in MAP optimization problems.

Abstract: Denoiser models have become powerful tools for inverse problems, enabling the use of pretrained networks to approximate the score of a smoothed prior distribution. These models are often used in heuristic iterative schemes aimed at solving Maximum a Posteriori (MAP) optimisation problems, where the proximal operator of the negative log-prior plays a central role. In practice, this operator is intractable, and practitioners plug in a pretrained denoiser as a surrogate, despite the lack of general theoretical justification for this substitution. In this work, we show that a simple algorithm, closely related to several used in practice, provably converges to the proximal operator under a log-concavity assumption on the prior $p$. We show that this algorithm can be interpreted as a gradient descent on smoothed proximal objectives. Our analysis thus provides a theoretical foundation for a class of empirically successful but previously heuristic methods.
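
For readers unfamiliar with the object being approximated: the proximal operator of a function f at a point y with parameter lam is the minimizer of f(x) + (x - y)^2 / (2 * lam). The sketch below computes it by gradient descent for a scalar Gaussian prior, a log-concave case where the answer is known in closed form. It uses the exact gradient of the negative log-prior, not a denoiser, and is not the paper's algorithm; the function names and parameter values are assumptions.

```python
def prox_by_gradient_descent(grad_f, y, lam, step=0.1, iters=500):
    """Minimize f(x) + (x - y)^2 / (2 * lam) over scalar x by gradient descent."""
    x = y
    for _ in range(iters):
        x -= step * (grad_f(x) + (x - y) / lam)
    return x


SIGMA2 = 4.0  # variance of the toy Gaussian prior N(0, SIGMA2)


def grad_neg_log_prior(x):
    """Gradient of f(x) = x^2 / (2 * SIGMA2), the negative log-prior up to a constant."""
    return x / SIGMA2


def prox_closed_form(y, lam):
    """Closed-form proximal point for the Gaussian prior above."""
    return y * SIGMA2 / (SIGMA2 + lam)


print(prox_by_gradient_descent(grad_neg_log_prior, y=3.0, lam=1.0))
print(prox_closed_form(3.0, 1.0))
```

For a general learned prior neither the gradient nor the closed form is available, which is where a denoiser-based surrogate of the kind analyzed in the paper comes in.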

[385] Differentiable Entropy Regularization: A Complexity-Aware Approach for Neural Optimization

Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma

Main category: cs.LG

TL;DR: Differentiable approximation of range-partition entropy enables complexity regularization for neural networks, achieving significant speedups in vision transformers and LLMs while maintaining accuracy and improving robustness.

Motivation: To develop a complementary regularizer that provides orthogonal efficiency gains by directly minimizing representation complexity, unlike architectural modifications or output distribution regularization methods.

Method: Introduces the first differentiable approximation of range-partition entropy from computational geometry, used as a complexity regularizer that can be combined with existing optimizations like FlashAttention.

Result: Achieves 4-5× provable speedups on computational geometry problems, 2.07× speedup on ImageNet-1K with ViT-Base, and 1.48-1.60× inference speedups on LLMs at 70-75% sparsity with minimal quality degradation.

Conclusion: Complexity regularization offers a principled pathway to joint efficiency-robustness optimization, with strongest benefits for geometry and vision transformers, and measurable gains on LLMs through semantically structured sparsity patterns.

Abstract: We introduce the first differentiable approximation of range-partition entropy, a complexity measure from computational geometry that directly bounds algorithmic runtime. Unlike architectural modifications, our method is a complementary regularizer that provides orthogonal efficiency gains when combined with existing optimizations. We establish theoretical guarantees in computational geometry, achieving 4–5$\times$ provable speedups on convex hull and triangulation with $<$0.2% error. On ImageNet-1K with ViT-Base, entropy regularization achieves 80.1% top-1 accuracy at 80% sparsity (1.60$\times$ standalone speedup), and when combined with FlashAttention yields 2.07$\times$ speedup versus 1.63$\times$ for FlashAttention alone. On large language models (LLaMA-2 7B, Mistral-7B, Phi-2), we achieve 1.48–1.60$\times$ inference speedups at 70–75% sparsity with minimal quality degradation (ROUGE-L drops of 0.3–0.4 points, perplexity increase of 0.9). Unlike prior regularization methods that target output distributions, we directly minimize representation complexity, yielding both efficiency gains and improved robustness through semantically structured sparsity patterns (IoU 0.73 vs 0.41 for magnitude pruning, CIFAR-100-C mCE 48.7 vs 55.4). Benefits are strongest for geometry and vision transformers, with more modest but measurable gains on LLMs, demonstrating that complexity regularization offers a principled pathway to joint efficiency-robustness optimization.

[386] Self-Supervised Temporal Super-Resolution of Energy Data using Generative Adversarial Transformer

Xuanhao Mu, Gökhan Demirel, Yuzhe Zhang, Jianlei Liu, Thorsten Schlachter, Veit Hagenmeyer

Main category: cs.LG

TL;DR: This paper introduces a Generative Adversarial Transformers (GATs) method for time series upsampling in energy systems that can be trained without ground-truth high-resolution data, achieving 10% RMSE reduction and 13% MPC accuracy improvement over conventional methods.

Motivation: To address the temporal granularity gap in energy network design and operation, where conventional upsampling methods cause information loss/noise, and advanced models face application paradoxes requiring unavailable high-resolution data for training.

Method: Proposes Generative Adversarial Transformers (GATs) that can be trained without access to ground-truth high-resolution data, overcoming the fundamental application paradox of supervised learning approaches.

Result: The method reduces root mean square error (RMSE) by 10% compared to conventional interpolation methods and improves model predictive control (MPC) application accuracy by 13%.

Conclusion: The GATs approach successfully addresses the upsampling challenge in energy systems by enabling training without high-resolution ground-truth data while significantly outperforming conventional methods.

Abstract: To bridge the temporal granularity gap in energy network design and operation based on Energy System Models, resampling of time series is required. While conventional upsampling methods are computationally efficient, they often result in significant information loss or increased noise. Advanced models such as time series generation models, Super-Resolution models and imputation models show potential, but also face fundamental challenges. The goal of time series generative models is to learn the distribution of the original data to generate high-resolution series with similar statistical characteristics. This is not entirely consistent with the definition of upsampling. Time series Super-Resolution models or imputation models can degrade the accuracy of upsampling because the input low-resolution time series are sparse and may have insufficient context. Moreover, such models usually rely on supervised learning paradigms. This presents a fundamental application paradox: their training requires the high-resolution time series that is intrinsically absent in upsampling application scenarios. To address the mentioned upsampling issue, this paper introduces a new method utilizing Generative Adversarial Transformers (GATs), which can be trained without access to any ground-truth high-resolution data. Compared with conventional interpolation methods, the introduced method can reduce the root mean square error (RMSE) of upsampling tasks by 10%, and the accuracy of a model predictive control (MPC) application scenario is improved by 13%.
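
As a point of reference for the "conventional interpolation" baseline and the RMSE metric mentioned above, here is a minimal sketch. It is not the paper's GAT method; the function names and the toy series are assumptions made for illustration.

```python
import math

def upsample_linear(series, factor):
    """Insert factor-1 linearly interpolated points between consecutive samples."""
    if factor < 1 or len(series) < 2:
        raise ValueError("need factor >= 1 and at least two samples")
    out = []
    for left, right in zip(series, series[1:]):
        for k in range(factor):
            out.append(left + (right - left) * k / factor)
    out.append(series[-1])  # keep the final low-resolution sample
    return out

def rmse(predicted, reference):
    """Root mean square error between two equally long series."""
    if len(predicted) != len(reference):
        raise ValueError("series must have the same length")
    squared = sum((p - r) ** 2 for p, r in zip(predicted, reference))
    return math.sqrt(squared / len(predicted))

# Toy data (hypothetical): hourly readings and the true half-hourly series.
hourly = [0.0, 2.0, 4.0]
half_hourly_truth = [0.0, 1.5, 2.0, 2.5, 4.0]
estimate = upsample_linear(hourly, 2)
error = rmse(estimate, half_hourly_truth)
```

The interpolated values miss the intermediate peaks and dips of the true series, which is the kind of information loss the abstract attributes to conventional upsampling.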

[387] Coresets from Trajectories: Selecting Data via Correlation of Loss Differences

Manish Nagaraj, Deepak Ravikumar, Kaushik Roy

Main category: cs.LG

TL;DR: CLD is a scalable coreset selection method that uses loss trajectory alignment with validation data to identify impactful training samples, achieving efficient performance without costly gradient computations.

DetailsMotivation: Deep learning models face scalability challenges in real-time or resource-constrained scenarios, requiring efficient methods to select the most important training samples.

Method: Proposes Correlation of Loss Differences (CLD) metric that measures alignment between training samples’ loss trajectories and a held-out validation set, requiring only per-sample loss values from training checkpoints.

Result: CLD-based coresets outperform or match state-of-the-art methods on CIFAR-100 and ImageNet-1k, remain within 1% of expensive baselines, transfer effectively across architectures with <1% degradation, and provide inherent bias reduction.

Conclusion: CLD is a principled, efficient, stable, and transferable tool for scalable dataset optimization that avoids costly computations while maintaining strong performance.

Abstract: Deep learning models achieve state-of-the-art performance across domains but face scalability challenges in real-time or resource-constrained scenarios. To address this, we propose Correlation of Loss Differences (CLD), a simple and scalable metric for coreset selection that identifies the most impactful training samples by measuring their alignment with the loss trajectories of a held-out validation set. CLD is highly efficient, requiring only per-sample loss values computed at training checkpoints, and avoiding the costly gradient and curvature computations used in many existing subset selection methods. We develop a general theoretical framework that establishes convergence guarantees for CLD-based coresets, demonstrating that the convergence error is upper-bounded by the alignment of the selected samples and the representativeness of the validation set. On CIFAR-100 and ImageNet-1k, CLD-based coresets typically outperform or closely match state-of-the-art methods across subset sizes, and remain within 1% of more computationally expensive baselines even when not leading. CLD transfers effectively across architectures (ResNet, VGG, DenseNet), enabling proxy-to-target selection with <1% degradation. Moreover, CLD is stable when using only early checkpoints, incurring negligible accuracy loss. Finally, CLD exhibits inherent bias reduction via per-class validation alignment, obviating the need for additional stratified sampling. Together, these properties make CLD a principled, efficient, stable, and transferable tool for scalable dataset optimization.
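
The idea of scoring training samples by how their loss changes track the validation loss changes can be sketched as follows. This is a simplified reading of the abstract, assuming a Pearson correlation against the mean validation trajectory; the paper's exact definition (for example its per-class validation alignment) may differ, and all names and numbers here are illustrative.

```python
import math

def differences(losses):
    """Loss change between consecutive checkpoints."""
    return [b - a for a, b in zip(losses, losses[1:])]

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # a flat trajectory carries no alignment signal
    return cov / (sx * sy)

def cld_scores(train_losses, val_losses):
    """train_losses[i][t] and val_losses[j][t] are per-sample losses at checkpoint t."""
    n_ckpt = len(val_losses[0])
    mean_val = [sum(v[t] for v in val_losses) / len(val_losses) for t in range(n_ckpt)]
    val_diff = differences(mean_val)
    return [pearson(differences(tr), val_diff) for tr in train_losses]

def select_coreset(train_losses, val_losses, k):
    """Indices of the k training samples with the highest score."""
    scores = cld_scores(train_losses, val_losses)
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])

# Hypothetical losses at four checkpoints.
train = [[2.0, 1.5, 1.2, 1.0], [2.0, 2.1, 1.9, 2.2], [1.8, 1.2, 0.9, 0.8]]
val = [[2.0, 1.4, 1.1, 1.0], [2.2, 1.6, 1.2, 1.0]]
chosen = select_coreset(train, val, 2)
```

Only stored loss values are needed, which matches the abstract's point that no gradient or curvature computation is required.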

[388] Optimizing In-Context Learning for Efficient Full Conformal Prediction

Weicao Deng, Sangwoo Park, Min Li, Osvaldo Simeone

Main category: cs.LG

TL;DR: E-ICL+FCP is an enhanced in-context learning framework for full conformal prediction that uses a permutation-invariant Transformer trained with CP-aware loss to simulate multiple retrained models, achieving better efficiency-coverage trade-offs than existing methods.

DetailsMotivation: Current conformal prediction methods face complementary limitations: split CP is data inefficient due to dataset partitioning, while full CP improves data efficiency but has prohibitive retraining complexity. Existing meta-learning and ICL approaches don't specifically optimize for CP, leading to large prediction sets.

Method: Enhanced ICL-based FCP framework using a permutation-invariant Transformer-based ICL model trained with a CP-aware loss to simulate multiple retrained models required by full CP without actual retraining.

Result: Experiments on synthetic and real tasks show E-ICL+FCP achieves superior efficiency-coverage trade-offs compared to existing SCP and FCP baselines, preserving coverage while reducing inefficiency and computational overhead.

Conclusion: E-ICL+FCP provides an efficient full conformal prediction framework that overcomes the limitations of both split and full CP variants, delivering reliable uncertainty quantification with improved data efficiency and reduced computational cost.

Abstract: Reliable uncertainty quantification is critical for trustworthy AI. Conformal Prediction (CP) provides prediction sets with distribution-free coverage guarantees, but its two main variants face complementary limitations. Split CP (SCP) suffers from data inefficiency due to dataset partitioning, while full CP (FCP) improves data efficiency at the cost of prohibitive retraining complexity. Recent approaches based on meta-learning or in-context learning (ICL) partially mitigate these drawbacks. However, they rely on training procedures not specifically tailored to CP, which may yield large prediction sets. We introduce an efficient FCP framework, termed enhanced ICL-based FCP (E-ICL+FCP), which employs a permutation-invariant Transformer-based ICL model trained with a CP-aware loss. By simulating the multiple retrained models required by FCP without actual retraining, E-ICL+FCP preserves coverage while markedly reducing both inefficiency and computational overhead. Experiments on synthetic and real tasks demonstrate that E-ICL+FCP attains superior efficiency-coverage trade-offs compared to existing SCP and FCP baselines.
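
To make the retraining cost of full CP concrete, the sketch below refits a toy model once per candidate label. It uses a nearest-class-mean nonconformity score on one-dimensional inputs, which is an assumption chosen for brevity and not the paper's ICL model; E-ICL+FCP aims to replace exactly this refit loop.

```python
def nonconformity_scores(points):
    """Distance of each x to the mean of its own class, refit on `points`."""
    sums, counts = {}, {}
    for x, y in points:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    centroid = {y: sums[y] / counts[y] for y in sums}
    return [abs(x - centroid[y]) for x, y in points]

def full_conformal_set(train, x_new, labels, alpha):
    """Labels whose conformal p-value exceeds alpha for the test input x_new."""
    prediction_set = []
    for y in labels:
        augmented = train + [(x_new, y)]
        scores = nonconformity_scores(augmented)  # one refit per candidate label
        p_value = sum(s >= scores[-1] for s in scores) / len(scores)
        if p_value > alpha:
            prediction_set.append(y)
    return prediction_set

# Hypothetical two-class data with well-separated classes.
train = [(-0.2, 0), (0.0, 0), (0.1, 0), (0.3, 0),
         (9.8, 1), (10.0, 1), (10.1, 1), (10.3, 1)]
result = full_conformal_set(train, x_new=0.1, labels=[0, 1], alpha=0.2)
```

Every training point is used both to fit the model and to calibrate, which is why full CP is more data-efficient than split CP, at the price of one refit per candidate label and test input.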

[389] Selective Risk Certification for LLM Outputs via Information-Lift Statistics: PAC-Bayes, Robustness, and Skeleton Design

Sanjeda Akter, Ibne Farabi Shihab, Anuj Sharma

Main category: cs.LG

TL;DR: Information-lift certificates use sub-gamma PAC-Bayes bounds to provide formal uncertainty quantification with abstention guarantees, outperforming baselines by blocking 96% of critical errors.

DetailsMotivation: Large language models often produce confident but incorrect outputs, creating a critical need for reliable uncertainty quantification with formal abstention guarantees.

Method: Introduces information-lift certificates that compare model probabilities to a skeleton baseline, accumulating evidence through sub-gamma PAC-Bayes bounds that remain valid under heavy-tailed distributions.

Result: Achieves 77.0% coverage at 2% risk on eight diverse datasets, outperforming recent baselines by 10.0 percentage points. Blocks 96% of critical errors in high-stakes scenarios compared to 18-31% for entropy-based methods.

Conclusion: While frequency-based certification doesn’t guarantee severity-weighted safety and depends on skeleton quality, performance degrades gracefully under distributional shifts, making the approach practical for real-world deployment.

Abstract: Large language models often produce confident but incorrect outputs, creating a critical need for reliable uncertainty quantification with formal abstention guarantees. We introduce information-lift certificates that compare model probabilities to a skeleton baseline, accumulating evidence through sub-gamma PAC-Bayes bounds that remain valid under heavy-tailed distributions where standard concentration inequalities fail. On eight diverse datasets, our method achieves 77.0% coverage at 2% risk, outperforming recent baselines by 10.0 percentage points on average. In high-stakes scenarios, we block 96% of critical errors compared to 18-31% for entropy-based methods. While our frequency-based certification does not guarantee severity-weighted safety and depends on skeleton quality, performance degrades gracefully under distributional shifts, making the approach practical for real-world deployment.
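
For readers unfamiliar with "coverage at a given risk", the following sketch computes the empirical version of that quantity from confidence scores and correctness flags. It carries none of the paper's PAC-Bayes guarantees; the function name and the toy data are assumptions.

```python
def coverage_at_risk(scores, correct, target_risk):
    """Largest fraction of examples that can be answered, taking the highest
    scores first, while the error rate among answered examples stays at or
    below target_risk. The remaining examples are abstained on."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    best, errors = 0, 0
    for answered, i in enumerate(order, start=1):
        errors += 0 if correct[i] else 1
        if errors / answered <= target_risk:
            best = answered
    return best / len(scores)

# Hypothetical scores and correctness flags.
scores = [0.9, 0.8, 0.7, 0.6, 0.5]
correct = [True, True, True, False, True]
strict = coverage_at_risk(scores, correct, 0.0)
loose = coverage_at_risk(scores, correct, 0.2)
```

A certificate in the paper's sense would additionally bound the risk on unseen data, which this purely empirical calculation does not do.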

[390] MMG: Mutual Information Estimation via the MMSE Gap in Diffusion

Longxuan Yu, Xing Shi, Xianghao Kong, Tong Jia, Greg Ver Steeg

Main category: cs.LG

TL;DR: Diffusion models can estimate mutual information via the MMSE gap between conditional and unconditional diffusion, integrated over all SNRs, outperforming traditional methods.

DetailsMotivation: Mutual information is fundamental but hard to estimate for complex systems, and diffusion models excel at density estimation, suggesting they could improve MI estimation.

Method: Use diffusion models to compute MI as half the integrated MMSE gap between conditional and unconditional diffusion across all SNRs, with adaptive importance sampling for scalability.

Result: The method passes self-consistency tests, outperforms traditional and score-based diffusion MI estimators, and works well even for high MI values.

Conclusion: Diffusion models provide an effective and scalable approach to mutual information estimation, leveraging their information-theoretic formulation.

Abstract: Mutual information (MI) is one of the most general ways to measure relationships between random variables, but estimating this quantity for complex systems is challenging. Denoising diffusion models have recently set a new bar for density estimation, so it is natural to consider whether these methods could also be used to improve MI estimation. Using the recently introduced information-theoretic formulation of denoising diffusion models, we show the diffusion models can be used in a straightforward way to estimate MI. In particular, the MI corresponds to half the gap in the Minimum Mean Square Error (MMSE) between conditional and unconditional diffusion, integrated over all Signal-to-Noise-Ratios (SNRs) in the noising process. Our approach not only passes self-consistency tests but also outperforms traditional and score-based diffusion MI estimators. Furthermore, our method leverages adaptive importance sampling to achieve scalable MI estimation, while maintaining strong performance even when the MI is high.
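
The stated relation (MI equals half the MMSE gap integrated over all SNRs, in nats) can be checked numerically in a case where everything is known in closed form. The sketch below assumes X ~ N(0, 1) and Y = X plus Gaussian noise, so no diffusion model is involved; the function names and integration grid are illustrative choices.

```python
import math

def mmse_gap(snr, posterior_var):
    """MMSE of a unit-variance Gaussian X observed at the given SNR, minus the
    MMSE when Y is also known (X given Y has variance posterior_var)."""
    unconditional = 1.0 / (1.0 + snr)
    conditional = posterior_var / (1.0 + snr * posterior_var)
    return unconditional - conditional

def mi_from_mmse_gap(noise_var, log_snr_min=-30.0, log_snr_max=30.0, steps=6000):
    """Half the integral of the MMSE gap over SNR, via the trapezoid rule in log-SNR."""
    posterior_var = noise_var / (1.0 + noise_var)  # Var(X | Y) for Y = X + N(0, noise_var)
    h = (log_snr_max - log_snr_min) / steps
    total = 0.0
    for k in range(steps + 1):
        snr = math.exp(log_snr_min + k * h)
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * mmse_gap(snr, posterior_var) * snr  # d(snr) = snr * d(log snr)
    return 0.5 * total * h

noise_var = 1.0
closed_form = 0.5 * math.log(1.0 + 1.0 / noise_var)  # Gaussian-channel MI in nats
estimate = mi_from_mmse_gap(noise_var)
```

In the paper's setting the two MMSE curves would come from trained unconditional and conditional denoisers rather than from closed-form expressions.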

[391] Think Smart, Not Hard: Difficulty Adaptive Reasoning for Large Audio Language Models

Zhichao Sheng, Shilin Zhou, Chen Gong, Zhenghua Li

Main category: cs.LG

TL;DR: Proposes a difficulty-adaptive reasoning method for Large Audio Language Models that dynamically adjusts reasoning depth based on problem complexity, improving both performance and efficiency.

DetailsMotivation: Current LALMs use a "one-size-fits-all" reasoning depth, causing overthinking on simple problems and insufficient reasoning on complex ones. The paper aims to enable smart reasoning by adapting depth to problem difficulty.

Method: Develops a reward function that dynamically links reasoning length to perceived problem difficulty, encouraging concise reasoning for easy tasks and elaborate reasoning for complex ones.

Result: Extensive experiments show the method is effective and efficient, improving task performance while significantly reducing average reasoning length.

Conclusion: The proposed difficulty-adaptive reasoning method successfully enables LALMs to reason smartly by adapting to problem complexity, with analysis providing insights for future work.

Abstract: Large Audio Language Models (LALMs), powered by the chain-of-thought (CoT) paradigm, have shown remarkable reasoning capabilities. Intuitively, different problems often require varying depths of reasoning. While some methods can determine whether to reason for a given problem, they typically lack a fine-grained mechanism to modulate how much to reason. This often results in a "one-size-fits-all" reasoning depth, which generates redundant overthinking for simple questions while failing to allocate sufficient thought to complex ones. In this paper, we conduct an in-depth analysis of LALMs and find that an effective and efficient LALM should reason smartly by adapting its reasoning depth to the problem’s complexity. To achieve this, we propose a difficulty-adaptive reasoning method for LALMs. Specifically, we propose a reward function that dynamically links reasoning length to the model’s perceived problem difficulty. This reward encourages shorter, concise reasoning for easy tasks and more elaborate, in-depth reasoning for complex ones. Extensive experiments demonstrate that our method is both effective and efficient, simultaneously improving task performance and significantly reducing the average reasoning length. Further analysis on reasoning structure paradigm offers valuable insights for future work.
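
The abstract does not give the reward's functional form. Purely as an illustration of how a reward could tie reasoning length to difficulty, the sketch below gives each problem a length budget that grows with a difficulty estimate in [0, 1] and penalizes only the tokens beyond that budget. Every name, constant, and the linear shape are hypothetical and not taken from the paper.

```python
def length_budget(difficulty, min_len=64, max_len=1024):
    """Hypothetical budget: easy problems (difficulty near 0) get few tokens,
    hard problems (difficulty near 1) get up to max_len."""
    return min_len + difficulty * (max_len - min_len)

def adaptive_reward(correct, length, difficulty, penalty=0.5, max_len=1024):
    """Correctness reward minus a penalty on tokens spent beyond the budget."""
    budget = length_budget(difficulty, max_len=max_len)
    excess = max(0.0, length - budget) / max_len
    return (1.0 if correct else 0.0) - penalty * excess

short_easy = adaptive_reward(True, 64, difficulty=0.0)
long_easy = adaptive_reward(True, 1024, difficulty=0.0)
long_hard = adaptive_reward(True, 1024, difficulty=1.0)
```

Under this shaping, a long chain of thought costs reward on an easy problem but is free on a hard one, which mirrors the qualitative behavior the abstract describes.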

[392] Observation-Free Attacks on Online Learning to Rank

Sameep Chattopadhyay, Nikhil Karamchandani, Sharayu Moharir

Main category: cs.LG

TL;DR: The paper presents novel attack strategies against online learning to rank algorithms that can promote target items to top-K recommendations while causing linear regret to the learning algorithm, requiring only O(log T) manipulations.

DetailsMotivation: Online learning to rank algorithms are widely used in search engines and content recommenders, but their vulnerability to coordinated adversarial attacks is poorly understood, motivating the need to study and demonstrate effective attack strategies.

Method: Proposed two novel attack strategies: CascadeOFA for CascadeUCB1 and PBMOFA for PBM-UCB, designed to manipulate rankings while minimizing the number of required manipulations.

Result: Both attack strategies theoretically require only O(log T) manipulations to succeed in promoting target items to top-K recommendations for T - o(T) rounds while inducing linear regret in the learning algorithm.

Conclusion: The study reveals significant vulnerabilities in widely used OLTR algorithms to coordinated adversarial attacks, with practical implications for the security of search engines and recommendation systems.

Abstract: Online learning to rank (OLTR) plays a critical role in information retrieval and machine learning systems, with a wide range of applications in search engines and content recommenders. However, despite their extensive adoption, the susceptibility of OLTR algorithms to coordinated adversarial attacks remains poorly understood. In this work, we present a novel framework for attacking some of the widely used OLTR algorithms. Our framework is designed to promote a set of target items so that they appear in the list of top-K recommendations for T - o(T) rounds, while simultaneously inducing linear regret in the learning algorithm. We propose two novel attack strategies: CascadeOFA for CascadeUCB1 and PBMOFA for PBM-UCB. We provide theoretical guarantees showing that both strategies require only O(log T) manipulations to succeed. Additionally, we supplement our theoretical analysis with empirical results on real-world data.

[393] SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse-Linear Attention

Jintao Zhang, Haoxu Wang, Kai Jiang, Shuo Yang, Kaiwen Zheng, Haocheng Xi, Ziteng Wang, Hongzhou Zhu, Min Zhao, Ion Stoica, Joseph E. Gonzalez, Jun Zhu, Jianfei Chen

Main category: cs.LG

TL;DR: SLA (Sparse-Linear Attention) accelerates diffusion transformers by fusing sparse and linear attention, reducing attention computation by 95% and achieving 2.2x end-to-end speedup in video generation without quality loss.

DetailsMotivation: Attention latency is a major bottleneck in Diffusion Transformer (DiT) models for video generation due to long sequence lengths and quadratic complexity of standard attention mechanisms.

Method: SLA classifies attention weights into critical (O(N²) attention), marginal (O(N) linear attention), and negligible (skipped) categories, then fuses these computations into a single GPU kernel for both forward and backward passes.

Result: After a few fine-tuning steps, SLA reduces attention computation by 95% (a 20x reduction) without degrading generation quality. Its GPU kernel yields a 13.7x speedup in attention computation and a 2.2x end-to-end speedup in video generation on Wan2.1-1.3B.

Conclusion: SLA provides an effective trainable attention method that significantly accelerates DiT models while maintaining generation quality, outperforming baseline methods through its sparse-linear fusion approach.

Abstract: In Diffusion Transformer (DiT) models, particularly for video generation, attention latency is a major bottleneck due to the long sequence length and the quadratic complexity. We find that attention weights can be separated into two parts: a small fraction of large weights with high rank and the remaining weights with very low rank. This naturally suggests applying sparse acceleration to the first part and low-rank acceleration to the second. Based on this finding, we propose SLA (Sparse-Linear Attention), a trainable attention method that fuses sparse and linear attention to accelerate diffusion models. SLA classifies attention weights into critical, marginal, and negligible categories, applying O(N^2) attention to critical weights, O(N) attention to marginal weights, and skipping negligible ones. SLA combines these computations into a single GPU kernel and supports both forward and backward passes. With only a few fine-tuning steps using SLA, DiT models achieve a 20x reduction in attention computation, resulting in significant acceleration without loss of generation quality. Experiments show that SLA reduces attention computation by 95% without degrading end-to-end generation quality, outperforming baseline methods. In addition, we implement an efficient GPU kernel for SLA, which yields a 13.7x speedup in attention computation and a 2.2x end-to-end speedup in video generation on Wan2.1-1.3B. The code is available at https://github.com/thu-ml/SLA.
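
The three-way split of attention weights can be sketched at block level as follows. This assumes that each attention block already has an importance score and that the categories are chosen by rank with fixed fractions; how SLA actually predicts importance and sets its thresholds is not specified in the abstract, so the function name and fractions are hypothetical.

```python
def classify_blocks(scores, critical_frac, negligible_frac):
    """Label each block 'critical' (full attention), 'marginal' (linear
    attention), or 'negligible' (skipped) by rank of its importance score."""
    n = len(scores)
    n_crit = min(n, round(critical_frac * n))
    n_neg = min(n - n_crit, round(negligible_frac * n))
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    labels = ["marginal"] * n
    for i in order[:n_crit]:
        labels[i] = "critical"
    for i in order[n - n_neg:]:
        labels[i] = "negligible"
    return labels

# Hypothetical importance scores for ten attention blocks.
scores = [0.9, 0.05, 0.3, 0.01, 0.2, 0.02, 0.5, 0.1, 0.03, 0.04]
labels = classify_blocks(scores, critical_frac=0.2, negligible_frac=0.3)
```

Only the "critical" blocks pay the quadratic cost, which is where the reported reduction in attention computation comes from.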

[394] DeepEN: A Deep Reinforcement Learning Framework for Personalized Enteral Nutrition in Critical Care

Daniel Jason Tan, Jiayang Chen, Dilruk Perera, Kay Choong See, Mengling Feng

Main category: cs.LG

TL;DR: DeepEN is a reinforcement learning framework that personalizes enteral nutrition dosing for ICU patients, achieving a 3.7 percentage-point absolute reduction in estimated mortality compared with the clinician policy.

DetailsMotivation: Current ICU enteral feeding is suboptimal due to limited personalization and uncertainty about appropriate nutrition targets under rapidly changing metabolic demands and heterogeneous patient responses.

Method: Uses reinforcement learning trained on 11,000+ ICU patients from MIMIC-IV database, with dueling double deep Q-network and Conservative Q-Learning regularization for safe policy learning from retrospective data.

Result: Achieved a 3.7 percentage-point absolute reduction in estimated mortality compared with the clinician policy (18.8% vs 22.5%) and higher expected returns compared with guideline-based dosing (11.89 vs 8.11), with improvements in key nutritional biomarkers.

Conclusion: Conservative offline RL is feasible for individualized EN therapy and data-driven personalization may improve outcomes beyond guideline-based approaches.

Abstract: ICU enteral feeding remains sub-optimal due to limited personalization and uncertainty about appropriate calorie, protein, and fluid targets, particularly under rapidly changing metabolic demands and heterogeneous patient responses. This study introduces DeepEN, a reinforcement learning (RL)-based framework that personalizes enteral nutrition (EN) dosing for critically ill patients using electronic health record data. DeepEN was trained on over 11,000 ICU patients from the MIMIC-IV database to generate 4-hourly, patient-specific targets for caloric, protein, and fluid intake. The model’s state space integrates demographics, comorbidities, vital signs, laboratory results, and prior interventions relevant to nutritional management, while its reward function balances short-term physiological and nutrition-related goals with long-term survival. A dueling double deep Q-network with Conservative Q-Learning regularization is used to ensure safe and reliable policy learning from retrospective data. DeepEN achieved a 3.7 $\pm$ 0.17 percentage-point absolute reduction in estimated mortality compared with the clinician policy (18.8% vs 22.5%) and higher expected returns compared with guideline-based dosing (11.89 vs 8.11), with improvements in key nutritional biomarkers. U-shaped associations between deviations from clinician dosing and mortality suggest that the learned policy aligns with high-value clinician actions while diverging from suboptimal ones. These findings demonstrate the feasibility of conservative offline RL for individualized EN therapy and suggest that data-driven personalization may improve outcomes beyond guideline- or heuristic-based approaches.
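
The three ingredients named above (dueling aggregation, the double-DQN target, and the Conservative Q-Learning penalty) have standard textbook forms for discrete actions, sketched below for a single transition. This is not DeepEN's implementation; the function names, the weighting `alpha`, and the numbers are assumptions.

```python
import math

def dueling_q(value, advantages):
    """Dueling aggregation: Q(s, a) = V(s) + A(s, a) - mean over actions of A(s, .)."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]

def double_dqn_target(reward, gamma, q_online_next, q_target_next, done=False):
    """The online network picks the next action, the target network evaluates it."""
    if done:
        return reward
    best = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return reward + gamma * q_target_next[best]

def cql_penalty(q_values, logged_action):
    """Log-sum-exp over all actions minus the Q-value of the logged action.
    Non-negative; it pushes down Q-values of actions absent from the data."""
    m = max(q_values)
    lse = m + math.log(sum(math.exp(q - m) for q in q_values))
    return lse - q_values[logged_action]

def transition_loss(q_values, logged_action, td_target, alpha=1.0):
    """Squared TD error plus the weighted conservative penalty."""
    td_error = q_values[logged_action] - td_target
    return 0.5 * td_error ** 2 + alpha * cql_penalty(q_values, logged_action)

q_s = dueling_q(1.0, [1.0, 2.0, 3.0])
target = double_dqn_target(1.0, 0.9, [0.5, 2.0, 1.0], [0.3, 1.5, 4.0])
penalty = cql_penalty([1.0, 2.0, 3.0], logged_action=1)
```

The penalty is what makes the learned policy conservative: actions that clinicians rarely took in the retrospective data cannot receive inflated value estimates without increasing the loss.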

[395] Models Got Talent: Identifying High Performing Wearable Human Activity Recognition Models Without Training

Richard Goldman, Varun Komperla, Thomas Ploetz, Harish Haresamudram

Main category: cs.LG

TL;DR: Zero Cost Proxies (ZCPs) enable efficient neural architecture search for Human Activity Recognition (HAR), discovering architectures that perform within 5% of those found by full-scale training while requiring minimal computation.

DetailsMotivation: To address the computational expense of Neural Architecture Search (NAS) by developing efficient alternatives that can discover high-performing architectures with minimal training.

Method: Investigated Zero Cost Proxies (ZCPs) for HAR on six benchmark datasets, using single forward/backward passes on randomly sampled data batches to evaluate architecture performance.

Result: ZCPs discovered network architectures that perform within 5% of the performance attained by full-scale training of 1500 randomly sampled architectures, with substantial computational savings.

Conclusion: ZCPs are effective for sensor-based HAR, robust to data noise, and suitable for practical scenarios due to their computational efficiency and performance.

Abstract: A promising alternative to the computationally expensive Neural Architecture Search (NAS) involves the development of Zero Cost Proxies (ZCPs), which correlate well with trained performance, but can be computed through a single forward/backward pass on a randomly sampled batch of data. In this paper, we investigate the effectiveness of ZCPs for HAR on six benchmark datasets, and demonstrate that they discover network architectures that obtain within 5% of performance attained by full-scale training involving 1500 randomly sampled architectures. This results in substantial computational savings as high-performing architectures can be discovered with minimal training. Our experiments not only introduce ZCPs to sensor-based HAR, but also demonstrate that they are robust to data noise, further showcasing their suitability for practical scenarios.

[396] Planning in Branch-and-Bound: Model-Based Reinforcement Learning for Exact Combinatorial Optimization

Paul Strang, Zacharie Alès, Côme Bissuel, Safia Kedad-Sidhoum, Emmanuel Rachelson

Main category: cs.LG

TL;DR: PlanB&B is a model-based reinforcement learning agent that uses learned B&B dynamics to improve branching strategies in MILP optimization, outperforming previous RL methods.

DetailsMotivation: To move beyond static, hand-crafted heuristics for variable selection in branch-and-bound algorithms and leverage recent RL successes in combinatorial problems.

Method: Uses model-based reinforcement learning with a learned internal model of B&B dynamics, inspired by Monte Carlo Tree Search approaches from board games.

Result: Outperformed previous state-of-the-art RL methods across four standard MILP benchmarks in computational experiments.

Conclusion: Model-based RL with learned B&B dynamics can effectively discover improved branching strategies for MILP optimization problems.

Abstract: Mixed-Integer Linear Programming (MILP) lies at the core of many real-world combinatorial optimization (CO) problems, traditionally solved by branch-and-bound (B&B). A key driver influencing B&B solver efficiency is the variable selection heuristic that guides branching decisions. Looking to move beyond static, hand-crafted heuristics, recent work has explored adapting traditional reinforcement learning (RL) algorithms to the B&B setting, aiming to learn branching strategies tailored to specific MILP distributions. In parallel, RL agents have achieved remarkable success in board games, a very specific type of combinatorial problem, by leveraging environment simulators to plan via Monte Carlo Tree Search (MCTS). Building on these developments, we introduce Plan-and-Branch-and-Bound (PlanB&B), a model-based reinforcement learning (MBRL) agent that leverages a learned internal model of the B&B dynamics to discover improved branching strategies. Computational experiments empirically validate our approach, with our MBRL branching agent outperforming previous state-of-the-art RL methods across four standard MILP benchmarks.
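
To show where a variable selection rule plugs into branch-and-bound, here is a minimal sketch on a 0/1 knapsack, a small stand-in for a general MILP. The `select` argument is the branching heuristic; the two hand-crafted selectors are illustrative assumptions and are exactly the kind of static rule a learned policy such as PlanB&B would replace. Nothing here reproduces the paper's agent.

```python
def fractional_bound(values, weights, capacity, fixed):
    """Upper bound from the LP relaxation, given items fixed to 0 or 1.
    Returns None if the fixed items already exceed the capacity."""
    value = sum(values[i] for i, x in fixed.items() if x == 1)
    room = capacity - sum(weights[i] for i, x in fixed.items() if x == 1)
    if room < 0:
        return None
    free = [i for i in range(len(values)) if i not in fixed]
    free.sort(key=lambda i: values[i] / weights[i], reverse=True)
    for i in free:
        take = min(1.0, room / weights[i])  # the last item may be taken fractionally
        value += take * values[i]
        room -= take * weights[i]
        if room <= 0:
            break
    return value

def branch_and_bound(values, weights, capacity, select):
    """Depth-first B&B. Returns (best value found, number of nodes explored)."""
    best, nodes = 0, 0
    stack = [{}]
    while stack:
        fixed = stack.pop()
        nodes += 1
        bound = fractional_bound(values, weights, capacity, fixed)
        if bound is None or bound <= best:
            continue  # prune: infeasible, or cannot beat the incumbent
        free = [i for i in range(len(values)) if i not in fixed]
        if not free:
            best = bound  # every item is fixed, so the bound is the exact value
            continue
        i = select(free, values, weights)  # the branching decision
        stack.append({**fixed, i: 0})
        stack.append({**fixed, i: 1})
    return best, nodes

def first_free(free, values, weights):
    return free[0]

def best_ratio(free, values, weights):
    return max(free, key=lambda i: values[i] / weights[i])

values, weights, capacity = [60, 100, 120], [10, 20, 30], 50
best_a, nodes_a = branch_and_bound(values, weights, capacity, first_free)
best_b, nodes_b = branch_and_bound(values, weights, capacity, best_ratio)
```

Both selectors reach the same optimum because B&B is exact; what a branching rule changes is the number of nodes explored, which is the quantity a learned strategy tries to reduce.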

[397] WildfireGenome: Interpretable Machine Learning Reveals Local Drivers of Wildfire Risk and Their Cross-County Variation

Chenyue Liu, Ali Mostafavi

Main category: cs.LG

TL;DR: WildfireGenome improves wildfire risk assessment by combining federal indicators into interpretable composite risk labels, using Random Forest models with SHAP/ICE analysis to reveal local drivers like needleleaf forest cover and elevation.

DetailsMotivation: Current wildfire risk assessments use coarse maps and opaque ML models that lack interpretability at decision-making scales, limiting practical utility for vegetation management and planning.

Method: Three components: (1) fuse 7 federal wildfire indicators into PCA-based composite risk labels at high resolution; (2) Random Forest classification; (3) SHAP and ICE/PDP analyses to expose nonlinear driver relationships.

Result: Models achieved 0.755-0.878 accuracy and Quadratic Weighted Kappa up to 0.951 across 7 diverse US counties, with principal components explaining 87-94% of variance. Transfer tests worked well in similar ecological regions but failed across dissimilar contexts.

Conclusion: WildfireGenome advances wildfire risk assessment from regional prediction to interpretable, decision-scale analytics that can guide vegetation management, zoning, and infrastructure planning, with needleleaf forest cover and elevation as key drivers.

Abstract: Current wildfire risk assessments rely on coarse hazard maps and opaque machine learning models that optimize regional accuracy while sacrificing interpretability at the decision scale. WildfireGenome addresses these gaps through three components: (1) fusion of seven federal wildfire indicators into a sign-aligned, PCA-based composite risk label at H3 Level-8 resolution; (2) Random Forest classification of local wildfire risk; and (3) SHAP and ICE/PDP analyses to expose county-specific nonlinear driver relationships. Across seven ecologically diverse U.S. counties, models achieve accuracies of 0.755-0.878 and Quadratic Weighted Kappa up to 0.951, with principal components explaining 87-94% of indicator variance. Transfer tests show reliable performance between ecologically similar regions but collapse across dissimilar contexts. Explanations consistently highlight needleleaf forest cover and elevation as dominant drivers, with risk rising sharply at 30-40% needleleaf coverage. WildfireGenome advances wildfire risk assessment from regional prediction to interpretable, decision-scale analytics that guide vegetation management, zoning, and infrastructure planning.
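
Quadratic Weighted Kappa, the agreement metric reported above for ordinal risk classes, can be computed as in the sketch below. The formula is the standard one; the example labels are made up and unrelated to the paper's data.

```python
def quadratic_weighted_kappa(y_true, y_pred, n_classes):
    """1 - (weighted observed disagreement) / (weighted expected disagreement),
    with weights (i - j)^2 / (n_classes - 1)^2. Requires n_classes >= 2 and
    labels that are not all concentrated in one identical class."""
    n = len(y_true)
    observed = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    true_hist = [sum(row) for row in observed]
    pred_hist = [sum(observed[i][j] for i in range(n_classes)) for j in range(n_classes)]
    num = den = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            w = (i - j) ** 2 / (n_classes - 1) ** 2
            num += w * observed[i][j]
            den += w * true_hist[i] * pred_hist[j] / n  # agreement expected by chance
    return 1.0 - num / den

# Hypothetical ordinal risk classes 0 (low), 1 (medium), 2 (high).
perfect = quadratic_weighted_kappa([0, 1, 2], [0, 1, 2], 3)
partial = quadratic_weighted_kappa([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 1], 3)
```

Because the weights grow with the squared distance between classes, confusing "low" with "high" is penalized four times as much as an off-by-one error, which suits ordered risk levels.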

[398] Uncertainty Makes It Stable: Curiosity-Driven Quantized Mixture-of-Experts

Sebastián Andrés Cajas Ordóñez, Luis Fernando Torres Torres, Mackenzie J. Meni, Carlos Andrés Duran Paredes, Eric Arazo, Cristian Bosch, Ricardo Simon Carbajo, Yuan Lai, Leo Anthony Celi

Main category: cs.LG

TL;DR: A curiosity-driven quantized Mixture-of-Experts framework achieves 99.9% of 16-bit accuracy with 4-bit quantization, 4x compression, 41% energy savings, and 82% reduction in latency variance for edge deployment.

DetailsMotivation: Address challenges of maintaining accuracy under aggressive quantization while ensuring predictable inference latency for deep neural networks on resource-constrained devices.

Method: Curiosity-driven quantized Mixture-of-Experts framework using Bayesian epistemic uncertainty-based routing across heterogeneous experts (BitNet ternary, 1-16 bit BitLinear, post-training quantization).

Result: 4-bit quantization maintains 99.9% of 16-bit accuracy (0.858 vs 0.859 F1) with 4x compression, 41% energy savings vs 8-bit, and 82% reduction in MoE latency variance (from 230ms to 29ms standard deviation).

Conclusion: Adaptive quantization yields accurate, energy-efficient, and predictable edge models, with simple 4-bit quantized architectures outperforming complex MoE for most deployments.

Abstract: Deploying deep neural networks on resource-constrained devices faces two critical challenges: maintaining accuracy under aggressive quantization while ensuring predictable inference latency. We present a curiosity-driven quantized Mixture-of-Experts framework that addresses both through Bayesian epistemic uncertainty-based routing across heterogeneous experts (BitNet ternary, 1-16 bit BitLinear, post-training quantization). Evaluated on audio classification benchmarks (ESC-50, Quinn, UrbanSound8K), our 4-bit quantization maintains 99.9 percent of 16-bit accuracy (0.858 vs 0.859 F1) with 4x compression and 41 percent energy savings versus 8-bit. Crucially, curiosity-driven routing reduces MoE latency variance by 82 percent (p = 0.008, Levene’s test) from 230 ms to 29 ms standard deviation, enabling stable inference for battery-constrained devices. Statistical analysis confirms 4-bit/8-bit achieve practical equivalence with full precision (p > 0.05), while MoE architectures introduce 11 percent latency overhead (p < 0.001) without accuracy gains. At scale, deployment emissions dominate training by 10000x for models serving more than 1,000 inferences, making inference efficiency critical. Our information-theoretic routing demonstrates that adaptive quantization yields accurate (0.858 F1, 1.2M params), energy-efficient (3.87 F1/mJ), and predictable edge models, with simple 4-bit quantized architectures outperforming complex MoE for most deployments.
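
As background for the 4-bit results and the 4x compression figure, the sketch below shows generic symmetric uniform quantization of a weight list. It is not the paper's BitLinear or BitNet experts, and it says nothing about the uncertainty-based routing; the names and weights are illustrative.

```python
def quantize_symmetric(weights, bits):
    """Map floats to signed integer codes in [-qmax, qmax] with one shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4 bits
    scale = max(abs(w) for w in weights) / qmax
    if scale == 0:
        scale = 1.0  # all-zero weights: any scale works
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    return [c * scale for c in codes]

weights = [0.7, -0.3, 0.12, 0.0]  # hypothetical weights
codes, scale = quantize_symmetric(weights, bits=4)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
compression_vs_16bit = 16 / 4
```

The rounding error per weight is at most half the scale, so the accuracy cost depends on how the weight range compares with the 15 available levels.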

[399] CATCHFed: Efficient Unlabeled Data Utilization for Semi-Supervised Federated Learning in Limited Labels Environments

Byoungjun Park, Pedro Porto Buarque de Gusmão, Dongjin Ji, Minhoe Kim

Main category: cs.LG

TL;DR: CATCHFed is a semi-supervised federated learning method that addresses performance degradation in limited-label scenarios using adaptive thresholds and hybrid approaches to improve pseudo-label quality and leverage unlabeled data.

DetailsMotivation: Real-world federated learning often lacks client-side labeled data, and existing semi-supervised FL methods suffer significant performance degradation when labeled data is scarce.

Method: Proposes client-aware adaptive thresholds considering class difficulty, hybrid thresholds for better pseudo-label quality, and uses unpseudo-labeled data for consistency regularization.

Result: Extensive experiments show CATCHFed effectively leverages unlabeled client data and achieves superior performance even in extremely limited-label settings across various datasets.

Conclusion: CATCHFed successfully addresses the challenge of limited labeled data in federated learning through adaptive thresholding and hybrid approaches, demonstrating robust performance in scarce-label scenarios.

Abstract: Federated learning is a promising paradigm that utilizes distributed client resources while preserving data privacy. Most existing FL approaches assume clients possess labeled data; however, in real-world scenarios, client-side labels are often unavailable. Semi-supervised federated learning, where only the server holds labeled data, addresses this issue. However, it experiences significant performance degradation as the amount of labeled data decreases. To tackle this problem, we propose CATCHFed, which introduces client-aware adaptive thresholds considering class difficulty and hybrid thresholds to enhance pseudo-label quality, and utilizes unpseudo-labeled data for consistency regularization. Extensive experiments across various datasets and configurations demonstrate that CATCHFed effectively leverages unlabeled client data, achieving superior performance even in extremely limited-label settings.
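
To make the thresholding idea concrete, here is a minimal sketch of splitting unlabeled samples into pseudo-labeled and unpseudo-labeled groups using per-class confidence thresholds. The summary does not say how CATCHFed computes its client-aware or hybrid thresholds, so the thresholds are simply passed in; `split_by_threshold` is a hypothetical helper, not the authors' code.

```python
def split_by_threshold(probabilities, class_thresholds):
    """Split samples by whether the top predicted class clears its threshold.

    probabilities: list of per-sample class probability lists.
    class_thresholds: one confidence threshold per class (assumed given).
    Returns (pseudo_labeled, unpseudo_labeled), where pseudo_labeled holds
    (sample_index, label) pairs and unpseudo_labeled holds sample indices.
    """
    pseudo_labeled, unpseudo_labeled = [], []
    for idx, probs in enumerate(probabilities):
        label = max(range(len(probs)), key=probs.__getitem__)
        if probs[label] >= class_thresholds[label]:
            pseudo_labeled.append((idx, label))
        else:
            # Low-confidence samples are kept for consistency regularization
            # instead of being discarded.
            unpseudo_labeled.append(idx)
    return pseudo_labeled, unpseudo_labeled


if __name__ == "__main__":
    probs = [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]]
    print(split_by_threshold(probs, class_thresholds=[0.95, 0.7]))
```

A harder class would typically get a lower threshold so that it still receives pseudo-labels; that choice is left to the caller here.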

[400] AdamX: An Adam improvement algorithm based on a novel exponential decay mechanism for the second-order moment estimate

Meng Zhu, Quan Xiao, Weidong Min

Main category: cs.LG

TL;DR: AdamX is a new optimization algorithm that improves upon Adam by introducing a novel second-order moment estimation exponential decay rate that gradually reduces learning step correction strength, eventually degrading to SGD for better training stability and generalization.

DetailsMotivation: Adam tends to converge to non-flat minima compared to SGD-based algorithms, which can negatively affect generalization performance in large language model training.

Method: Proposes AdamX with a novel second-order moment estimation exponential decay rate that weakens learning step correction strength over time and degrades to SGD during stable training periods.

Result: Experimental results show AdamX’s second-order moment estimation exponential decay rate outperforms current methods, and AdamX consistently outperforms Adam and its variants.

Conclusion: AdamX provides improved training stability and potentially better generalization by addressing Adam’s tendency to converge to non-flat minima through adaptive decay rate scheduling.

Abstract: Since the start of the 21st century, artificial intelligence has been leading a new round of industrial revolution. Under the training framework, the optimization algorithm aims to converge stably to local or even global minima of a high-dimensional optimization problem. In the era of large language models, although the scale of model parameters and data has increased, Adam remains the mainstream optimization algorithm. However, compared with stochastic gradient descent (SGD) based optimization algorithms, Adam is more likely to converge to non-flat minima. To address this issue, the AdamX algorithm is proposed. Its core innovation is a novel second-order moment estimation exponential decay rate, which gradually weakens the learning step correction strength as training progresses and degrades to SGD in the stable training period, thereby improving the stability of training in the stable period and possibly enhancing generalization ability. Experimental results show that our second-order moment estimation exponential decay rate is better than the current second-order moment estimation exponential decay rate, and AdamX can stably outperform Adam and its variants in terms of performance. Our code is open-sourced at https://github.com/mengzhu0308/AdamX.

[401] Weather Maps as Tokens: Transformers for Renewable Energy Forecasting

Federico Battini

Main category: cs.LG

TL;DR: A transformer-based approach that treats weather maps as tokens to predict renewable energy, achieving significant improvements over existing forecasts.

DetailsMotivation: Current approaches fail to effectively integrate spatial weather patterns with temporal evolution for accurate renewable energy forecasting, which is essential for grid decarbonization.

Method: Hourly weather maps are encoded as spatial tokens using a lightweight CNN, then processed by a transformer to capture temporal dynamics across a 45-hour forecast horizon.

Result: Evaluation against ENTSO-E operational forecasts shows 60% RMSE reduction for wind and 20% for solar, despite disadvantages in input initialization.

Conclusion: The approach successfully integrates spatial and temporal weather information for improved renewable energy forecasting, with a live dashboard available for daily forecasts.

Abstract: Accurate renewable energy forecasting is essential to reducing dependence on fossil fuels and enabling grid decarbonization. However, current approaches fail to effectively integrate the rich spatial context of weather patterns with their temporal evolution. This work introduces a novel approach that treats weather maps as tokens in transformer sequences to predict renewable energy. Hourly weather maps are encoded as spatial tokens using a lightweight convolutional neural network, and then processed by a transformer to capture temporal dynamics across a 45-hour forecast horizon. Despite disadvantages in input initialization, evaluation against ENTSO-E operational forecasts shows a reduction in RMSE of about 60% and 20% for wind and solar respectively. A live dashboard showing daily forecasts is available at: https://www.sardiniaforecast.ifabfoundation.it.

[402] Full-Atom Peptide Design via Riemannian-Euclidean Bayesian Flow Networks

Hao Qian, Shikui Tu, Lei Xu

Main category: cs.LG

TL;DR: PepBFN is a Bayesian flow network for full atom peptide design that addresses limitations of diffusion models by modeling discrete residue types in continuous space and capturing multimodal side chain distributions.

DetailsMotivation: Current diffusion and flow matching models face challenges: categorical sampling disrupts continuous parameter dynamics, and unimodal assumptions conflict with multimodal side chain rotameric states, limiting performance.

Method: PepBFN models discrete residue types as continuous parameter distributions, uses Gaussian mixture Bayesian flow for multimodal side chains, and Matrix Fisher Riemannian flow for residue orientations on SO(3) manifold.

Result: Experiments on side chain packing, reverse folding, and binder design tasks demonstrate PepBFN’s strong potential in computational peptide design.

Conclusion: PepBFN enables smooth and coherent peptide generation through progressive Bayesian updates of parameter distributions, overcoming key limitations of existing approaches.

Abstract: Diffusion and flow matching models have recently emerged as promising approaches for peptide binder design. Despite their progress, these models still face two major challenges. First, categorical sampling of discrete residue types collapses their continuous parameters into one-hot assignments, while continuous variables (e.g., atom positions) evolve smoothly throughout the generation process. This mismatch disrupts the update dynamics and results in suboptimal performance. Second, current models assume unimodal distributions for side-chain torsion angles, which conflicts with the inherently multimodal nature of side-chain rotameric states and limits prediction accuracy. To address these limitations, we introduce PepBFN, the first Bayesian flow network for full-atom peptide design that directly models parameter distributions in fully continuous space. Specifically, PepBFN models discrete residue types by learning their continuous parameter distributions, enabling joint and smooth Bayesian updates with other continuous structural parameters. It further employs a novel Gaussian mixture based Bayesian flow to capture the multimodal side-chain rotameric states and a Matrix Fisher based Riemannian flow to directly model residue orientations on the $\mathrm{SO}(3)$ manifold. Together, these parameter distributions are progressively refined via Bayesian updates, yielding smooth and coherent peptide generation. Experiments on side-chain packing, reverse folding, and binder design tasks demonstrate the strong potential of PepBFN in computational peptide design.

[403] $π^{*}_{0.6}$: a VLA That Learns From Experience

Physical Intelligence, Ali Amin, Raichelle Aniceto, Ashwin Balakrishna, Kevin Black, Ken Conley, Grace Connors, James Darpinian, Karan Dhabalia, Jared DiCarlo, Danny Driess, Michael Equi, Adnan Esmail, Yunhao Fang, Chelsea Finn, Catherine Glossop, Thomas Godden, Ivan Goryachev, Lachy Groom, Hunter Hancock, Karol Hausman, Gashon Hussein, Brian Ichter, Szymon Jakubczak, Rowan Jen, Tim Jones, Ben Katz, Liyiming Ke, Chandra Kuchi, Marinda Lamb, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Yao Lu, Vishnu Mano, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Allen Z. Ren, Charvi Sharma, Lucy Xiaoyang Shi, Laura Smith, Jost Tobias Springenberg, Kyle Stachowicz, Will Stoeckle, Alex Swerdlow, James Tanner, Marcel Torne, Quan Vuong, Anna Walling, Haohuan Wang, Blake Williams, Sukwon Yoo, Lili Yu, Ury Zhilinsky, Zhiyuan Zhou

Main category: cs.LG

TL;DR: RECAP method enables vision-language-action models to self-improve through real-world RL training using heterogeneous data including demonstrations, on-policy collection, and expert interventions.

DetailsMotivation: To improve vision-language-action models through real-world deployments using reinforcement learning, incorporating various data sources for more effective self-improvement.

Method: RECAP uses advantage-conditioned policies to incorporate heterogeneous data (demonstrations, on-policy data, expert interventions) into RL training, starting with offline RL pre-training then specializing through on-robot data collection.

Result: The method achieves practical performance on real-world tasks: folding laundry in homes, assembling boxes, and making espresso drinks, with more than doubled task throughput and halved failure rates on hardest tasks.

Conclusion: RECAP provides an effective framework for VLA model self-improvement through real-world RL training, demonstrating significant performance gains on complex manipulation tasks.

Abstract: We study how vision-language-action (VLA) models can improve through real-world deployments via reinforcement learning (RL). We present a general-purpose method, RL with Experience and Corrections via Advantage-conditioned Policies (RECAP), that provides for RL training of VLAs via advantage conditioning. Our method incorporates heterogeneous data into the self-improvement process, including demonstrations, data from on-policy collection, and expert teleoperated interventions provided during autonomous execution. RECAP starts by pre-training a generalist VLA with offline RL, which we call $π^{*}_{0.6}$, that can then be specialized to attain high performance on downstream tasks through on-robot data collection. We show that the $π^{*}_{0.6}$ model trained with the full RECAP method can fold laundry in real homes, reliably assemble boxes, and make espresso drinks using a professional espresso machine. On some of the hardest tasks, RECAP more than doubles task throughput and roughly halves the task failure rate.

cs.MA

[404] Area-Optimal Control Strategies for Heterogeneous Multi-Agent Pursuit

Kamal Mammadov, Damith C. Ranasinghe

Main category: cs.MA

TL;DR: A gradient-based control strategy for multi-agent pursuit-evasion games where faster pursuers cooperatively minimize the evader’s safe-reachable set area to guarantee capture.

DetailsMotivation: To develop computationally efficient, real-time control strategies for multi-agent pursuit-evasion games with heterogeneous agent speeds, where pursuers need to cooperatively capture a slower evader.

Method: Define the evader’s safe-reachable set as the intersection of Apollonius circles from each pursuer-evader pair. Formulate capture as a zero-sum game where pursuers minimize this set’s area while the evader maximizes it. Derive analytical gradients of the area with respect to positions to obtain closed-form optimal control laws for agent headings.

Result: The gradient-based controls effectively steer pursuers to systematically shrink the evader’s safe region, leading to guaranteed capture. The strategies are computationally efficient and suitable for real-time implementation.

Conclusion: The area-minimization approach provides a clear geometric objective for cooperative capture in multi-agent pursuit-evasion games with heterogeneous speeds, offering computationally efficient and effective control strategies.

Abstract: This paper presents a novel strategy for a multi-agent pursuit-evasion game involving multiple faster pursuers with heterogeneous speeds and a single slower evader. We define a geometric region, the evader’s safe-reachable set, as the intersection of Apollonius circles derived from each pursuer-evader pair. The capture strategy is formulated as a zero-sum game where the pursuers cooperatively minimize the area of this set, while the evader seeks to maximize it, effectively playing a game of spatial containment. By deriving the analytical gradients of the safe-reachable set’s area with respect to agent positions, we obtain closed-form, instantaneous optimal control laws for the heading of each agent. These strategies are computationally efficient, allowing for real-time implementation. Simulations demonstrate that the gradient-based controls effectively steer the pursuers to systematically shrink the evader’s safe region, leading to guaranteed capture. This area-minimization approach provides a clear geometric objective for cooperative capture.
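
The geometric building block here can be sketched in a few lines. For an evader at E, a pursuer at P, and a speed ratio k = (evader speed) / (pursuer speed) with 0 < k < 1, the points X satisfying |X - E| = k |X - P| form a circle with center (E - k²P) / (1 - k²) and radius k |E - P| / (1 - k²). The code below computes that circle and tests membership in the intersection of the corresponding discs. The function names are hypothetical, and the paper's area gradients and control laws are not reproduced.

```python
import math


def apollonius_circle(evader, pursuer, speed_ratio):
    """Circle of points X with |X - evader| = speed_ratio * |X - pursuer|.

    speed_ratio = evader_speed / pursuer_speed, assumed to lie in (0, 1).
    Returns (center, radius).
    """
    if not 0.0 < speed_ratio < 1.0:
        raise ValueError("speed_ratio must be in (0, 1) for a faster pursuer")
    k2 = speed_ratio ** 2
    ex, ey = evader
    px, py = pursuer
    center = ((ex - k2 * px) / (1.0 - k2), (ey - k2 * py) / (1.0 - k2))
    radius = speed_ratio * math.hypot(ex - px, ey - py) / (1.0 - k2)
    return center, radius


def in_safe_reachable_set(point, evader, pursuers):
    """True if the evader can reach `point` no later than every pursuer.

    pursuers: list of (position, speed_ratio) pairs.
    """
    for position, ratio in pursuers:
        center, radius = apollonius_circle(evader, position, ratio)
        if math.dist(point, center) > radius:
            return False
    return True


if __name__ == "__main__":
    print(apollonius_circle((0.0, 0.0), (3.0, 0.0), 0.5))
```

Each pursuer with its own speed ratio contributes one disc, so heterogeneous speeds only change the per-pursuer `speed_ratio` argument.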

[405] Distributed primal-dual algorithm for constrained multi-agent reinforcement learning under coupled policies

Pengcheng Dai, He Wang, Dongming Wang, Wenwu Yu

Main category: cs.MA

TL;DR: A distributed primal-dual algorithm for constrained multi-agent reinforcement learning where agents maximize local objectives while satisfying safety constraints using limited neighborhood information exchange.

DetailsMotivation: To address constrained multi-agent reinforcement learning problems where agents need to collaborate while satisfying individual safety constraints, with limited communication and enhanced security.

Method: Proposed a framework with coupled policies based on local states and neighbor parameters, using distributed primal-dual algorithm with local estimates and time-varying networks for secure information exchange.

Result: The algorithm achieves ε-first-order stationary convergence with approximation error O(γ^{(κ+1)/κ_p}) for discount factor γ∈(0,1), validated through GridWorld simulations.

Conclusion: The proposed distributed approach effectively solves constrained multi-agent reinforcement learning problems with limited communication and enhanced security while maintaining convergence guarantees.

Abstract: In this work, we investigate constrained multi-agent reinforcement learning (CMARL), where agents collaboratively maximize the sum of their local objectives while satisfying individual safety constraints. We propose a framework where agents adopt coupled policies that depend on both local states and parameters, as well as those of their $κ_p$-hop neighbors, with $κ_p>0$ denoting the coupling distance. A distributed primal-dual algorithm is further developed under this framework, wherein each agent has access only to state-action pairs within its $2κ_p$-hop neighborhood and to reward information within its $κ+ 2κ_p$-hop neighborhood, with $κ> 0$ representing the truncation distance. Moreover, agents are not permitted to directly share their true policy parameters or Lagrange multipliers. Instead, each agent constructs and maintains local estimates of these variables for other agents and employs such estimates to execute its policy. Additionally, these estimates are updated and exchanged exclusively through independent, time-varying networks, which enhances the overall system security. We establish that, with high probability, our algorithm can achieve an $ε$-first-order stationary convergence with an approximation error of $\mathcal{O}(γ^{\frac{κ+1}{κ_{p}}})$ for discount factor $γ\in(0,1)$. Finally, simulations in a GridWorld environment are conducted to demonstrate the effectiveness of the proposed algorithm.
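
For readers unfamiliar with primal-dual updates, the sketch below shows the basic pattern on a single-variable toy problem: gradient ascent on the Lagrangian in the primal variable, and projected ascent on the constraint violation in the Lagrange multiplier. It is a centralized, deterministic illustration only. The distributed estimates, coupled policies, and time-varying networks of the paper are not modeled, and `primal_dual` is a hypothetical name.

```python
def primal_dual(objective_grad, constraint, constraint_grad, x0,
                step=0.05, iterations=2000):
    """Maximize f(x) subject to g(x) <= 0 via the Lagrangian f(x) - lam * g(x).

    Primal step: ascend the Lagrangian in x.
    Dual step: raise lam when the constraint is violated, keeping lam >= 0.
    """
    x, lam = x0, 0.0
    for _ in range(iterations):
        grad_x = objective_grad(x) - lam * constraint_grad(x)
        new_x = x + step * grad_x
        lam = max(0.0, lam + step * constraint(x))  # projection onto lam >= 0
        x = new_x
    return x, lam


if __name__ == "__main__":
    # Toy problem (an assumption for illustration):
    # maximize -(x - 2)^2 subject to x - 1 <= 0.
    x, lam = primal_dual(
        objective_grad=lambda x: -2.0 * (x - 2.0),
        constraint=lambda x: x - 1.0,
        constraint_grad=lambda x: 1.0,
        x0=0.0,
    )
    print(x, lam)
```

On this toy problem the unconstrained optimum x = 2 is infeasible, so the iterates settle at the boundary x = 1 with multiplier 2, the value at which the Lagrangian gradient vanishes.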

[406] Adversarial Attack on Black-Box Multi-Agent by Adaptive Perturbation

Jianming Chen, Yawen Wang, Junjie Wang, Xiaofei Xie, Yuanzhe Hu, Qing Wang, Fanjiang Xu

Main category: cs.MA

TL;DR: AdapAM is a novel adversarial attack framework for black-box multi-agent systems that balances effectiveness and stealthiness through adaptive victim selection and proxy-based perturbation.

DetailsMotivation: Existing adversarial attack frameworks for multi-agent systems have limitations including impractical white-box requirements, lack of stealthiness, and ineffective targeting strategies.

Method: AdapAM uses adaptive selection policy to choose victims and determine malicious actions, and proxy-based perturbation with generative adversarial imitation learning to approximate target systems and generate perturbations.

Result: AdapAM achieves best attack performance across eight multi-agent environments with different perturbation rates, and generates least noisy, hardest-to-detect perturbations.

Conclusion: AdapAM effectively addresses limitations of existing frameworks by providing a practical, stealthy, and effective adversarial attack method for black-box multi-agent systems.

Abstract: Evaluating security and reliability for multi-agent systems (MAS) is urgent as they become increasingly prevalent in various applications. As an evaluation technique, existing adversarial attack frameworks face certain limitations, e.g., impracticality due to the requirement of white-box information or high control authority, and a lack of stealthiness or effectiveness as they often target all agents or specific fixed agents. To address these issues, we propose AdapAM, a novel framework for adversarial attacks on black-box MAS. AdapAM incorporates two key components: (1) Adaptive Selection Policy simultaneously selects the victim and determines the anticipated malicious action (the action that would lead to the worst impact on the MAS), balancing effectiveness and stealthiness. (2) Proxy-based Perturbation to Induce Malicious Action utilizes generative adversarial imitation learning to approximate the target MAS, allowing AdapAM to generate perturbed observations using white-box information and thus induce victims to execute the malicious action in black-box settings. We evaluate AdapAM across eight multi-agent environments and compare it with four state-of-the-art and commonly used baselines. Results demonstrate that AdapAM achieves the best attack performance across different perturbation rates. In addition, AdapAM-generated perturbations are the least noisy and hardest to detect, underscoring their stealthiness.

[407] Assemble Your Crew: Automatic Multi-agent Communication Topology Design via Autoregressive Graph Generation

Shiyuan Li, Yixin Liu, Qingsong Wen, Chengqi Zhang, Shirui Pan

Main category: cs.MA

TL;DR: ARG-Designer reframes multi-agent system design as conditional autoregressive graph generation, creating customized collaboration topologies from scratch based on task requirements.

DetailsMotivation: Existing MAS design approaches are limited by template-based graph modification with predefined agents and hard-coded interactions, restricting adaptability to task-specific needs.

Method: Proposes ARG-Designer, an autoregressive model that sequentially determines agent count, selects roles from an extensible pool, and establishes optimal communication links conditioned on natural language task queries.

Result: Achieves state-of-the-art performance across six benchmarks with significantly greater token efficiency and enhanced extensibility compared to existing methods.

Conclusion: The generative approach enables flexible and extensible MAS design, creating customized topologies precisely tailored to different task requirements.

Abstract: Multi-agent systems (MAS) based on large language models (LLMs) have emerged as a powerful solution for dealing with complex problems across diverse domains. The effectiveness of MAS is critically dependent on its collaboration topology, which has become a focal point for automated design research. However, existing approaches are fundamentally constrained by their reliance on a template graph modification paradigm with a predefined set of agents and hard-coded interaction structures, significantly limiting their adaptability to task-specific requirements. To address these limitations, we reframe MAS design as a conditional autoregressive graph generation task, where both the system composition and structure are designed jointly. We propose ARG-Designer, a novel autoregressive model that operationalizes this paradigm by constructing the collaboration graph from scratch. Conditioned on a natural language task query, ARG-Designer sequentially and dynamically determines the required number of agents, selects their appropriate roles from an extensible pool, and establishes the optimal communication links between them. This generative approach creates a customized topology in a flexible and extensible manner, precisely tailored to the unique demands of different tasks. Extensive experiments across six diverse benchmarks demonstrate that ARG-Designer not only achieves state-of-the-art performance but also enjoys significantly greater token efficiency and enhanced extensibility. The source code of ARG-Designer is available at https://github.com/Shiy-Li/ARG-Designer.

[408] S-DAG: A Subject-Based Directed Acyclic Graph for Multi-Agent Heterogeneous Reasoning

Jiangwen Dong, Zehui Lin, Wanyu Lin, Mingjin Zhang

Main category: cs.MA

TL;DR: Proposes a subject-level multi-agent framework using Graph Neural Networks to identify relevant subjects and their dependencies, then matches specialized LLMs to each subject for structured collaboration, outperforming task-level approaches on multi-subject reasoning tasks.

DetailsMotivation: Existing mixture-of-experts approaches operate at task level, which is too coarse for heterogeneous problems involving multiple subjects. Need fine-grained analysis at subject level to effectively solve complex multi-subject reasoning tasks.

Method: 1) Use GNN to identify relevant subjects and generate Subject-based DAG (S-DAG) with subject nodes and information flow edges. 2) Profile LLMs with subject-specific expertise scores. 3) Match top-performing models to S-DAG subjects. 4) Enable graph-structured multi-agent collaboration with information flowing through S-DAG.

Result: Significantly outperforms existing task-level model selection and multi-agent collaboration baselines in accuracy and efficiency on curated multi-subject subsets of MMLU-Pro, GPQA, and MedMCQA benchmarks.

Conclusion: Subject-aware reasoning with structured collaboration is highly effective for addressing complex multi-subject problems, demonstrating the value of fine-grained subject-level analysis over coarse task-level approaches.

Abstract: Large Language Models (LLMs) have achieved impressive performance in complex reasoning problems. Their effectiveness highly depends on the specific nature of the task, especially the required domain knowledge. Existing approaches, such as mixture-of-experts, typically operate at the task level; they are too coarse to effectively solve heterogeneous problems involving multiple subjects. This work proposes a novel framework that performs fine-grained analysis at the subject level, equipped with a designated multi-agent collaboration strategy for addressing heterogeneous problem reasoning. Specifically, given an input query, we first employ a Graph Neural Network to identify the relevant subjects and infer their interdependencies to generate a Subject-based Directed Acyclic Graph (S-DAG), where nodes represent subjects and edges encode information flow. Then we profile the LLMs by assigning each model a subject-specific expertise score, and select the top-performing one for each corresponding subject of the S-DAG. Such subject-model matching enables graph-structured multi-agent collaboration where information flows from the starting model to the ending model over the S-DAG. We curate and release multi-subject subsets of standard benchmarks (MMLU-Pro, GPQA, MedMCQA) to better reflect complex, real-world reasoning tasks. Extensive experiments show that our approach significantly outperforms existing task-level model selection and multi-agent collaboration baselines in accuracy and efficiency. These results highlight the effectiveness of subject-aware reasoning and structured collaboration in addressing complex and multi-subject problems.
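
The matching and collaboration steps can be sketched as follows, assuming the S-DAG and the expertise scores are already available (the GNN that produces the graph is out of scope here). Every name below, including `assign_models`, `run_s_dag`, and the `call_model` callback, is hypothetical and stands in for whatever interface the authors actually use.

```python
from graphlib import TopologicalSorter


def assign_models(subjects, expertise):
    """Pick, for each subject, the model with the highest expertise score.

    expertise: {model_name: {subject: score}}; missing scores count as 0.
    """
    return {
        s: max(expertise, key=lambda m: expertise[m].get(s, 0.0))
        for s in subjects
    }


def run_s_dag(predecessors, expertise, call_model, query):
    """Visit subjects in topological order, passing upstream outputs downstream.

    predecessors: {subject: set of subjects whose output it consumes}.
    call_model(model, subject, query, context) -> output (stand-in for an LLM call).
    """
    order = list(TopologicalSorter(predecessors).static_order())
    assignment = assign_models(order, expertise)
    outputs = {}
    for subject in order:
        context = [outputs[p] for p in sorted(predecessors.get(subject, ()))]
        outputs[subject] = call_model(assignment[subject], subject, query, context)
    return order, assignment, outputs


if __name__ == "__main__":
    graph = {"chemistry": {"math"}, "medicine": {"chemistry"}}
    scores = {
        "model_a": {"math": 0.9, "chemistry": 0.4, "medicine": 0.3},
        "model_b": {"math": 0.5, "chemistry": 0.8, "medicine": 0.7},
    }
    stub = lambda model, subject, query, context: f"{model}:{subject}({len(context)})"
    print(run_s_dag(graph, scores, stub, "example query"))
```

Processing nodes in topological order guarantees that every subject sees the outputs of the subjects it depends on before its own model is called.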

cs.MM

[409] ChartEditor: A Reinforcement Learning Framework for Robust Chart Editing

Liangyu Chen, Yichen Xu, Jianzhe Ma, Yuqi Liu, Donglu Yang, Liang Zhang, Wenxuan Wang, Qin Jin

Main category: cs.MM

TL;DR: ChartEditVista is a comprehensive benchmark with 7,964 chart editing samples across 31 categories, using only images and natural language instructions without original code. ChartEditor model trained with reinforcement learning and rendering reward outperforms other models on chart editing tasks.

DetailsMotivation: Existing chart editing benchmarks are limited in data diversity and assume access to complete chart code, which is unrealistic in real-world scenarios where only chart images and natural language instructions are available.

Method: Created ChartEditVista through automated pipeline for chart generation, editing, and verification. Developed ChartEditor model using reinforcement learning with rendering reward to ensure code executability and visual fidelity. Introduced layout and text metrics for fine-grained evaluation.

Result: ChartEditVista provides robust evaluation with 7,964 diverse samples. ChartEditor consistently outperforms similar-scale and larger-scale models on chart editing tasks, as demonstrated through extensive experiments and human evaluations.

Conclusion: ChartEditVista addresses the gap in realistic chart editing benchmarks, while ChartEditor with its novel reinforcement learning approach achieves superior performance in chart editing tasks without requiring original chart code.

Abstract: Chart editing reduces manual effort in visualization design. Typical benchmarks are limited in data diversity and assume access to complete chart code, which is seldom available in real-world scenarios. To address this gap, we present ChartEditVista, a comprehensive benchmark consisting of 7,964 samples spanning 31 chart categories. It encompasses diverse editing instructions and covers nearly all editable chart elements. The inputs in ChartEditVista include only the original chart image and natural language editing instructions, without the original chart code. ChartEditVista is generated through a fully automated pipeline that produces, edits, and verifies charts, ensuring high-quality chart editing data. In addition, we introduce two novel fine-grained, rule-based evaluation metrics: the layout metric, which evaluates the position, size and color of graphical components; and the text metric, which jointly assesses textual content and font styling. Building on ChartEditVista, we present ChartEditor, a model trained using a reinforcement learning framework that incorporates a novel rendering reward to simultaneously enforce code executability and visual fidelity. Through extensive experiments and human evaluations, we demonstrate that ChartEditVista provides a robust evaluation, while ChartEditor consistently outperforms similar-scale and larger-scale models on chart editing tasks.

eess.AS

[410] Quality-Controlled Multimodal Emotion Recognition in Conversations with Identity-Based Transfer Learning and MAMBA Fusion

Zanxu Wang, Homayoon Beigi

Main category: eess.AS

TL;DR: Systematic quality control and multi-stage transfer learning improve multimodal emotion recognition by leveraging identity-based embeddings from speaker and face recognition combined with emotion-tuned text representations.

DetailsMotivation: Address data quality issues in multimodal emotion recognition in conversation (MERC) by implementing systematic quality control and leveraging transfer learning from identity recognition tasks.

Method: Quality control pipeline for datasets (MELD, IEMOCAP) validating speaker identity, audio-text alignment, and face detection. Transfer learning from speaker and face recognition using RecoMadeEasy engines, fine-tuning MPNet-v2 for text, and MAMBA-based trimodal fusion.

Result: Achieved 64.8% accuracy on MELD and 74.3% on IEMOCAP, showing consistent competitive performance for multimodal emotion recognition.

Conclusion: Combining identity-based audio/visual embeddings with emotion-tuned text representations on quality-controlled data yields strong performance and provides basis for improving recognition of challenging, low-frequency emotion classes.

Abstract: This paper addresses data quality issues in multimodal emotion recognition in conversation (MERC) through systematic quality control and multi-stage transfer learning. We implement a quality control pipeline for the MELD and IEMOCAP datasets that validates speaker identity, audio-text alignment, and face detection. We leverage transfer learning from speaker and face recognition, assuming that identity-discriminative embeddings capture not only stable acoustic and facial traits but also person-specific patterns of emotional expression. We employ RecoMadeEasy(R) engines for extracting 512-dimensional speaker and face embeddings, fine-tune MPNet-v2 for emotion-aware text representations, and adapt these features through emotion-specific MLPs trained on unimodal datasets. MAMBA-based trimodal fusion achieves 64.8% accuracy on MELD and 74.3% on IEMOCAP. These results show that combining identity-based audio and visual embeddings with emotion-tuned text representations on a quality-controlled subset of data yields consistent competitive performance for multimodal emotion recognition in conversation and provides a basis for further improvement on challenging, low-frequency emotion classes.

[411] CASTELLA: Long Audio Dataset with Captions and Temporal Boundaries

Hokuto Munakata, Takehiro Imamura, Taichi Nishimura, Tatsuya Komatsu

Main category: eess.AS

TL;DR: CASTELLA is a large-scale human-annotated audio benchmark for audio moment retrieval (AMR) that addresses limitations of previous synthetic datasets and small evaluation sets.

DetailsMotivation: Previous AMR research used synthetic datasets and small evaluation sets (under 100 samples), leading to unreliable performance metrics. There was no established benchmark with real-world data for practical applications.

Method: Created CASTELLA - a manually annotated AMR dataset with 1,009 training, 213 validation, and 640 test audio recordings (24x larger than previous datasets). Established baseline models including fine-tuning on CASTELLA after pre-training on synthetic data.

Result: Models fine-tuned on CASTELLA after pre-training on synthetic data outperformed models trained solely on synthetic data by 10.4 points in the Recall1@0.7 metric.

Conclusion: CASTELLA provides a reliable benchmark for AMR with real-world data, enabling more accurate performance evaluation and model development. The dataset is publicly available.

Abstract: We introduce CASTELLA, a human-annotated audio benchmark for the task of audio moment retrieval (AMR). Although AMR has various useful potential applications, there is still no established benchmark with real-world data. Early AMR work trained models solely on synthetic datasets. Moreover, evaluation was based on an annotated dataset of fewer than 100 samples, which made the reported performance less reliable. To ensure performance for applications in real-world environments, we present CASTELLA, a large-scale manually annotated AMR dataset. CASTELLA consists of 1,009, 213, and 640 audio recordings for the training, validation, and test splits, respectively, which is 24 times larger than the previous dataset. We also establish a baseline model for AMR using CASTELLA. Our experiments demonstrate that a model fine-tuned on CASTELLA after pre-training on the synthetic data outperformed a model trained solely on the synthetic data by 10.4 points in Recall1@0.7. CASTELLA is publicly available at https://h-munakata.github.io/CASTELLA-demo/.
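
The abstract does not define Recall1@0.7. Under the convention common in moment retrieval (an assumption here), it is the percentage of queries whose top-ranked predicted segment overlaps the ground-truth segment with a temporal IoU of at least 0.7. A minimal sketch under that assumption follows; the function names and sample segments are hypothetical and not taken from the CASTELLA code.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) segments, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall1_at_iou(top1_preds, gts, threshold=0.7):
    """Percentage of queries whose top-1 prediction reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(top1_preds, gts))
    return 100.0 * hits / len(gts)


# Hypothetical top-1 predictions and ground-truth moments (seconds).
preds = [(2.0, 10.0), (0.0, 5.0), (30.0, 40.0)]
gts = [(2.5, 10.0), (3.0, 9.0), (30.0, 41.0)]
score = recall1_at_iou(preds, gts)  # two of the three queries clear IoU 0.7
```

Because the metric is a hit rate over queries, it is sensitive to the size of the evaluation set, which is consistent with the abstract's concern about evaluation sets of fewer than 100 samples.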

[412] Auden-Voice: General-Purpose Voice Encoder for Speech and Language Understanding

Mingyue Huo, Wei-Cheng Tseng, Yiwen Shao, Hao Zhang, Dong Yu

Main category: eess.AS

TL;DR: Auden-Voice is a general-purpose voice encoder that balances identity and paralinguistic cues through multi-task training, outperforming CLAP-based approaches.

Motivation: Current audio-language models fail to adequately balance voice identity and paralinguistic information in their encoders.

Method: Multi-task training approach to create balanced voice representations, compared against contrastive language-audio pretraining (CLAP).

Result: Multi-task training yields the most balanced representations, while CLAP mainly improves retrieval without enhancing paralinguistic understanding. Auden-Voice shows strong performance when integrated with LLMs.

Conclusion: Multi-task training is superior for creating general-purpose voice encoders that capture nuanced voice cues, and Auden-Voice demonstrates effective integration with language models.

Abstract: Human voice encodes both identity and paralinguistic cues, yet encoders in large audio-language models (LALMs) rarely balance both aspects. In this work, we present a study toward building a general-purpose voice encoder that captures nuanced voice cues. Through a comprehensive evaluation, we find that multi-task training yields the most balanced representations, whereas contrastive language-audio pretraining (CLAP) primarily improves retrieval without enhancing paralinguistic understanding. Our final encoder, Auden-Voice, also demonstrates strong performance when integrated with LLMs. The code and training recipes will be released with the audio understanding toolkit Auden.

[413] Scene-wide Acoustic Parameter Estimation

Ricardo Falcon-Perez, Ruohan Gao, Gregor Mueckl, Sebastia V. Amengual Gari, Ishwarya Ananthabhotla

Main category: eess.AS

TL;DR: A method to infer spatially-distributed acoustic parameters from 2D floormaps using image-to-image translation, conditioned on a calibration RIR measurement, for AR/VR applications.

Motivation: Accurate acoustic scene characterization is critical for AR/VR immersion, but direct RIR estimation from geometry is challenging and data-expensive.

Method: Image-to-image translation approach that transforms 2D floormaps into acoustic parameter heatmaps, conditioned on a calibration RIR measurement. Also supports directionally-dependent parameter prediction.

Result: Method demonstrates improvements over strong statistical baselines and works for beamformed parameter prediction. A 1000-room complex-scene dataset was created for evaluation.

Conclusion: Proposed approach provides efficient acoustic parameter estimation from lightweight scene information available in AR/VR contexts, overcoming data-expensive RIR estimation challenges.

Abstract: For augmented (AR) and virtual reality (VR) applications, accurate estimates of the acoustic characteristics of a scene are critical for creating a sense of immersion. However, directly estimating Room-impulse Responses (RIRs) from scene geometry is often a challenging, data-expensive task. We propose a method to instead infer spatially-distributed acoustic parameters (such as C50, T60, etc) for an entire scene from lightweight information readily available in an AR/VR context. We consider an image-to-image translation task to transform a 2D floormap, conditioned on a calibration RIR measurement, into 2D heatmaps of acoustic parameters. Moreover, we show that the method also works for directionally-dependent (i.e. beamformed) parameter prediction. We introduce and release a 1000-room, complex-scene dataset to study the task, and demonstrate improvements over strong statistical baselines.
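
For readers unfamiliar with the parameters named above, C50 is the standard clarity index: the ratio, in decibels, of the energy arriving in the first 50 ms of a room impulse response to the energy arriving afterwards. The sketch below computes it for a single synthetic RIR. The exponential-decay RIR, the sample rate, and the function name are illustrative assumptions, and the code is unrelated to the paper's image-to-image model, which predicts such values as heatmaps.

```python
import math


def c50_db(rir, sample_rate):
    """Clarity index C50: early (0-50 ms) to late (after 50 ms) energy ratio in dB."""
    split = int(round(0.050 * sample_rate))
    early = sum(x * x for x in rir[:split])
    late = sum(x * x for x in rir[split:])
    return 10.0 * math.log10(early / late)


# Synthetic RIR (assumption): a pure exponential decay starting at sample 0,
# with the amplitude envelope chosen so that energy drops 60 dB after t60 seconds.
sample_rate = 8000
t60 = 0.5
decay = 3.0 * math.log(10.0) / t60  # amplitude decay rate in 1/s
rir = [math.exp(-decay * n / sample_rate) for n in range(sample_rate)]
clarity = c50_db(rir, sample_rate)
```

For this idealized decay, C50 and T60 are tied together analytically, which is one reason a single calibration RIR can carry useful information about a whole scene; measured RIRs deviate from the ideal and need the per-location estimates the paper targets.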

[414] Bridging the Modality Gap: Softly Discretizing Audio Representation for LLM-based Automatic Speech Recognition

Mu Yang, Szu-Jui Chen, Jiamin Xie, John Hansen

Main category: eess.AS

TL;DR: Proposes vector quantization integration with LLMs for speech recognition to bridge the gap between continuous audio and discrete LLM tokens, using soft discretization for better alignment.

Motivation: Address the discrepancy between continuous audio data and discrete token-based LLMs in speech recognition systems.

Method: Integrate vector quantization using the LLM embedding table as codebook, create soft discretization through codebook updates and weighted sum of embeddings.

Result: Significant improvement over LLM-based ASR baseline, especially in out-of-domain conditions.

Conclusion: Soft discretization shows potential as an effective modality bridge in LLM-based automatic speech recognition.

Abstract: One challenge of integrating speech input with large language models (LLMs) stems from the discrepancy between the continuous nature of audio data and the discrete token-based paradigm of LLMs. To mitigate this gap, we propose a method for integrating vector quantization (VQ) into LLM-based automatic speech recognition (ASR). Using the LLM embedding table as the VQ codebook, the VQ module aligns the continuous representations from the audio encoder with the discrete LLM inputs, enabling the LLM to operate on a discretized audio representation that better reflects the linguistic structure. We further create a soft “discretization” of the audio representation by updating the codebook and performing a weighted sum over the codebook embeddings. Empirical results demonstrate that our proposed method significantly improves upon the LLM-based ASR baseline, particularly in out-of-domain conditions. This work highlights the potential of soft discretization as a modality bridge in LLM-based ASR.
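
The "weighted sum over the codebook embeddings" can be illustrated with a toy example. The sketch below assumes dot-product similarity and a softmax with a temperature parameter; neither choice is stated in the abstract, and the two-dimensional codebook is a hypothetical stand-in for an LLM embedding table. It does not show the codebook update the abstract also mentions.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_discretize(frame, codebook, temperature=1.0):
    """Replace one audio frame with a softmax-weighted sum of codebook rows."""
    scores = [sum(f * c for f, c in zip(frame, row)) for row in codebook]
    weights = softmax(scores, temperature)
    mixed = [
        sum(w * row[d] for w, row in zip(weights, codebook))
        for d in range(len(frame))
    ]
    return mixed, weights


# Toy 3-entry codebook standing in for an LLM embedding table (hypothetical values).
codebook = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
frame = [0.9, 0.1]
soft, weights = soft_discretize(frame, codebook, temperature=1.0)
hard, hard_weights = soft_discretize(frame, codebook, temperature=0.01)
```

With a very low temperature the weights collapse onto the nearest codebook entry, recovering hard vector quantization; higher temperatures keep a blend of several token embeddings, which is the "soft" behavior described above.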

[415] Efficient and Generalizable Speaker Diarization via Structured Pruning of Self-Supervised Models

Jiangyu Han, Petr Pálka, Marc Delcroix, Federico Landini, Johan Rohdin, Jan Cernocký, Lukáš Burget

Main category: eess.AS

TL;DR: Systematic study compressing SSL-based speaker diarization models through structured pruning guided by knowledge distillation, achieving 80% size reduction and 4x faster inference without performance loss.

Motivation: High computational and memory costs of SSL models like WavLM hinder deployment in real-time and resource-constrained scenarios for speaker diarization.

Method: Structured pruning guided by knowledge distillation, with pruning objectives that target both model parameters and computational complexity; a simple overall pruning approach gives the best balance between efficiency and accuracy.

Result: Achieved up to 80% model size reduction and 4x faster inference without performance degradation across eight public diarization datasets, with strong out-of-domain generalization on CHiME-6.

Conclusion: Structured pruning guided by distillation can yield efficient and generalizable diarization systems suitable for real-world applications.

Abstract: Self-supervised learning (SSL) models such as WavLM have substantially advanced speaker diarization by providing rich contextual speech representations. However, the high computational and memory costs of these models hinder deployment in real-time and resource-constrained scenarios. This work presents a systematic study on compressing SSL-based diarization models through structured pruning guided by knowledge distillation. We investigate pruning objectives that target both model parameters and computational complexity, and analyze alternative strategies, showing that a simple overall pruning approach provides the best balance between efficiency and accuracy. Our method achieves up to 80% model size reduction and 4x faster inference without performance degradation. Comprehensive experiments across eight public diarization datasets demonstrate that the pruned models consistently match or surpass the performance of their uncompressed counterparts. Furthermore, we show strong out-of-domain generalization on the CHiME-6 dataset, achieving accuracy comparable to the top systems in the CHiME-7 challenge without any domain adaptation. These results highlight that structured pruning, when guided by distillation, can yield efficient and generalizable diarization systems suitable for real-world applications.

eess.IV

[416] Application of Graph Based Vision Transformers Architectures for Accurate Temperature Prediction in Fiber Specklegram Sensors

Abhishek Sebastian

Main category: eess.IV

TL;DR: Transformer models (ViTs, Swin, LINA-ViT, MAP-ViGAT) outperform CNNs for temperature prediction from fiber specklegram data, achieving MAE of 1.15°C with enhanced interpretability via XAI techniques.

Motivation: Fiber Specklegram Sensors are effective for environmental monitoring but face challenges due to nonlinear specklegram data, requiring advanced models for accurate temperature prediction.

Method: Used transformer architectures including Vision Transformers, Swin Transformers, LINA-ViT, and MAP-ViGAT to predict temperature from specklegram data (0-120°C range), incorporating XAI techniques for interpretability.

Result: ViTs achieved best performance with MAE of 1.15°C, outperforming CNNs. GAT-ViT and MAP-ViGAT variants also showed competitive accuracy, demonstrating effectiveness of adaptive attention and graph-based structures.

Conclusion: Transformer architectures establish strong benchmarks for optical fiber temperature sensing and offer promising directions for industrial monitoring and structural health assessment applications.

Abstract: Fiber Specklegram Sensors (FSS) are highly effective for environmental monitoring, particularly for detecting temperature variations. However, the nonlinear nature of specklegram data presents significant challenges for accurate temperature prediction. This study investigates the use of transformer-based architectures, including Vision Transformers (ViTs), Swin Transformers, and emerging models such as Learnable Importance Non-Symmetric Attention Vision Transformers (LINA-ViT) and Multi-Adaptive Proximity Vision Graph Attention Transformers (MAP-ViGAT), to predict temperature from specklegram data over a range of 0 to 120 Celsius. The results show that ViTs achieved a Mean Absolute Error (MAE) of 1.15, outperforming traditional models such as CNNs. GAT-ViT and MAP-ViGAT variants also demonstrated competitive accuracy, highlighting the importance of adaptive attention mechanisms and graph-based structures in capturing complex modal interactions and phase shifts in specklegram data. Additionally, this study incorporates Explainable AI (XAI) techniques, including attention maps and saliency maps, to provide insights into the decision-making processes of the transformer models, improving interpretability and transparency. These findings establish transformer architectures as strong benchmarks for optical fiber-based temperature sensing and offer promising directions for industrial monitoring and structural health assessment applications.

[417] Fully Differentiable dMRI Streamline Propagation in PyTorch

Jongyeon Yoon, Elyssa M. McMaster, Michael E. Kim, Gaurav Rudravaram, Kurt G. Schilling, Bennett A. Landman, Daniel Moyer

Main category: eess.IV

TL;DR: Proposes a fully differentiable tractography method using PyTorch that maintains numerical fidelity with standard streamline algorithms while enabling gradient flow.

Motivation: Existing tractography methods are non-differentiable, limiting their integration into end-to-end deep learning frameworks for brain connectivity analysis.

Method: Developed a PyTorch-engineered streamline propagator with no gradient-blocking components, making the entire tractography process fully differentiable.

Result: The method matches standard propagators’ performance while remaining differentiable, enabling integration into deep learning workflows.

Conclusion: This differentiable approach enables deeper integration of tractography into deep learning, creating a new category of macrostructural reasoning that is both computationally robust and scientifically rigorous.

Abstract: Diffusion MRI (dMRI) provides a distinctive means to probe the microstructural architecture of living tissue, facilitating applications such as brain connectivity analysis, modeling across multiple conditions, and the estimation of macrostructural features. Tractography, which emerged in the final years of the 20th century and accelerated in the early 21st century, is a technique for visualizing white matter pathways in the brain using dMRI. Most diffusion tractography methods rely on procedural streamline propagators or global energy minimization methods. Although recent advancements in deep learning have enabled tasks that were previously challenging, existing tractography approaches are often non-differentiable, limiting their integration in end-to-end learning frameworks. While progress has been made in representing streamlines in differentiable frameworks, no existing method offers fully differentiable propagation. In this work, we propose a fully differentiable solution that retains numerical fidelity with a leading streamline algorithm. The key is that our PyTorch-engineered streamline propagator has no components that block gradient flow, making it fully differentiable. We show that our method matches standard propagators while remaining differentiable. By translating streamline propagation into a differentiable PyTorch framework, we enable deeper integration of tractography into deep learning workflows, laying the foundation for a new category of macrostructural reasoning that is not only computationally robust but also scientifically rigorous.

[418] Image Denoising Using Transformed L1 (TL1) Regularization via ADMM

Nabiha Choudhury, Jianqing Jia, Yifei Lou

Main category: eess.IV

TL;DR: TL1 regularization for image denoising outperforms traditional TV by reducing staircase artifacts and preserving contrast using ADMM optimization.

Motivation: Traditional TV regularization with convex l1 formulation causes staircase artifacts and contrast loss in image denoising.

Method: Transformed l1 (TL1) regularizer applied to image gradients, solved using ADMM with closed-form TL1 proximal operator and FFT-based image update under periodic boundary conditions.

Result: Superior denoising performance with effective noise suppression while preserving edges and enhancing image contrast.

Conclusion: TL1 regularization is an effective alternative to traditional TV for image denoising, addressing key limitations like staircase artifacts and contrast loss.

Abstract: Total variation (TV) regularization is a classical tool for image denoising, but its convex $\ell_1$ formulation often leads to staircase artifacts and loss of contrast. To address these issues, we introduce the Transformed $\ell_1$ (TL1) regularizer applied to image gradients. In particular, we develop a TL1-regularized denoising model and solve it using the Alternating Direction Method of Multipliers (ADMM), featuring a closed-form TL1 proximal operator and an FFT-based image update under periodic boundary conditions. Experimental results demonstrate that our approach achieves superior denoising performance, effectively suppressing noise while preserving edges and enhancing image contrast.
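
The abstract does not restate the TL1 penalty. In the form commonly used in the transformed $\ell_1$ literature (an assumption here, not taken from this abstract), it is $\rho_a(t) = (a+1)|t| / (a+|t|)$ for a parameter $a > 0$. The sketch below evaluates that penalty on image gradients using periodic forward differences. The anisotropic sum and the function names are illustrative choices, and the ADMM solver and the closed-form proximal operator from the paper are not reproduced.

```python
def tl1(t, a):
    """Transformed l1 penalty rho_a(t) = (a + 1)|t| / (a + |t|), for a > 0."""
    t = abs(t)
    return (a + 1.0) * t / (a + t)


def tl1_gradient_penalty(image, a):
    """Sum of TL1 penalties over horizontal and vertical forward differences.

    Differences wrap around the image border (periodic boundary conditions).
    """
    rows, cols = len(image), len(image[0])
    total = 0.0
    for i in range(rows):
        for j in range(cols):
            dx = image[i][(j + 1) % cols] - image[i][j]
            dy = image[(i + 1) % rows][j] - image[i][j]
            total += tl1(dx, a) + tl1(dy, a)
    return total


# Large a approaches the l1 penalty |t|; small a approaches a 0/1 indicator of t != 0.
near_l1 = tl1(0.5, 1e6)
near_l0 = tl1(0.5, 1e-6)
```

Unlike $|t|$, this penalty stays below $a + 1$ however large the gradient becomes, so strong edges are not penalized in proportion to their height. That bounded growth is the property usually credited with better contrast preservation than convex TV.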

[419] Multimodal Optical Imaging Platform for Quantitative Burn Assessment

Nathaniel Hanson, Mateusz Wolak, Jonathan Richardson, Patrick Walker, David M. Burmeister, Chakameh Jafari

Main category: eess.IV

TL;DR: Multimodal optical imaging framework combining hyperspectral imaging and laser speckle contrast for quantitative burn severity assessment, enabling deep-tissue analysis using SWIR wavelengths for improved burn classification.

Motivation: Current lack of objective methods for detecting subsurface tissue damage in burns, especially critical in battlefield and mass-casualty settings where rapid evaluation is essential for triage and surgical decisions.

Method: Integrated broadband hyperspectral imaging (400-2100 nm) with laser speckle contrast imaging to evaluate biochemical composition and microvascular perfusion. Used SWIR wavelengths (>1000 nm) to develop deep-tissue parameters linked to water, lipid, and collagen absorption features.

Result: Developed and validated novel deep-tissue parameters that enhance burn-tissue separability and burn severity classification. Implemented unsupervised learning methods for spectral feature extraction, band down-selection, and clustering validated against histology.

Conclusion: Established foundation for a compact, low-SWaP field-deployable device for early quantitative burn evaluation in austere environments, providing objective assessment of burn severity at injury onset.

Abstract: Accurate assessment of burn severity at injury onset remains a major clinical challenge due to the lack of objective methods for detecting subsurface tissue damage. This limitation is critical in battlefield and mass-casualty settings, where rapid and reliable evaluation of burn depth is essential for triage and surgical decision-making. We present a multimodal optical imaging framework that establishes the foundation for a compact, low-size, weight, and power (low-SWaP) field-deployable device for quantitative burn assessment. The system integrates broadband hyperspectral imaging (VSWIR, 400 – 2100 nm) with laser speckle contrast imaging to jointly evaluate biochemical composition and microvascular perfusion. Using short-wave infrared (SWIR, >1000 nm) wavelengths, we developed and validated novel deep-tissue parameters linked to water, lipid, and collagen absorption features that enhance burn-tissue separability and burn severity classification. We implemented and validated unsupervised learning methods for spectral feature extraction, band down-selection, and clustering against histology, establishing a foundation for a rugged, data-driven device for early quantitative burn evaluation in austere environments.

[420] Event-based Data Format Standard (EVT+)

Jonah P. Sengupta, Mohammad Imran Vakil, Thanh M. Dang, Ian Pardee, Paul Coen, Olivia Aul

Main category: eess.IV

TL;DR: Proposes a standard data format for Event-based Sensing (EBS) hardware to address diverse output formats and enable interoperability across current and future EBS technologies.

Motivation: EBS hardware is proliferating with diverse output formats, and future sensors may introduce incompatible data schemas, creating a need for standardization to ensure interoperability.

Method: Define a sensor-agnostic standard that incorporates current sensor configurations and modalities while providing placeholders for future developments in EBS technology.

Result: A proposed standard document that identifies requirements for EBS streaming data format to accommodate both existing and emerging sensor technologies.

Conclusion: Establishing a standard now is crucial for EBS technology’s widespread adoption and future compatibility across different vendors and applications.

Abstract: Event-based Sensing (EBS) hardware is quickly proliferating while finding a foothold in many commercial, industrial, and defense applications. At present, there are a handful of technologically mature systems which produce data streams with diverse output formats. In the near future it is anticipated that vendors will offer new sensor hardware which could also yield unique data schemas that are not aligned with past efforts. Thus, due to the relatively nascent nature of the technology and its potential for widespread use in a variety of applications, it is an opportune time to define a standard for this class of sensors’ output data. The intent of this document is to identify and provide a standard for the collected EBS streaming data. The main objective of the standard is to be sensor agnostic, incorporate some of the current sensor configurations and modalities, and account for the developing configurations and modalities. The intent is also to leave enough placeholders and space in the standard for future variations that may develop as EBS technology matures.

[421] DeepContrast: Deep Tissue Contrast Enhancement using Synthetic Data Degradations and OOD Model Predictions

Nuno Pimpão Martins, Yannis Kalaidzidis, Marino Zerial, Florian Jug

Main category: eess.IV

TL;DR: A deep learning method that synthetically degrades microscopy images to train neural networks for improving image quality without requiring ground truth data, with iterative predictions balancing contrast enhancement and detail preservation.

Motivation: Microscopy images suffer from degradations like noise and blur, especially in deep tissue regions, making ground truth data acquisition impossible. This prevents effective use of deep learning methods that require clean training data.

Method: First synthetically degrade microscopy images using an approximate forward model for deep tissue degradations, then train a neural network to learn the inverse degradation function from pairs of raw and synthetically degraded images.

Result: Networks trained this way can improve image quality out-of-distribution, including raw microscope data. Iterative predictions progressively improve contrast but gradually remove detailed structures, requiring a balance based on downstream analysis needs.

Conclusion: The proposed method successfully circumvents the need for unobtainable ground truth data by using synthetic degradation, enabling quality improvement of microscopy images while highlighting the trade-off between contrast enhancement and detail preservation.

Abstract: Microscopy images are crucial for life science research, allowing detailed inspection and characterization of cellular and tissue-level structures and functions. However, microscopy data are unavoidably affected by image degradations, such as noise, blur, or others. Many such degradations also contribute to a loss of image contrast, which becomes especially pronounced in deeper regions of thick samples. Today, the best-performing methods to increase the quality of images are based on Deep Learning approaches, which typically require ground truth (GT) data during training. Our inability to counteract blurring and contrast loss when imaging deep into samples prevents the acquisition of such clean GT data. Because the forward process of blurring and contrast loss deep into tissue can be modeled, we were able to propose a new method that circumvents the problem of unobtainable GT data. To this end, we first synthetically degraded the quality of microscopy images even further by using an approximate forward model for deep tissue image degradations. Then we trained a neural network that learned the inverse of this degradation function from our generated pairs of raw and degraded images. We demonstrated that networks trained in this way can be used out-of-distribution (OOD) to improve the quality of less severely degraded images, e.g. the raw data imaged in a microscope. Since the absolute level of degradation in such microscopy images can be stronger than the additional degradation introduced by our forward model, we also explored the effect of iterative predictions. Here, we observed that in each iteration the measured image contrast kept improving while detailed structures in the images were increasingly removed. Therefore, depending on the desired downstream analysis, a balance between contrast improvement and retention of image details has to be found.
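
The pairing scheme described above can be sketched in a few lines: the raw image serves as the training target, and a further-degraded copy serves as the input. The forward model below (a 1D box blur followed by contrast reduction toward the mean) is a deliberately simple stand-in; the abstract does not specify the paper's forward model, and all names and parameters here are hypothetical.

```python
def degrade(signal, blur_radius=1, contrast=0.5):
    """Toy forward model: box blur followed by contrast reduction toward the mean."""
    n = len(signal)
    blurred = []
    for i in range(n):
        lo, hi = max(0, i - blur_radius), min(n, i + blur_radius + 1)
        blurred.append(sum(signal[lo:hi]) / (hi - lo))
    mean = sum(blurred) / n
    return [mean + contrast * (v - mean) for v in blurred]


def make_training_pair(raw):
    """Network input is the further-degraded signal; the target is the raw signal."""
    return degrade(raw), raw


def intensity_range(signal):
    """Crude contrast measure: difference between brightest and darkest value."""
    return max(signal) - min(signal)


raw = [1.0, 3.0, 1.0, 3.0, 1.0, 3.0]  # hypothetical 1D intensity profile
net_input, target = make_training_pair(raw)
```

A network trained on such pairs learns to undo one application of the forward model. Applying it to raw data, which is less degraded than anything it saw as input, is the out-of-distribution use the abstract refers to, and applying it repeatedly corresponds to the iterative predictions.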

[422] RN-SDEs: Limited-Angle CT Reconstruction with Residual Null-Space Diffusion Stochastic Differential Equations

Jiaqi Guo, Santiago Lopez-Tapia, Wing Shun Li, Yunnan Wu, Marcelo Carignano, Martin Kröger, Vinayak P. Dravid, Igal Szleifer, Vadim Backman, Aggelos K. Katsaggelos

Main category: eess.IV

TL;DR: RN-SDEs use mean-reverting stochastic differential equations as diffusion models to solve Limited Angle CT reconstruction by combining learned priors with data consistency through Range-Null Space Decomposition.

Motivation: Limited Angle Computed Tomography (LACT) suffers from missing scanning angles causing artifacts and distortion, creating an ill-posed reconstruction problem that needs effective solutions.

Method: Proposed Residual Null-Space Diffusion Stochastic Differential Equations (RN-SDEs) using mean-reverting SDEs as diffusion models, leveraging learned priors and emphasizing data consistency via Range-Null Space Decomposition rectification.

Result: RN-SDEs restore high-quality images from severe degradation and achieve state-of-the-art performance on ChromSTEM and C4KC-KiTS datasets, with superior computational efficiency.

Conclusion: RN-SDEs provide an effective solution for LACT reconstruction by combining diffusion modeling with data consistency constraints, demonstrating strong performance and computational advantages.

Abstract: Computed tomography is a widely used imaging modality with applications ranging from medical imaging to material analysis. One major challenge arises from the lack of scanning information at certain angles, leading to distorted CT images with artifacts. This results in an ill-posed problem known as the Limited Angle Computed Tomography (LACT) reconstruction problem. To address this problem, we propose Residual Null-Space Diffusion Stochastic Differential Equations (RN-SDEs), which are a variant of diffusion models that characterize the diffusion process with mean-reverting (MR) stochastic differential equations. To demonstrate the generalizability of RN-SDEs, our experiments are conducted on two different LACT datasets, i.e., ChromSTEM and C4KC-KiTS. Through extensive experiments, we show that by leveraging learned Mean-Reverting SDEs as a prior and emphasizing data consistency using Range-Null Space Decomposition (RNSD) based rectification, RN-SDEs can restore high-quality images from severe degradation and achieve state-of-the-art performance in most LACT tasks. Additionally, we present a quantitative comparison of computational complexity and runtime efficiency, highlighting the superior effectiveness of our proposed approach.
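
Range-null space decomposition writes any candidate image as $x = A^{+}y + (I - A^{+}A)\hat{x}$: the measured component comes from the data $y$, and only the unmeasured (null-space) component is taken from the model's estimate $\hat{x}$. The sketch below illustrates this with a 0/1 diagonal operator $A$, for which $A^{+} = A$. That operator is a toy assumption; in LACT the forward operator is a limited-angle projection and its pseudo-inverse is not a simple mask. How the rectification is interleaved with the SDE sampling steps is not described in the abstract and is not shown.

```python
def rnsd_rectify(estimate, measurement, mask):
    """Rectify an estimate via x = A^+ y + (I - A^+ A) x_hat for a 0/1 diagonal A.

    With a diagonal 0/1 operator, A^+ = A, so measured entries are copied
    from the measurement and unmeasured entries are kept from the estimate.
    """
    return [m * y + (1 - m) * x for x, y, m in zip(estimate, measurement, mask)]


# Hypothetical values: the first two entries are observed, the last two are not.
truth = [1.0, 2.0, 3.0, 4.0]
mask = [1, 1, 0, 0]
measurement = [m * t for m, t in zip(mask, truth)]
estimate = [0.5, 2.5, 2.8, 4.1]  # e.g. the output of a learned prior
rectified = rnsd_rectify(estimate, measurement, mask)

# Data consistency: applying A to the rectified result reproduces the measurement.
reprojected = [m * r for m, r in zip(mask, rectified)]
```

The rectified result agrees with the measurement wherever data exist, so the learned prior only has to fill in the part of the image the scanner never observed.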

[423] Style Content Decomposition-based Data Augmentation for Domain Generalizable Medical Image Segmentation

Zhiqiang Shen, Peng Cao, Jinzhu Yang, Osmar R. Zaiane, Zhaolin Chen

Main category: eess.IV

TL;DR: StyCona is a plug-and-play data augmentation method that decomposes medical images into style and content components, then augments both to improve domain generalization in segmentation tasks without changing model architecture.

Motivation: Medical imaging models suffer performance degradation due to domain shifts across different modalities, which include global style differences (illumination, contrast) and local content differences (anatomical structures).

Method: Factorize images into style codes and content maps, then perform Style-Content decomposition-based data augmentation (StyCona) on both global style and local content of source-domain images.

Result: Experiments on cardiac MRI and fundus photography segmentation show StyCona substantially improves model generalization and outperforms state-of-the-art domain generalization methods.

Conclusion: StyCona is an effective plug-and-play module that enables training well-generalized segmentation models without additional parameters or architecture changes.

Abstract: Due to domain shifts across diverse medical imaging modalities, learned segmentation models often suffer significant performance degradation during deployment. We posit that these domain shifts can generally be categorized into two main components: 1) “style” shifts, referring to global disparities in image properties such as illumination, contrast, and color; and 2) “content” shifts, which involve local discrepancies in anatomical structures. To address the domain shifts in medical image segmentation, we first factorize an image into style codes and content maps, explicitly modeling the “style” and “content” components. Building on this, we introduce a Style-Content decomposition-based data augmentation algorithm (StyCona), which performs augmentation on both the global style and local content of source-domain images, enabling the training of a well-generalized model for domain generalizable medical image segmentation. StyCona is a simple yet effective plug-and-play module that substantially improves model generalization without requiring additional training parameters or modifications to segmentation model architectures. Experiments on cardiac magnetic resonance imaging and fundus photography segmentation tasks, with single and multiple target domains respectively, demonstrate the effectiveness of StyCona and its superiority over state-of-the-art domain generalization methods. The code is available at https://github.com/Senyh/StyCona.

[424] TVC: Tokenized Video Compression with Ultra-Low Bit Rate

Lebin Zhou, Cihan Ruan, Nam Ling, Zhenghao Chen, Wei Wang, Wei Jiang

Main category: eess.IV

TL;DR: Tokenized Video Compression (TVC) is a dual-stream framework using discrete and continuous tokens for ultra-low bit rate video compression, achieving high perceptual quality through strategic masking and multi-scale fusion.

Motivation: Tokenized visual representations show promise in image compression but face challenges in video due to complex temporal dynamics and strict bit rate constraints, motivating the development of a token-based video compression solution.

Method: Uses Cosmos video tokenizer to extract discrete and continuous tokens, applies strategic masking and lossless compression to discrete tokens, quantizes continuous tokens, and fuses streams with ControlNet-based multi-scale integration.

Result: The framework operates effectively at ultra-low bit rates while maintaining high perceptual quality and stable fidelity in video reconstruction.

Conclusion: TVC demonstrates the practicality of tokenized video compression and opens new directions for semantics-aware, token-native video compression approaches.

Abstract: Tokenized visual representations have shown promise in image compression, yet their extension to video remains underexplored due to the challenges posed by complex temporal dynamics and stringent bit rate constraints. In this paper, we present tokenized video compression (TVC), a token-based dual-stream framework designed to operate effectively at ultra-low bit rates. TVC leverages the Cosmos video tokenizer to extract both discrete and continuous token streams. The discrete tokens are partially masked using a strategic masking scheme and then compressed losslessly with a discrete checkerboard context model to reduce transmission overhead. The masked tokens are reconstructed by a decoder-only Transformer with spatiotemporal token prediction. In parallel, the continuous tokens are quantized and compressed using a continuous checkerboard context model, providing complementary continuous information at ultra-low bit rates. At the decoder side, the two streams are fused with a ControlNet-based multi-scale integration module, ensuring high perceptual quality alongside stable fidelity in reconstruction. Overall, this work illustrates the practicality of tokenized video compression and points to new directions for semantics-aware, token-native approaches.

[425] Constructed Realities? Technical and Contextual Anomalies in a High-Profile Image

Matthias Wjst

Main category: eess.IV

TL;DR: Forensic analysis reveals inconsistencies in a widely circulated photo of Andrew Mountbatten-Windsor, Virginia Giuffre, and Ghislaine Maxwell, suggesting possible digital manipulation.

DetailsMotivation: To conduct forensic assessment of a pivotal photograph in public discourse and legal narratives involving high-profile individuals.

Method: Comparative analysis of multiple published versions examining lighting, posture, and physical interaction inconsistencies.

Result: Identified technical anomalies that are more compatible with digital compositing than with an unaltered snapshot, but definitive conclusions are unattainable because no original print or verifiable audit trail is available.

Conclusion: The photograph may have been deliberately constructed and remains an unresolved yet symbolically charged artifact in a complex story of abuse, memory, and contested truth.

Abstract: This study offers a forensic assessment of a widely circulated photograph featuring Andrew Mountbatten-Windsor, Virginia Giuffre, and Ghislaine Maxwell, an image that has played a pivotal role in public discourse and legal narratives. A comparison of multiple published versions reveals many inconsistencies, including irregularities in lighting, posture, and physical interaction, which are more compatible with digital compositing than with an unaltered snapshot. Because no original print is available and, crucially, because a verifiable audit trail cannot be demanded for a potentially fabricated image, definitive conclusions remain unattainable. Even so, the technical and contextual anomalies indicate that the photograph may have been deliberately constructed. In the absence of further evidence, it remains an unresolved yet symbolically charged artifact within a complex story of abuse, memory, and contested truth.

[426] The Role of Radiographic Knee Alignment in Total Knee Replacement Outcomes and Opportunities for Artificial Intelligence-Driven Assessment

Zhisen Hu, Dominic Cullen, David S. Johnson, Aleksei Tiulpin, Timothy F. Cootes, Claudia Lindner

Main category: eess.IV

TL;DR: Automated knee alignment assessment from standard AP radiographs could help predict TKR outcomes more efficiently than manual methods using long-leg radiographs.

DetailsMotivation: Knee OA is a major health burden, and predicting poor TKR outcomes is crucial for patient selection and management. Traditional alignment measurement is manual, time-consuming, and requires specialized long-leg radiographs that aren't always available.

Method: Proposes automated methods for alignment assessment using standard anteroposterior (AP) knee radiographs instead of long-leg radiographs.

Result: The abstract reports no explicit results; it suggests that automated alignment assessment from standard radiographs is potentially clinically valuable.

Conclusion: Automated alignment assessment from standard knee radiographs has potential clinical value for improving efficiency in the knee OA treatment pathway by enabling better prediction of TKR outcomes.

Abstract: Knee osteoarthritis (OA) is one of the most widespread and burdensome health problems [1-4]. Total knee replacement (TKR) may be offered as treatment for end-stage knee OA. Nevertheless, TKR is an invasive procedure involving prosthesis implantation at the knee joint, and around 10% of patients are dissatisfied following TKR [5,6]. Dissatisfaction is often assessed through patient-reported outcome measures (PROMs) [7], which are usually completed by patients and assessed by health professionals to evaluate the condition of TKR patients. In clinical practice, predicting poor TKR outcomes in advance could help optimise patient selection and improve management strategies. Radiographic knee alignment is an important biomarker for predicting TKR outcomes and long-term joint health. Abnormalities such as femoral or tibial deformities can directly influence surgical planning, implant selection, and postoperative recovery [8,9]. Traditional alignment measurement is manual, time-consuming, and requires long-leg radiographs, which are not always undertaken in clinical practice. Instead, standard anteroposterior (AP) knee radiographs are often the main imaging modality. Automated methods for alignment assessment in standard knee radiographs are potentially clinically valuable for improving efficiency in the knee OA treatment pathway.

[427] Electromagnetic Quantitative Inversion for Translationally Moving Targets via Phase Correlation Registration of Back-Projection Images

Yitao Lin, Dahai Dai, Shilong Sun, Yuchen Wu, Bo Pang

Main category: eess.IV

TL;DR: Proposes RMC-CC-CSI algorithm for electromagnetic inversion of moving targets using phase correlation registration and relative motion compensation with TDM-MIMO radar.

DetailsMotivation: To address the challenge of electromagnetic quantitative inversion for translationally moving targets, which conventional methods struggle with due to motion-induced artifacts.

Method: Uses TDM-MIMO radar with phase correlation registration of BP images for precise relative positioning, then applies relative motion compensation and iterative inversion with CC-CSI algorithm.

Result: RMC-CC-CSI shows accelerated convergence, enhanced reconstruction fidelity, and improved noise immunity compared to conventional CC-CSI for stationary targets.

Conclusion: The proposed framework handles moving-target inversion and, in the reported numerical and experimental results, outperforms conventional CC-CSI for stationary targets, though at increased computational cost.

Abstract: A novel electromagnetic quantitative inversion scheme for translationally moving targets via phase correlation registration of back-projection (BP) images is proposed. Based on a time division multiplexing multiple-input multiple-output (TDM-MIMO) radar architecture, the scheme first achieves high-precision relative positioning of the target, then applies relative motion compensation to perform iterative inversion on multi-cycle MIMO measurement data, thereby reconstructing the target’s electromagnetic parameters. The scheme is a general framework compatible with other mainstream inversion algorithms; we exemplify it by incorporating the classical cross-correlated contrast source inversion (CC-CSI) into the iterative optimization step of the scheme, resulting in a new algorithm termed RMC-CC-CSI. Numerical and experimental results demonstrate that RMC-CC-CSI offers accelerated convergence, enhanced reconstruction fidelity, and improved noise immunity over conventional CC-CSI for stationary targets, despite increased computational cost.
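
The registration step named in the abstract relies on phase correlation, which can be sketched in a few lines. The example below is a toy illustration only: it uses a naive 2D DFT from the standard library, a made-up 8x8 image, integer circular shifts, and no windowing or sub-pixel refinement. The helper names (`dft2`, `circular_shift`, `phase_correlation_shift`) are hypothetical and nothing here reproduces the paper's BP imaging or RMC-CC-CSI pipeline.

```python
import cmath


def dft2(img, inverse=False):
    """Naive 2D discrete Fourier transform (fine for tiny toy images)."""
    rows, cols = len(img), len(img[0])
    sign = 1 if inverse else -1
    out = [[0j] * cols for _ in range(rows)]
    for u in range(rows):
        for v in range(cols):
            total = 0j
            for r in range(rows):
                for c in range(cols):
                    angle = sign * 2 * cmath.pi * (u * r / rows + v * c / cols)
                    total += img[r][c] * cmath.exp(1j * angle)
            out[u][v] = total / (rows * cols) if inverse else total
    return out


def circular_shift(img, dy, dx):
    """Shift an image down by dy and right by dx with wrap-around."""
    rows, cols = len(img), len(img[0])
    return [[img[(r - dy) % rows][(c - dx) % cols] for c in range(cols)]
            for r in range(rows)]


def phase_correlation_shift(ref, moved, eps=1e-12):
    """Estimate the circular shift (dy, dx) that maps ref onto moved."""
    rows, cols = len(ref), len(ref[0])
    F, G = dft2(ref), dft2(moved)
    cross = [[0j] * cols for _ in range(rows)]
    for u in range(rows):
        for v in range(cols):
            prod = G[u][v] * F[u][v].conjugate()
            mag = abs(prod)
            # Keep only the phase; skip frequencies with no energy.
            cross[u][v] = prod / mag if mag > eps else 0j
    corr = dft2(cross, inverse=True)
    # The correlation surface peaks at the shift (modulo the image size).
    return max(((r, c) for r in range(rows) for c in range(cols)),
               key=lambda rc: corr[rc[0]][rc[1]].real)


# Toy reference image: two bright pixels on an 8x8 grid (made-up data).
ref = [[0.0] * 8 for _ in range(8)]
ref[1][2] = 3.0
ref[4][5] = 1.0
moved = circular_shift(ref, 2, 3)
estimated = phase_correlation_shift(ref, moved)
```

For this toy input the peak lands at the applied shift, (2, 3). Shifts are recovered modulo the image size, so a real registration step would also need to map large indices back to negative displacements.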

Last updated: 2025-11-28