Daily arXiv Papers - 2025-08-20

AI-enhanced summaries of 23 research papers from arXiv

Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

Table of Contents

cs.CL

[1] Fair Play in the Newsroom: Actor-Based Filtering Gender Discrimination in Text Corpora

Stefanie Urchs, Veronika Thurner, Matthias Aßenmacher, Christian Heumann, Stephanie Thiemichen

Main category: cs.CL

TL;DR: Extended pipeline for detecting and mitigating gender discrimination in large-scale text corpora using actor-level metrics for sentiment, syntactic agency, and quotation analysis, applied to a German newspaper corpus with improved gender balance but persistent subtle biases.

Motivation: Large language models reflect structural gender imbalances from training data, requiring better methods to detect and mitigate gender discrimination in text corpora.

Method: Extended actor-level pipeline with new metrics for sentiment, syntactic agency, and quotation asymmetries, supporting diagnostic analysis and exclusion-based balancing for fairer corpus construction.

Result: Substantial improvements in gender balance across multiple linguistic dimensions in German newspaper corpus (1980-2024), though subtle biases in sentiment and framing persist despite surface-level mitigation.

Conclusion: While surface-level gender asymmetries can be addressed through filtering, subtler forms of bias require ongoing attention; tools released for discourse-based fairness auditing and equitable corpus construction.

Abstract: Large language models are increasingly shaping digital communication, yet their outputs often reflect structural gender imbalances that originate from their training data. This paper presents an extended actor-level pipeline for detecting and mitigating gender discrimination in large-scale text corpora. Building on prior work in discourse-aware fairness analysis, we introduce new actor-level metrics that capture asymmetries in sentiment, syntactic agency, and quotation styles. The pipeline supports both diagnostic corpus analysis and exclusion-based balancing, enabling the construction of fairer corpora. We apply our approach to the taz2024full corpus of German newspaper articles from 1980 to 2024, demonstrating substantial improvements in gender balance across multiple linguistic dimensions. Our results show that while surface-level asymmetries can be mitigated through filtering and rebalancing, subtler forms of bias persist, particularly in sentiment and framing. We release the tools and reports to support further research in discourse-based fairness auditing and equitable corpus construction.

[2] MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents

Shilong Li, Xingyuan Bu, Wenjie Wang, Jiaheng Liu, Jun Dong, Haoyang He, Hao Lu, Haozhe Zhang, Chenchen Jing, Zhen Li, Chuanhao Li, Jiayi Tian, Chenchen Zhang, Tianhao Peng, Yancheng He, Jihao Gu, Yuanxing Zhang, Jian Yang, Ge Zhang, Wenhao Huang, Wangchunshu Zhou, Zhaoxiang Zhang, Ruizhe Ding, Shilei Wen

Main category: cs.CL

TL;DR: MM-BrowseComp is a new benchmark with 224 multimodal questions that test AI agents’ ability to retrieve and reason with visual content during web browsing, revealing significant gaps in current models’ multimodal capabilities.

Motivation: Existing web browsing benchmarks focus primarily on textual information, overlooking the prevalence of multimodal content (images, videos) that requires visual reasoning capabilities.

Method: Created 224 hand-crafted challenging questions that incorporate images in prompts and require retrieving information from visual content on webpages, with verified checklists for fine-grained analysis.

Result: State-of-the-art models like OpenAI o3 with tools achieve only 29.02% accuracy, demonstrating suboptimal multimodal capabilities and lack of native multimodal reasoning.

Conclusion: Current AI models lack sufficient multimodal reasoning abilities for complex web browsing tasks that involve visual content, highlighting the need for improved multimodal integration in browsing agents.

Abstract: AI agents with advanced reasoning and tool use capabilities have demonstrated impressive performance in web browsing for deep search. While existing benchmarks such as BrowseComp evaluate these browsing abilities, they primarily focus on textual information, overlooking the prevalence of multimodal content. To bridge this gap, we introduce MM-BrowseComp, a novel benchmark comprising 224 challenging, hand-crafted questions specifically designed to assess agents' multimodal retrieval and reasoning capabilities. These questions often incorporate images in prompts, and crucial information encountered during the search and reasoning process may also be embedded within images or videos on webpages. Consequently, methods relying solely on text prove insufficient for our benchmark. Additionally, we provide a verified checklist for each question, enabling fine-grained analysis of multimodal dependencies and reasoning paths. Our comprehensive evaluation of state-of-the-art models on MM-BrowseComp reveals that even top models like OpenAI o3 with tools achieve only 29.02% accuracy, highlighting the suboptimal multimodal capabilities and lack of native multimodal reasoning in current models.

[3] Overcoming Latency Bottlenecks in On-Device Speech Translation: A Cascaded Approach with Alignment-Based Streaming MT

Zeeshan Ahmed, Frank Seide, Niko Moritz, Ju Lin, Ruiming Xie, Simone Merello, Zhe Liu, Christian Fuegen

Main category: cs.CL

TL;DR: Real-time streaming speech translation approach that integrates ASR and MT with efficient context management and beam-search pruning to reduce latency while maintaining quality.

Motivation: Addressing challenges in real-time on-device streaming speech translation, particularly the difficulty of achieving low-latency translation while maintaining quality compared to non-streaming systems.

Method: Proposes simultaneous translation approach using RNN-T based ASR systems, leveraging linguistic cues for context management, and employing efficient beam-search pruning techniques like time-out and forced finalization to maintain real-time performance.

Result: The approach outperforms baselines in both latency and quality metrics, significantly narrowing the quality gap with non-streaming translation systems for on-device bilingual conversational speech translation.

Conclusion: The techniques enable more accurate and efficient real-time speech translation, paving the way for practical on-device streaming translation applications.

Abstract: This paper tackles several challenges that arise when integrating Automatic Speech Recognition (ASR) and Machine Translation (MT) for real-time, on-device streaming speech translation. Although state-of-the-art ASR systems based on Recurrent Neural Network Transducers (RNN-T) can perform real-time transcription, achieving streaming translation in real-time remains a significant challenge. To address this issue, we propose a simultaneous translation approach that effectively balances translation quality and latency. We also investigate efficient integration of ASR and MT, leveraging linguistic cues generated by the ASR system to manage context and utilizing efficient beam-search pruning techniques such as time-out and forced finalization to maintain system’s real-time factor. We apply our approach to an on-device bilingual conversational speech translation and demonstrate that our techniques outperform baselines in terms of latency and quality. Notably, our technique narrows the quality gap with non-streaming translation systems, paving the way for more accurate and efficient real-time speech translation.

[4] Stands to Reason: Investigating the Effect of Reasoning on Idiomaticity Detection

Dylan Phelps, Rodrigo Wilkens, Edward Gow-Smith, Thomas Pickard, Maggie Mi, Aline Villavicencio

Main category: cs.CL

TL;DR: Reasoning capabilities in LLMs have limited and varied impact on idiomaticity detection, with smaller models benefiting less from chain-of-thought reasoning than larger models.

Motivation: To explore how reasoning capabilities affect idiomaticity detection performance in LLMs and examine the effect of model size, since understanding idiomatic expressions requires logical reasoning steps.

Method: Evaluated DeepSeek-R1 distillation models (1.5B to 70B parameters) across four idiomaticity detection datasets, testing chain-of-thought reasoning and providing definitions in prompts for smaller models.

Result: The effect of reasoning was smaller than expected: for smaller models, CoT improved performance over the Math-tuned intermediate models but not back to base-model levels, while larger models (14B, 32B, 70B) showed modest improvements. Larger models demonstrated good understanding of idiomaticity, while smaller models often failed to output the actual meaning of expressions.

Conclusion: Providing definitions in prompts can improve performance for smaller models, but reasoning capabilities have limited impact on idiomaticity detection overall, with model size being a more significant factor.

Abstract: The recent trend towards utilisation of reasoning models has improved the performance of Large Language Models (LLMs) across many tasks which involve logical steps. One linguistic task that could benefit from this framing is idiomaticity detection, as a potentially idiomatic expression must first be understood before it can be disambiguated and serves as a basis for reasoning. In this paper, we explore how reasoning capabilities in LLMs affect idiomaticity detection performance and examine the effect of model size. We evaluate, as open source representative models, the suite of DeepSeek-R1 distillation models ranging from 1.5B to 70B parameters across four idiomaticity detection datasets. We find the effect of reasoning to be smaller and more varied than expected. For smaller models, producing chain-of-thought (CoT) reasoning increases performance from Math-tuned intermediate models, but not to the levels of the base models, whereas larger models (14B, 32B, and 70B) show modest improvements. Our in-depth analyses reveal that larger models demonstrate good understanding of idiomaticity, successfully producing accurate definitions of expressions, while smaller models often fail to output the actual meaning. For this reason, we also experiment with providing definitions in the prompts of smaller models, which we show can improve performance in some cases.

[5] Whispering Context: Distilling Syntax and Semantics for Long Speech Transcripts

Duygu Altinok

Main category: cs.CL

TL;DR: Novel approach enhances ASR by distilling contextual knowledge from LLaMA models into Whisper using token-level distillation with optimal transport and representation loss minimization, achieving significant improvements in WER, NER, capitalization, and punctuation on long audio transcripts.

Motivation: ASR systems struggle with maintaining syntactic and semantic accuracy in long audio transcripts, impacting tasks like Named Entity Recognition, capitalization, and punctuation.

Method: Two strategies: (1) token level distillation with optimal transport to align dimensions and sequence lengths, (2) representation loss minimization between sentence embeddings of Whisper and LLaMA to blend syntax and semantics.
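
For intuition, here is a minimal PyTorch sketch of the two loss terms described above, with a softmax soft-alignment standing in for the optimal-transport coupling; the dimensions, projection layer, and pooling choice are assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def distillation_losses(student_hidden, teacher_hidden, proj):
    """Sketch of the two losses: (1) token-level alignment (the paper uses
    optimal transport; a softmax soft-alignment stands in here) and
    (2) a sentence-embedding representation loss via mean pooling.

    student_hidden: (S, d_s) Whisper-side states; teacher_hidden: (T, d_t) LLaMA states.
    """
    student = proj(student_hidden)                    # project to teacher dimension
    cost = torch.cdist(student, teacher_hidden)       # (S, T) pairwise distances
    coupling = F.softmax(-cost, dim=-1)               # soft assignment per student token
    token_loss = F.mse_loss(student, coupling @ teacher_hidden)
    sent_loss = F.mse_loss(student.mean(dim=0), teacher_hidden.mean(dim=0))
    return token_loss + sent_loss

# Toy shapes standing in for real encoder states (dimensions are assumptions).
proj = torch.nn.Linear(384, 4096)
loss = distillation_losses(torch.randn(120, 384), torch.randn(95, 4096), proj)
print(loss.item())
```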

Result: Significant improvements in Word Error Rate (WER), NER, capitalization, and punctuation success on the Spoken Wikipedia dataset (long audios with rich entities).

Conclusion: The work highlights the value of integrating linguistic context into transcription and sets a foundation for robust, context-aware ASR in longform speech.

Abstract: ASR systems often struggle with maintaining syntactic and semantic accuracy in long audio transcripts, impacting tasks like Named Entity Recognition (NER), capitalization, and punctuation. We propose a novel approach that enhances ASR by distilling contextual knowledge from LLaMA models into Whisper. Our method uses two strategies: (1) token level distillation with optimal transport to align dimensions and sequence lengths, and (2) representation loss minimization between sentence embeddings of Whisper and LLaMA, blending syntax and semantics. Evaluations on the Spoken Wikipedia dataset, a benchmark with long audios and rich entities demonstrate significant improvements in Word Error Rate (WER), NER, capitalization, and punctuation success. By introducing novel NER metrics and exploring semantics aware ASR, our work highlights the value of integrating linguistic context into transcription, setting a foundation for robust, context-aware ASR in longform speech.

[6] Datarus-R1: An Adaptive Multi-Step Reasoning LLM for Automated Data Analysis

Ayoub Ben Chaliah, Hela Dellagi

Main category: cs.CL

TL;DR: Datarus-R1-14B is a 14B parameter language model fine-tuned from Qwen 2.5-14B-Instruct to serve as a virtual data analyst and graduate-level problem solver, featuring dual reasoning interfaces and trained on full analytical trajectories with a novel training pipeline.

Motivation: To create a more effective virtual data analyst that avoids common issues like format collapse and verbosity in RL-aligned LLMs, while providing both agentic code execution and compact reasoning capabilities for complex quantitative problems.

Method: The model uses a training pipeline with: (1) trajectory-centric synthetic data generator producing 144K tagged notebook episodes, (2) dual-reward framework combining structural signals with Hierarchical Reward Model scoring, (3) memory-optimized GRPO implementation with KV-cache reuse and reference-model sharding, and (4) cosine curriculum shifting from structural to semantic focus.
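
As a small illustration of component (4), a cosine curriculum can be expressed as a pair of weights that trade off a structural reward against a semantic one over training; the weight names and functional form below are assumptions, not the paper's exact schedule.

```python
import math

def reward_weights(step, total_steps):
    """Hypothetical cosine curriculum: the weight on the tag-based structural
    reward decays from 1 to 0 while the semantic (HRM) weight rises from 0 to 1."""
    structural = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return structural, 1.0 - structural

def combined_reward(step, total_steps, r_structural, r_semantic):
    w_struct, w_sem = reward_weights(step, total_steps)
    return w_struct * r_structural + w_sem * r_semantic

print(reward_weights(0, 1000))     # (1.0, 0.0) -- early training: structure dominates
print(reward_weights(500, 1000))   # (0.5, 0.5)
print(reward_weights(1000, 1000))  # (~0.0, ~1.0) -- late training: semantics dominates
```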

Result: Datarus achieves up to 30% higher accuracy on AIME 2024/2025 and LiveCodeBench benchmarks compared to similar-size models, reaching performance levels of larger models like QwQ-32B while emitting 18-49% fewer tokens per solution.

Conclusion: The approach demonstrates that training on full analytical trajectories with dual-reward optimization and curriculum learning can create highly efficient problem-solving models that avoid common pitfalls of contemporary systems while excelling at complex quantitative reasoning tasks.

Abstract: We present Datarus-R1-14B, a 14B-parameter open-weights language model fine-tuned from Qwen 2.5-14B-Instruct to act as a virtual data analyst and graduate-level problem solver. Datarus is trained not on isolated question-answer pairs but on full analytical trajectories including reasoning steps, code execution, error traces, self-corrections, and final conclusions, all captured in a ReAct-style notebook format spanning finance, medicine, numerical analysis, and other quantitative domains. Our training pipeline combines (i) a trajectory-centric synthetic data generator that yielded 144,000 tagged notebook episodes, (ii) a dual-reward framework blending a lightweight tag-based structural signal with a Hierarchical Reward Model (HRM) that scores both single-step soundness and end-to-end coherence, and (iii) a memory-optimized implementation of Group Relative Policy Optimization (GRPO) featuring KV-cache reuse, sequential generation, and reference-model sharding. A cosine curriculum smoothly shifts emphasis from structural fidelity to semantic depth, reducing the format collapse and verbosity that often plague RL-aligned LLMs. A central design choice in Datarus is its dual reasoning interface. In agentic mode the model produces ReAct-tagged steps that invoke Python tools to execute real code; in reflection mode it outputs compact Chain-of-Thought (CoT) traces delimited by dedicated tags. On demanding postgraduate-level problems, Datarus exhibits an “AHA-moment” pattern: it sketches hypotheses, revises them once or twice, and converges, avoiding the circular, token-inflating loops common to contemporary systems. Across standard public benchmarks Datarus surpasses similar-size models and even reaches the level of larger reasoning models such as QwQ-32B, achieving up to 30% higher accuracy on AIME 2024/2025 and LiveCodeBench while emitting 18-49% fewer tokens per solution.

[7] ALIGN: Word Association Learning for Cross-Cultural Generalization in Large Language Models

Chunhua Liu, Kabir Manandhar Shrestha, Sukai Huang

Main category: cs.CL

TL;DR: Parameter-efficient fine-tuning on native speakers’ free word-association norms improves cultural alignment in LLMs, boosting association accuracy and shifting value distributions toward target cultures without costly retraining.

Motivation: LLMs reflect distributional bias from over-represented languages and viewpoints in pre-training corpora, but modeling and aligning culture remains challenging due to limited cultural knowledge and a lack of effective learning approaches.

Method: Parameter-efficient fine-tuning on native speakers’ free word-association norms from Small-World-of-Words project (English-US and Mandarin), adapting Llama-3.1-8B and Qwen-2.5-7B via supervised fine-tuning (SFT) and PPO-based preference optimization.

Result: SFT boosts held-out association Precision at 5 by 16-20% in English and 43-165% in Mandarin, lifts median concreteness by +0.20, attains human-level valence and arousal. Fine-tuned models shift answer distributions toward target culture on World-Values-Survey questions, with Chinese-aligned responses doubling and US bias dropping by one-third on high-tension items. 7-8B models rival or beat vanilla 70B baselines.
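
For reference, the Precision at 5 metric cited above can be computed as in the short sketch below; the cue word and association lists are made up for illustration.

```python
def precision_at_5(predicted, gold_associates):
    """Precision at 5: fraction of the model's top-5 associates for a cue word
    that appear in the human free-association norms (illustrative computation)."""
    top5 = [w.lower() for w in predicted[:5]]
    gold = {w.lower() for w in gold_associates}
    return sum(w in gold for w in top5) / 5.0

# Hypothetical cue "bread": model's top-5 vs. human associates from the norms.
print(precision_at_5(
    ["butter", "toast", "flour", "water", "knife"],
    ["butter", "toast", "jam", "flour", "bakery"],
))  # 0.6
```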

Conclusion: A few million culture-grounded associations can instill value alignment without costly retraining, highlighting the promise of cognitive-grounded approaches for improving cultural alignment in AI models.

Abstract: As large language models (LLMs) increasingly mediate cross-cultural communication, their behavior still reflects the distributional bias of the languages and viewpoints that are over-represented in their pre-training corpora. Yet, it remains a challenge to model and align culture due to limited cultural knowledge and a lack of exploration into effective learning approaches. We introduce a cost-efficient, cognitively grounded remedy: parameter-efficient fine-tuning on native speakers’ free word-association norms, which encode implicit cultural schemas. Leveraging English-US and Mandarin associations from the Small-World-of-Words project, we adapt Llama-3.1-8B and Qwen-2.5-7B via supervised fine-tuning (SFT) and PPO-based preference optimization. SFT boosts held-out association Precision at 5 by 16-20% in English and 43-165% in Mandarin, lifts median concreteness by +0.20, and attains human-level valence and arousal. These lexical gains transfer: on World-Values-Survey questions, fine-tuned models shift answer distributions toward the target culture, and on a 50-item high-tension subset, Qwen’s Chinese-aligned responses double while Llama’s US bias drops by one-third. Our 7-8B models rival or beat vanilla 70B baselines, showing that a few million culture-grounded associations can instill value alignment without costly retraining. Our work highlights both the promise and the need for future research grounded in human cognition in improving cultural alignment in AI models.

[8] ProMed: Shapley Information Gain Guided Reinforcement Learning for Proactive Medical LLMs

Hongxin Ding, Baixiang Huang, Yue Fang, Weibin Liao, Xinke Jiang, Zheng Li, Junfeng Zhao, Yasha Wang

Main category: cs.CL

TL;DR: ProMed is a reinforcement learning framework that transforms medical LLMs from reactive to proactive question-asking agents using Shapley Information Gain rewards to quantify clinical utility of questions.

Motivation: Current medical LLMs operate reactively by generating answers without seeking additional information, which risks incorrect diagnoses in interactive clinical settings where physicians need to actively gather information from patients.

Method: Two-stage training pipeline: 1) SIG-Guided Model Initialization using Monte Carlo Tree Search to construct high-reward interaction trajectories, and 2) SIG-Augmented Policy Optimization with SIG-guided Reward Distribution Mechanism that assigns higher rewards to informative questions.
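
To make the Shapley component concrete, the sketch below shows a generic permutation-sampling Shapley estimate of a single question's contribution to a value function over question subsets; the value function and question names are toy assumptions, not the paper's SIG formulation.

```python
import random

def shapley_contribution(question, all_questions, value_fn, num_samples=200):
    """Permutation-sampling Shapley estimate of one question's contribution to a
    value function over question subsets (a generic sketch, not the paper's SIG)."""
    total = 0.0
    for _ in range(num_samples):
        order = list(all_questions)
        random.shuffle(order)
        coalition = set(order[:order.index(question)])   # questions asked before it
        total += value_fn(coalition | {question}) - value_fn(coalition)
    return total / num_samples

# Toy value function: diagnostic "information" grows with clinically useful questions.
informative = {"symptom onset", "fever", "medication history"}
value = lambda asked: len(asked & informative) / len(informative)
print(shapley_contribution("fever", ["symptom onset", "fever", "smoking history"], value))
```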

Result: On two newly curated partial-information medical benchmarks, ProMed outperforms state-of-the-art methods by an average of 6.29% and delivers a 54.45% gain over the reactive paradigm, while generalizing robustly to out-of-domain cases.

Conclusion: The proposed ProMed framework successfully transitions medical LLMs to a proactive questioning paradigm, significantly improving their diagnostic accuracy and clinical utility through reinforcement learning with Shapley Information Gain rewards.

Abstract: Interactive medical questioning is essential in real-world clinical consultations, where physicians must actively gather information from patients. While medical Large Language Models (LLMs) have shown impressive capabilities in static medical question answering, they predominantly operate under a reactive paradigm: generating answers directly without seeking additional information, which risks incorrect diagnoses in such interactive settings. To address this limitation, we propose ProMed, a reinforcement learning (RL) framework that transitions medical LLMs toward a proactive paradigm, equipping them with the ability to ask clinically valuable questions before decision-making. At the core of ProMed is the Shapley Information Gain (SIG) reward, which quantifies the clinical utility of each question by combining the amount of newly acquired information with its contextual importance, estimated via Shapley values. We integrate SIG into a two-stage training pipeline: (1) SIG-Guided Model Initialization uses Monte Carlo Tree Search (MCTS) to construct high-reward interaction trajectories to supervise the model, and (2) SIG-Augmented Policy Optimization, which integrates SIG and enhances RL with a novel SIG-guided Reward Distribution Mechanism that assigns higher rewards to informative questions for targeted optimization. Extensive experiments on two newly curated partial-information medical benchmarks demonstrate that ProMed significantly outperforms state-of-the-art methods by an average of 6.29% and delivers a 54.45% gain over the reactive paradigm, while also generalizing robustly to out-of-domain cases.

[9] Saudi-Dialect-ALLaM: LoRA Fine-Tuning for Dialectal Arabic Generation

Hassan Barmandah

Main category: cs.CL

TL;DR: This paper presents a method to improve Saudi dialect generation in Arabic LLMs using LoRA-tuning with dialect token control, achieving significant improvements in dialect accuracy and reducing MSA leakage.

Motivation: Arabic LLMs are dominated by Modern Standard Arabic with limited support for Saudi dialects like Najdi and Hijazi, which hinders their ability to capture authentic dialectal variation.

Method: Used a privately curated Saudi Dialect Instruction dataset (5,466 synthetic pairs) to LoRA-tune ALLaM-7B-Instruct-preview. Investigated two variants: Dialect-Token training (prepending explicit dialect tags) and No-Token training (omitting tags).
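
A minimal sketch of what Dialect-Token formatting and a LoRA setup could look like is given below; the tag strings, prompt template, and LoRA hyperparameters are assumptions, not the paper's reported configuration.

```python
from peft import LoraConfig

def format_example(instruction, response, dialect_tag=None):
    """Dialect-Token variant: prepend an explicit tag such as "[NAJDI]" or
    "[HIJAZI]"; the No-Token variant omits it. Tag strings and the prompt
    template are assumptions, not the paper's exact format."""
    prefix = f"{dialect_tag} " if dialect_tag else ""
    return f"### Instruction:\n{prefix}{instruction}\n### Response:\n{response}"

# Illustrative LoRA configuration; rank, alpha, and target modules are assumed
# values, not the reported hyperparameters for ALLaM-7B-Instruct-preview.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

print(format_example("Reply warmly to a dinner invitation.", "...", dialect_tag="[NAJDI]"))
```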

Result: Dialect-Token model achieved best control: Saudi dialect rate increased from 47.97% to 84.21%, MSA leakage reduced from 32.63% to 6.21%, with improved fidelity metrics (chrF++ +3.53, BERTScore +0.059). Both variants outperformed strong baseline models.

Conclusion: The approach successfully improves Saudi dialect generation in Arabic LLMs while avoiding metadata-tag echoing issues. Code and datasheet are released for verification, but dataset and model weights are not publicly available.

Abstract: Large language models (LLMs) for Arabic are still dominated by Modern Standard Arabic (MSA), with limited support for Saudi dialects such as Najdi and Hijazi. This underrepresentation hinders their ability to capture authentic dialectal variation. Using a privately curated Saudi Dialect Instruction dataset (Hijazi and Najdi; 5,466 synthetic instruction-response pairs; 50/50 split), we LoRA-tune ALLaM-7B-Instruct-preview, the first foundation model developed in Saudi Arabia, for Saudi dialect generation. We investigate two variants: (i) Dialect-Token training, which prepends an explicit dialect tag to the instruction, and (ii) No-Token training, which omits the tag at formatting time. Evaluation on a held-out test set combines an external dialect classifier with text fidelity metrics (chrF++ and BERTScore) and diversity measures. The Dialect-Token model achieves the best control, raising the Saudi rate from 47.97% to 84.21% and reducing MSA leakage from 32.63% to 6.21%; fidelity also improves (chrF++ +3.53, BERTScore +0.059). Both LoRA variants outperform strong generic instruction models (Falcon-7B-Instruct, Llama-3.1-8B-Instruct, Qwen-2.5-7B-Instruct, AceGPT-v2-8B-Chat, JAIS-13B-Chat) in dialect control and fidelity, while avoiding metadata-tag echoing that these baselines frequently exhibit. We do not release the dataset or any model weights/adapters; instead, we release training/evaluation/inference code and a detailed datasheet (schema and aggregate statistics) to support independent verification.

[10] MATA (māta): Mindful Assessment of the Telugu Abilities of Large Language Models

Chalamalasetti Kranti, Sowmya Vajjala

Main category: cs.CL

TL;DR: MATA is a new evaluation dataset for assessing LLM capabilities in Telugu language, featuring 729 multiple-choice and open-ended questions across diverse linguistic dimensions.

Motivation: To address the lack of comprehensive evaluation benchmarks for assessing Large Language Models' capabilities in low-resource languages like Telugu, and to understand model limitations through fine-grained analysis.

Method: Created a dataset of 729 carefully curated questions (multiple-choice and open-ended) spanning diverse linguistic dimensions. Evaluated 11 open-weight and closed-source LLMs, analyzed their performance patterns, and compared LLM-as-a-judge evaluation with human evaluation for open-ended questions.

Result: The study revealed that LLMs rely on superficial heuristics (answer position and distractor patterns) for multiple-choice questions. Also found that LLM-as-a-judge evaluation reliability varies in low-resource language contexts compared to human evaluation.

Conclusion: Fine-grained evaluation is essential for understanding model limitations and can inform development of more linguistically capable LLMs. MATA serves as a foundation for future research in Telugu NLP and highlights the need for robust evaluation methodologies in low-resource languages.

Abstract: In this paper, we introduce MATA, a novel evaluation dataset to assess the ability of Large Language Models (LLMs) in Telugu language, comprising 729 carefully curated multiple-choice and open-ended questions that span diverse linguistic dimensions. We evaluate 11 open-weight and closed-source LLMs on our dataset and present a fine-grained analysis of their performance. Further, we empirically show how LLMs rely on superficial heuristics such as answer position and distractor patterns for multiple-choice questions. Finally, we also compare LLM-as-a-judge evaluation with human evaluation for open-ended questions and draw some conclusions on its reliability in a low-resource language. We argue that such fine-grained evaluation is essential for understanding model limitations and can inform the development of more linguistically capable LLMs, while also serving as a foundation for future research in Telugu NLP.

[11] Compressed Models are NOT Trust-equivalent to Their Large Counterparts

Rohit Raj Rai, Chirag Kothari, Siddhesh Shelke, Amit Awekar

Main category: cs.CL

TL;DR: Compressed deep learning models may have similar accuracy to original large models but lack trust-equivalence due to poor interpretability alignment and calibration mismatch.

Motivation: To determine if compressed models can be trusted as drop-in replacements for large models, going beyond just accuracy parity to examine interpretability and calibration aspects.

Method: Proposed a two-dimensional framework: 1) interpretability alignment measured via LIME and SHAP tests, 2) calibration similarity assessed through ECE, MCE, Brier Score and reliability diagrams. Experiments used BERT-base and its compressed variants on text classification tasks.
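
For readers unfamiliar with the calibration metrics, the sketch below computes ECE, MCE, and the Brier score from predicted confidences and binary correctness; it is a simplified stand-in, not the authors' evaluation code.

```python
import numpy as np

def calibration_metrics(confidences, correct, n_bins=10):
    """ECE, MCE, and Brier score from predicted confidences and 0/1 correctness
    (binary-correctness formulation; a simplified sketch of the paper's setup)."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece, mce = 0.0, 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap      # bin weight times calibration gap
            mce = max(mce, gap)           # worst-bin gap
    brier = float(np.mean((confidences - correct) ** 2))
    return ece, mce, brier

print(calibration_metrics([0.9, 0.8, 0.55, 0.95], [1, 1, 0, 0]))
```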

Result: Found low interpretability alignment and significant calibration mismatch even when accuracies were nearly identical between compressed and original models.

Conclusion: Compressed models are not trust-equivalent to large models, requiring careful assessment beyond performance parity before deployment as replacements.

Abstract: Large Deep Learning models are often compressed before being deployed in a resource-constrained environment. Can we trust the prediction of compressed models just as we trust the prediction of the original large model? Existing work has keenly studied the effect of compression on accuracy and related performance measures. However, performance parity does not guarantee trust-equivalence. We propose a two-dimensional framework for trust-equivalence evaluation. First, interpretability alignment measures whether the models base their predictions on the same input features. We use LIME and SHAP tests to measure the interpretability alignment. Second, calibration similarity measures whether the models exhibit comparable reliability in their predicted probabilities. It is assessed via ECE, MCE, Brier Score, and reliability diagrams. We conducted experiments using BERT-base as the large model and its multiple compressed variants. We focused on two text classification tasks: natural language inference and paraphrase identification. Our results reveal low interpretability alignment and significant mismatch in calibration similarity. It happens even when the accuracies are nearly identical between models. These findings show that compressed models are not trust-equivalent to their large counterparts. Deploying compressed models as a drop-in replacement for large models requires careful assessment, going beyond performance parity.

[12] A Comparative Study of Decoding Strategies in Medical Text Generation

Oriana Presacan, Alireza Nik, Vajira Thambawita, Bogdan Ionescu, Michael Riegler

Main category: cs.CL

TL;DR: Beam search outperforms stochastic decoding strategies in medical LLM tasks, with larger models showing better quality but longer inference times and no decoding robustness advantage.

Motivation: To investigate how different decoding strategies affect output quality in healthcare applications where accuracy is critical, as this impact remains underexplored.

Method: Evaluated 11 decoding strategies across five medical tasks (translation, summarization, QA, dialogue, image captioning) using medically specialized and general-purpose LLMs of various sizes.
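
As an illustration of two of the strategies compared, the snippet below contrasts beam search with top-k sampling via the Hugging Face generate API; gpt2 is used only as a lightweight stand-in for the medical and general-purpose models actually evaluated.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Deterministic beam search vs. stochastic top-k sampling on the same prompt.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("The patient presented with chest pain and", return_tensors="pt")

beam_ids = model.generate(**inputs, max_new_tokens=30, num_beams=5, do_sample=False)
topk_ids = model.generate(**inputs, max_new_tokens=30, do_sample=True, top_k=50)

print("beam search:", tokenizer.decode(beam_ids[0], skip_special_tokens=True))
print("top-k sampling:", tokenizer.decode(topk_ids[0], skip_special_tokens=True))
```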

Result: Deterministic strategies (especially beam search) outperformed stochastic ones; larger models achieved higher scores but with longer inference times; medical LLMs showed no overall performance advantage and greater sensitivity to decoding choices.

Conclusion: Decoding strategy selection is crucial in medical applications as its influence can sometimes exceed model choice, with beam search recommended for optimal performance.

Abstract: Large Language Models (LLMs) rely on various decoding strategies to generate text, and these choices can significantly affect output quality. In healthcare, where accuracy is critical, the impact of decoding strategies remains underexplored. We investigate this effect in five open-ended medical tasks, including translation, summarization, question answering, dialogue, and image captioning, evaluating 11 decoding strategies with medically specialized and general-purpose LLMs of different sizes. Our results show that deterministic strategies generally outperform stochastic ones: beam search achieves the highest scores, while η and top-k sampling perform worst. Slower decoding methods tend to yield better quality. Larger models achieve higher scores overall but have longer inference times and are no more robust to decoding. Surprisingly, while medical LLMs outperform general ones in two of the five tasks, statistical analysis shows no overall performance advantage and reveals greater sensitivity to decoding choice. We further compare multiple evaluation metrics and find that correlations vary by task, with MAUVE showing weak agreement with BERTScore and ROUGE, as well as greater sensitivity to the decoding strategy. These results highlight the need for careful selection of decoding methods in medical applications, as their influence can sometimes exceed that of model choice.

[13] Who Gets the Mic? Investigating Gender Bias in the Speaker Assignment of a Speech-LLM

Dariia Puhach, Amir H. Payberah, Éva Székely

Main category: cs.CL

TL;DR: Speech-LLMs like Bark show gender awareness but no systematic bias in speaker assignment for gendered prompts

Motivation: To investigate whether Speech-LLMs exhibit gender bias similar to text-based LLMs, using speaker assignment as an explicit bias indicator.

Method: Evaluated Bark TTS model’s default speaker assignments for two datasets: gender-stereotyped professions and gender-colored words with gendered connotations

Result: Bark does not show systematic gender bias but demonstrates gender awareness and some gender inclinations in speaker selection

Conclusion: Speech-LLMs have gender awareness without systematic bias, making speaker assignment a useful tool for bias investigation in speech models

Abstract: Similar to text-based Large Language Models (LLMs), Speech-LLMs exhibit emergent abilities and context awareness. However, whether these similarities extend to gender bias remains an open question. This study proposes a methodology leveraging speaker assignment as an analytic tool for bias investigation. Unlike text-based models, which encode gendered associations implicitly, Speech-LLMs must produce a gendered voice, making speaker selection an explicit bias cue. We evaluate Bark, a Text-to-Speech (TTS) model, analyzing its default speaker assignments for textual prompts. If Bark’s speaker selection systematically aligns with gendered associations, it may reveal patterns in its training data or model design. To test this, we construct two datasets: (i) Professions, containing gender-stereotyped occupations, and (ii) Gender-Colored Words, featuring gendered connotations. While Bark does not exhibit systematic bias, it demonstrates gender awareness and has some gender inclinations.

[14] AdaDocVQA: Adaptive Framework for Long Document Visual Question Answering in Low-Resource Settings

Haoxuan Li, Wei Song, Aofan Liu, Peiwu Qin

Main category: cs.CL

TL;DR: AdaDocVQA is an adaptive framework that improves Document VQA performance for long documents in low-resource settings through hybrid text retrieval, intelligent data augmentation, and adaptive ensemble inference.

Motivation: Document VQA faces challenges with long documents in low-resource environments due to context limitations and insufficient training data, particularly for languages like Japanese.

Method: Three core innovations: hybrid text retrieval architecture for document segmentation, intelligent data augmentation pipeline generating reasoning QA pairs with multi-level verification, and adaptive ensemble inference with dynamic configuration and early stopping.

Result: Achieved 83.04% accuracy on Yes/No questions, 52.66% on factual questions, 44.12% on numerical questions in JDocQA, and 59% accuracy on LAVA dataset, establishing new state-of-the-art results for Japanese document VQA.

Conclusion: The framework provides a scalable foundation for low-resource languages and specialized domains, with each component making meaningful contributions as confirmed by ablation studies.

Abstract: Document Visual Question Answering (Document VQA) faces significant challenges when processing long documents in low-resource environments due to context limitations and insufficient training data. This paper presents AdaDocVQA, a unified adaptive framework addressing these challenges through three core innovations: a hybrid text retrieval architecture for effective document segmentation, an intelligent data augmentation pipeline that automatically generates high-quality reasoning question-answer pairs with multi-level verification, and adaptive ensemble inference with dynamic configuration generation and early stopping mechanisms. Experiments on Japanese document VQA benchmarks demonstrate substantial improvements with 83.04% accuracy on Yes/No questions, 52.66% on factual questions, and 44.12% on numerical questions in JDocQA, and 59% accuracy on LAVA dataset. Ablation studies confirm meaningful contributions from each component, and our framework establishes new state-of-the-art results for Japanese document VQA while providing a scalable foundation for other low-resource languages and specialized domains. Our code available at: https://github.com/Haoxuanli-Thu/AdaDocVQA.

[15] CRISP: Persistent Concept Unlearning via Sparse Autoencoders

Tomer Ashuach, Dana Arad, Aaron Mueller, Martin Tutek, Yonatan Belinkov

Main category: cs.CL

TL;DR: CRISP is a parameter-efficient method for persistent concept unlearning in LLMs using sparse autoencoders to identify and suppress harmful knowledge features while preserving model utility.

Motivation: Existing SAE-based unlearning methods operate at inference time and don't create persistent parameter changes, making them vulnerable to bypassing by malicious actors with parameter access.

Method: CRISP automatically identifies salient SAE features across multiple layers and suppresses their activations to achieve persistent concept unlearning.
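
The toy sketch below shows the basic mechanics of suppressing selected SAE features in a hidden state; note that CRISP folds this effect into persistent parameter updates, which this inference-time toy does not capture, and all shapes and weights here are invented.

```python
import torch

def suppress_sae_features(hidden, enc, dec, bias, feature_ids, scale=0.0):
    """Generic SAE feature suppression (not CRISP's exact procedure): encode a
    hidden state into sparse features, damp the selected ones, decode back.

    hidden: (d,) residual-stream activation
    enc: (n_features, d) SAE encoder weights; dec: (d, n_features) decoder weights
    """
    acts = torch.relu(enc @ hidden + bias)   # sparse feature activations
    acts[feature_ids] *= scale               # suppress the targeted features
    return dec @ acts                        # reconstructed hidden state

# Toy dimensions; real SAEs have thousands of features per layer.
d, n = 16, 64
hidden, enc, dec, bias = torch.randn(d), torch.randn(n, d), torch.randn(d, n), torch.zeros(n)
edited = suppress_sae_features(hidden, enc, dec, bias, feature_ids=[3, 17])
print(edited.shape)  # torch.Size([16])
```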

Result: Outperforms prior approaches on safety-critical unlearning tasks from WMDP benchmark, successfully removing harmful knowledge while preserving general and in-domain capabilities.

Conclusion: Feature-level analysis shows CRISP achieves semantically coherent separation between target and benign concepts, enabling precise suppression of harmful features.

Abstract: As large language models (LLMs) are increasingly deployed in real-world applications, the need to selectively remove unwanted knowledge while preserving model utility has become paramount. Recent work has explored sparse autoencoders (SAEs) to perform precise interventions on monosemantic features. However, most SAE-based methods operate at inference time, which does not create persistent changes in the model’s parameters. Such interventions can be bypassed or reversed by malicious actors with parameter access. We introduce CRISP, a parameter-efficient method for persistent concept unlearning using SAEs. CRISP automatically identifies salient SAE features across multiple layers and suppresses their activations. We experiment with two LLMs and show that our method outperforms prior approaches on safety-critical unlearning tasks from the WMDP benchmark, successfully removing harmful knowledge while preserving general and in-domain capabilities. Feature-level analysis reveals that CRISP achieves semantically coherent separation between target and benign concepts, allowing precise suppression of the target features.

[16] ViExam: Are Vision Language Models Better than Humans on Vietnamese Multimodal Exam Questions?

Vy Tuong Dang, An Vo, Quang Tau, Duc Dm, Daeyoung Kim

Main category: cs.CL

TL;DR: VLMs perform poorly on Vietnamese educational assessments, with state-of-the-art models achieving only 57.74% accuracy compared to the human average of 66.54%; cross-lingual prompting with English instructions actually decreases performance.

Motivation: To evaluate how well vision language models trained predominantly on English data can handle real-world cross-lingual multimodal reasoning in low-resource languages like Vietnamese, particularly in educational contexts.

Method: Created ViExam benchmark with 2,548 multimodal questions across 7 academic domains, tested state-of-the-art and open-source VLMs, evaluated cross-lingual prompting with English instructions, and assessed human-in-the-loop collaboration.

Result: VLMs significantly underperform humans (57.74% vs 66.54% average human performance), cross-lingual prompting decreases accuracy by 1 percentage point, and human-in-the-loop collaboration improves performance by 5 percentage points.

Conclusion: Current VLMs struggle with Vietnamese multimodal educational content despite English training, highlighting the need for better cross-lingual multimodal capabilities and showing that simple translation approaches don’t work effectively.

Abstract: Vision language models (VLMs) demonstrate remarkable capabilities on English multimodal tasks, but their performance on low-resource languages with genuinely multimodal educational content remains largely unexplored. In this work, we test how VLMs perform on Vietnamese educational assessments, investigating whether VLMs trained predominantly on English data can handle real-world cross-lingual multimodal reasoning. Our work presents the first comprehensive evaluation of VLM capabilities on multimodal Vietnamese exams through proposing ViExam, a benchmark containing 2,548 multimodal questions. We find that state-of-the-art VLMs achieve only 57.74% while open-source models achieve 27.70% mean accuracy across 7 academic domains, including Mathematics, Physics, Chemistry, Biology, Geography, Driving Test, and IQ Test. Most VLMs underperform average human test-takers (66.54%), with only the thinking VLM o3 (74.07%) exceeding human average performance, yet still falling substantially short of human best performance (99.60%). Cross-lingual prompting with English instructions while maintaining Vietnamese content fails to improve performance, decreasing accuracy by 1 percentage point for SOTA VLMs. Human-in-the-loop collaboration can partially improve VLM performance by 5 percentage points. Code and data are available at: https://vi-exam.github.io.

[17] Generics and Default Reasoning in Large Language Models

James Ravi Kirkpatrick, Rachel Katharine Sterken

Main category: cs.CL

TL;DR: Evaluation of 28 LLMs on defeasible reasoning with generics, showing varied performance, CoT prompting degradation, and models struggling with defeasible vs deductive inference distinctions.

Motivation: To assess LLM capabilities in handling complex exception-permitting generic generalizations central to non-monotonic logic and default reasoning, which are important for linguistics, philosophy, and cognitive science.

Method: Tested 28 large language models on 20 defeasible reasoning patterns involving generic generalizations, using different prompting styles (zero-shot, few-shot, chain-of-thought) at temperature 0.

Result: Performance varied widely across models and prompting styles. Few-shot prompting modestly improved some models, but CoT prompting caused significant performance degradation (-11.14% mean accuracy drop). Most models failed to distinguish defeasible from deductive inference or misinterpreted generics as universal statements.

Conclusion: Current LLMs show both promise and significant limitations in default reasoning, particularly in handling the complex exception-permitting nature of generic statements and distinguishing between different types of logical inference.

Abstract: This paper evaluates the capabilities of 28 large language models (LLMs) to reason with 20 defeasible reasoning patterns involving generic generalizations (e.g., ‘Birds fly’, ‘Ravens are black’) central to non-monotonic logic. Generics are of special interest to linguists, philosophers, logicians, and cognitive scientists because of their complex exception-permitting behaviour and their centrality to default reasoning, cognition, and concept acquisition. We find that while several frontier models handle many default reasoning problems well, performance varies widely across models and prompting styles. Few-shot prompting modestly improves performance for some models, but chain-of-thought (CoT) prompting often leads to serious performance degradation (mean accuracy drop -11.14%, SD 15.74% in models performing above 75% accuracy in zero-shot condition, temperature 0). Most models either struggle to distinguish between defeasible and deductive inference or misinterpret generics as universal statements. These findings underscore both the promise and limits of current LLMs for default reasoning.

[18] Prediction is not Explanation: Revisiting the Explanatory Capacity of Mapping Embeddings

Hanna Herasimchyk, Alhassan Abdelhalim, Sören Laue, Michaela Regneri

Main category: cs.CL

TL;DR: Accurate prediction of semantic features from word embeddings does not reliably indicate genuine knowledge encoding: these methods can predict even random information, with results driven by algorithmic upper bounds rather than meaningful semantic representation.

Motivation: To challenge the assumption that accurate prediction of semantic features from word embeddings implies the embeddings contain corresponding knowledge, as this is essential for improving AI interpretability.

Method: Examined common explanation methods that map word embeddings to human-interpretable semantic features (feature norms), demonstrating these methods can successfully predict even random information.
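
The probing setup being critiqued can be reproduced in a few lines: fit a linear map from embeddings to target features and report held-out R², once for feature norms and once for a random control. The arrays below are random stand-ins, so the printed scores are meaningless; only the procedure matters.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_words, emb_dim, n_features = 500, 300, 50

embeddings = rng.standard_normal((n_words, emb_dim))         # stand-in word vectors
feature_norms = rng.standard_normal((n_words, n_features))   # stand-in human feature norms
random_targets = rng.standard_normal((n_words, n_features))  # the "random information" control

# The critiqued setup: fit a linear map from embeddings to targets and report
# held-out R^2. The paper argues this score does not separate genuine feature
# knowledge from what the mapping can fit anyway.
for name, y in [("feature norms", feature_norms), ("random control", random_targets)]:
    X_tr, X_te, y_tr, y_te = train_test_split(embeddings, y, random_state=0)
    score = Ridge(alpha=1.0).fit(X_tr, y_tr).score(X_te, y_te)
    print(name, round(score, 3))
```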

Result: Prediction accuracy is predominantly determined by algorithmic upper bounds rather than meaningful semantic representation, and geometric similarity in vector spaces primarily drives the results rather than genuine semantic emergence.

Conclusion: Comparisons between datasets based solely on prediction performance are unreliable for indicating which dataset is better captured by word embeddings, as current methods reflect geometric properties rather than true semantic knowledge encoding.

Abstract: Understanding what knowledge is implicitly encoded in deep learning models is essential for improving the interpretability of AI systems. This paper examines common methods to explain the knowledge encoded in word embeddings, which are core elements of large language models (LLMs). These methods typically involve mapping embeddings onto collections of human-interpretable semantic features, known as feature norms. Prior work assumes that accurately predicting these semantic features from the word embeddings implies that the embeddings contain the corresponding knowledge. We challenge this assumption by demonstrating that prediction accuracy alone does not reliably indicate genuine feature-based interpretability. We show that these methods can successfully predict even random information, concluding that the results are predominantly determined by an algorithmic upper bound rather than meaningful semantic representation in the word embeddings. Consequently, comparisons between datasets based solely on prediction performance do not reliably indicate which dataset is better captured by the word embeddings. Our analysis illustrates that such mappings primarily reflect geometric similarity within vector spaces rather than indicating the genuine emergence of semantic properties.

[19] EEG-MedRAG: Enhancing EEG-based Clinical Decision-Making via Hierarchical Hypergraph Retrieval-Augmented Generation

Yi Wang, Haoran Luo, Lu Meng

Main category: cs.CL

TL;DR: EEG-MedRAG is a hypergraph-based framework that integrates EEG data, patient cases, and medical knowledge for improved clinical decision support through semantic-temporal retrieval and diagnostic generation.

Motivation: Efficiently retrieving and semantically interpreting large-scale, multi-source, heterogeneous EEG data is a pressing challenge in neuroscience and clinical practice, and addressing it is essential for effective clinical decision support.

Method: A three-layer hypergraph-based retrieval-augmented generation framework that unifies EEG domain knowledge, individual patient cases, and a large-scale repository into a traversable n-ary relational hypergraph, enabling joint semantic-temporal retrieval and causal-chain diagnostic generation.

Result: EEG-MedRAG significantly outperforms TimeRAG and HyperGraphRAG in answer accuracy and retrieval performance. The framework demonstrates strong potential for real-world clinical decision support.

Conclusion: The proposed EEG-MedRAG framework effectively addresses the challenges of EEG data retrieval and interpretation, showing superior performance over existing methods and offering promising applications for clinical neuroscience practice.

Abstract: With the widespread application of electroencephalography (EEG) in neuroscience and clinical practice, efficiently retrieving and semantically interpreting large-scale, multi-source, heterogeneous EEG data has become a pressing challenge. We propose EEG-MedRAG, a three-layer hypergraph-based retrieval-augmented generation framework that unifies EEG domain knowledge, individual patient cases, and a large-scale repository into a traversable n-ary relational hypergraph, enabling joint semantic-temporal retrieval and causal-chain diagnostic generation. Concurrently, we introduce the first cross-disease, cross-role EEG clinical QA benchmark, spanning seven disorders and five authentic clinical perspectives. This benchmark allows systematic evaluation of disease-agnostic generalization and role-aware contextual understanding. Experiments show that EEG-MedRAG significantly outperforms TimeRAG and HyperGraphRAG in answer accuracy and retrieval, highlighting its strong potential for real-world clinical decision support. Our data and code are publicly available at https://github.com/yi9206413-boop/EEG-MedRAG.

[20] Sycophancy under Pressure: Evaluating and Mitigating Sycophantic Bias via Adversarial Dialogues in Scientific QA

Kaiwei Zhang, Qi Jia, Zijian Chen, Wei Sun, Xiangyang Zhu, Chunyi Li, Dandan Zhu, Guangtao Zhai

Main category: cs.CL

TL;DR: LLMs exhibit sycophancy - aligning with user beliefs regardless of truth. The paper introduces an evaluation framework to measure this in scientific QA and proposes Pressure-Tune, a fine-tuning method that improves factual consistency without compromising accuracy.

Motivation: Sycophancy in LLMs poses serious risks in high-stakes scientific settings where factual accuracy is crucial, but this phenomenon remains underexamined in factual QA contexts despite preference-based alignment techniques reinforcing this behavior.

Method: Developed a unified evaluation framework with adversarial prompting and metrics (misleading resistance, sycophancy resistance). Proposed Pressure-Tune - lightweight post-training fine-tuning on synthetic adversarial dialogues with chain-of-thought rationales that reject misinformation.

Result: Systematic evaluations revealed pervasive sycophantic tendencies across models, driven more by alignment strategy than model size. Pressure-Tune significantly enhanced sycophancy resistance without compromising accuracy or responsiveness to valid feedback.

Conclusion: Pressure-Tune offers a practical pathway toward more truthful and principled model behavior in scientific QA, effectively mitigating sycophancy while maintaining model responsiveness and accuracy.

Abstract: Large language models (LLMs), while increasingly used in domains requiring factual rigor, often display a troubling behavior: sycophancy, the tendency to align with user beliefs regardless of correctness. This tendency is reinforced by preference-based alignment techniques that optimize for user satisfaction but can undermine truthfulness. While relatively benign in casual dialogue, sycophancy poses serious risks in high-stakes settings such as scientific question answering (QA), where model outputs may shape collaborative reasoning, decision-making, and knowledge formation. Despite its importance, this phenomenon remains underexamined in factual QA contexts. We address this gap by introducing a unified evaluation framework to quantify the impact of sycophantic context on model behavior in scientific QA, measuring how much user-imposed social pressure distorts model outputs. The framework incorporates adversarial prompting setups and targeted metrics, such as misleading resistance and sycophancy resistance, that capture a model’s ability to maintain factual consistency under misleading cues. Systematic evaluations across open-source and proprietary models reveal pervasive sycophantic tendencies, driven more by alignment strategy than by model size. To mitigate this issue, we propose Pressure-Tune, a lightweight post-training method that fine-tunes models on synthetic adversarial dialogues paired with chain-of-thought rationales. These rationales reject user misinformation while reinforcing factual commitments. Experiments on challenging scientific QA benchmarks show that Pressure-Tune significantly enhances sycophancy resistance without compromising accuracy or responsiveness to valid feedback, offering a practical pathway toward more truthful and principled model behavior.

[21] MGT-Prism: Enhancing Domain Generalization for Machine-Generated Text Detection via Spectral Alignment

Shengchao Liu, Xiaoming Liu, Chengzhengxu Li, Zhaohan Zhang, Guoxin Ma, Yu Lan, Shuai Xiao

Main category: cs.CL

TL;DR: MGT-Prism is a machine-generated text detection method that uses frequency domain analysis to improve domain generalization, outperforming state-of-the-art baselines by ~0.9% on accuracy and F1 score across 11 test datasets.

Motivation: Current machine-generated text detectors perform well within the same domain but generalize poorly to unseen domains due to domain shift between different data sources.

Method: Proposes MGT-Prism with frequency domain analysis, using low frequency domain filtering to remove domain-sensitive features and dynamic spectrum alignment to extract domain-invariant features.
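
A toy version of the frequency-domain idea, dropping the lowest sequence frequencies of token features, is sketched below; it illustrates the intuition only and is not the actual MGT-Prism filtering module.

```python
import numpy as np

def low_frequency_filter(features, cutoff=4):
    """Toy illustration of the frequency-domain view: remove the lowest
    sequence-frequency components of token features (the document-level,
    domain-sensitive part, per the paper) and keep the rest.

    features: (seq_len, dim) token representations
    """
    spectrum = np.fft.rfft(features, axis=0)   # FFT over the token dimension
    spectrum[:cutoff] = 0                      # drop the lowest frequencies
    return np.fft.irfft(spectrum, n=features.shape[0], axis=0)

filtered = low_frequency_filter(np.random.randn(128, 32))
print(filtered.shape)  # (128, 32)
```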

Result: Outperforms state-of-the-art baselines by an average of 0.90% in accuracy and 0.92% in F1 score on 11 test datasets across three domain-generalization scenarios.

Conclusion: Frequency domain analysis reveals consistent spectral patterns across domains and significant magnitude discrepancies between machine-generated and human-written texts, enabling better domain generalization for detection.

Abstract: Large Language Models have shown growing ability to generate fluent and coherent texts that are highly similar to the writing style of humans. Current detectors for Machine-Generated Text (MGT) perform well when they are trained and tested in the same domain but generalize poorly to unseen domains, due to domain shift between data from different sources. In this work, we propose MGT-Prism, an MGT detection method from the perspective of the frequency domain for better domain generalization. Our key insight stems from analyzing text representations in the frequency domain, where we observe consistent spectral patterns across diverse domains, while significant discrepancies in magnitude emerge between MGT and human-written texts (HWTs). The observation initiates the design of a low frequency domain filtering module for filtering out the document-level features that are sensitive to domain shift, and a dynamic spectrum alignment strategy to extract the task-specific and domain-invariant features for improving the detector’s performance in domain generalization. Extensive experiments demonstrate that MGT-Prism outperforms state-of-the-art baselines by an average of 0.90% in accuracy and 0.92% in F1 score on 11 test datasets across three domain-generalization scenarios.

[22] Can Large Language Models (LLMs) Describe Pictures Like Children? A Comparative Corpus Study

Hanna Woloszyn, Benjamin Gagl

Main category: cs.CL

TL;DR: LLMs fail to accurately replicate child language patterns, producing longer but less lexically rich texts with different semantic structures compared to real German children’s descriptions.

Motivation: To evaluate whether LLM-generated text resembles authentic child language, which is important for judging its appropriateness in educational tools and psycholinguistic research.

Method: Generated two LLM corpora using picture stories with zero-shot and few-shot prompts, then compared them to real German children’s descriptions across psycholinguistic properties including word frequency, lexical richness, sentence length, POS tags, and semantic similarity.
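
A few of the surface measures used in the comparison (type-token ratio, sentence length, word length) can be computed as in the sketch below; the tokenization and the two toy sentences are illustrative assumptions.

```python
import re

def corpus_profile(texts):
    """A few of the psycholinguistic surface measures used in the comparison:
    type-token ratio (lexical richness), mean sentence length, mean word length."""
    tokens, sentence_lengths = [], []
    for text in texts:
        for sent in re.split(r"[.!?]+", text):
            words = re.findall(r"\w+", sent.lower())
            if words:
                sentence_lengths.append(len(words))
                tokens.extend(words)
    return {
        "type_token_ratio": len(set(tokens)) / len(tokens),
        "mean_sentence_len": sum(sentence_lengths) / len(sentence_lengths),
        "mean_word_len": sum(map(len, tokens)) / len(tokens),
    }

# Toy comparison of two tiny corpora (stand-ins for the child and LLM corpora).
print(corpus_profile(["Der Hund rennt. Er ist schnell."]))
print(corpus_profile(["Der kleine Hund rennt sehr schnell durch den großen Garten."]))
```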

Result: LLM texts were longer but less lexically rich, used more high-frequency words, under-represented nouns, and showed low semantic similarity. Few-shot prompts slightly improved similarities but still failed to replicate lexical and semantic patterns.

Conclusion: LLMs cannot accurately approximate child language through current prompting methods, raising concerns about their use in child-directed educational tools while providing insights for psycholinguistic research.

Abstract: The role of large language models (LLMs) in education is increasing, yet little attention has been paid to whether LLM-generated text resembles child language. This study evaluates how LLMs replicate child-like language by comparing LLM-generated texts to a collection of German children’s descriptions of picture stories. We generated two LLM-based corpora using the same picture stories and two prompt types: zero-shot and few-shot prompts specifying a general age from the children’s corpus. We conducted a comparative analysis across psycholinguistic text properties, including word frequency, lexical richness, sentence and word length, part-of-speech tags, and semantic similarity with word embeddings. The results show that LLM-generated texts are longer but less lexically rich, rely more on high-frequency words, and under-represent nouns. Semantic vector space analysis revealed low similarity, highlighting differences between the two corpora on the level of corpus semantics. Few-shot prompting increased similarities between the children’s and LLM-generated texts to a minor extent, but still failed to replicate lexical and semantic patterns. The findings contribute to our understanding of how LLMs approximate child language through multimodal prompting (text + image) and give insights into their use in psycholinguistic research and education while raising important questions about the appropriateness of LLM-generated language in child-directed educational tools.
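
An illustrative sketch (not the study's pipeline) of a few of the psycholinguistic properties compared above: type-token ratio as a proxy for lexical richness, mean sentence length, and the share of high-frequency words. The toy German sentences and the high-frequency word list are invented placeholders.

```python
# Toy computation of corpus-level properties of the kind compared in the paper.
from statistics import mean

def corpus_stats(sentences, high_freq_words):
    tokens = [t.lower() for s in sentences for t in s.split()]
    return {
        "type_token_ratio": len(set(tokens)) / len(tokens),          # lexical richness
        "mean_sentence_length": mean(len(s.split()) for s in sentences),
        "high_freq_share": sum(t in high_freq_words for t in tokens) / len(tokens),
    }

child_like = ["Der Hund rennt .", "Er ist froh ."]
llm_like = ["Der kleine Hund rennt schnell über die grüne Wiese ."]
common = {"der", "ist", "er", "die", "."}
print(corpus_stats(child_like, common))
print(corpus_stats(llm_like, common))
```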

[23] TracSum: A New Benchmark for Aspect-Based Summarization with Sentence-Level Traceability in Medical Domain

Bohao Chu, Meijie Li, Sameh Frihat, Chengyu Gu, Georg Lodde, Elisabeth Livingstone, Norbert Fuhr

Main category: cs.CL

TL;DR: TracSum is a new benchmark for traceable, aspect-based medical document summarization with sentence-level citations to verify factual accuracy.

DetailsMotivation: Address concerns about factual accuracy in LLM-generated medical summaries by enabling users to trace evidence back to original sources through citations.

Method: Created benchmark with 500 annotated medical abstracts (3.5K summary-citation pairs), proposed evaluation framework with 4 metrics, and developed Track-Then-Sum pipeline baseline.

Result: TracSum effectively benchmarks traceable summarization; explicit sentence-level tracking improves accuracy, and full context improves completeness.

Conclusion: TracSum serves as an effective benchmark for traceable aspect-based summarization, with evidence tracing enhancing factual accuracy in medical document summarization.

Abstract: While document summarization with LLMs has enhanced access to textual information, concerns about the factual accuracy of these summaries persist, especially in the medical domain. Tracing evidence from which summaries are derived enables users to assess their accuracy, thereby alleviating this concern. In this paper, we introduce TracSum, a novel benchmark for traceable, aspect-based summarization, in which generated summaries are paired with sentence-level citations, enabling users to trace back to the original context. First, we annotate 500 medical abstracts for seven key medical aspects, yielding 3.5K summary-citation pairs. We then propose a fine-grained evaluation framework for this new task, designed to assess the completeness and consistency of generated content using four metrics. Finally, we introduce a summarization pipeline, Track-Then-Sum, which serves as a baseline method for comparison. In experiments, we evaluate both this baseline and a set of LLMs on TracSum, and conduct a human evaluation to assess the evaluation results. The findings demonstrate that TracSum can serve as an effective benchmark for traceable, aspect-based summarization tasks. We also observe that explicitly performing sentence-level tracking prior to summarization enhances generation accuracy, while incorporating the full context further improves completeness.

[24] Beyond Human Judgment: A Bayesian Evaluation of LLMs’ Moral Values Understanding

Maciej Skorski, Alina Landowska

Main category: cs.CL

TL;DR: Large language models rank among top 25% of human annotators in moral understanding, with better-than-average accuracy and fewer false negatives than humans.

DetailsMotivation: To evaluate how large language models understand moral dimensions compared to humans using Bayesian methods that capture human disagreement and model uncertainty.

Method: Used GPU-optimized Bayesian framework to process 1M+ model queries across 250K+ annotations from ~700 annotators on 100K+ texts from social media, news, and forums, evaluating top models (Claude Sonnet 4, DeepSeek-V3, Llama 4 Maverick).

Result: AI models typically rank among top 25% of human annotators with much better-than-average balanced accuracy, and produce far fewer false negatives than humans.

Conclusion: Large language models demonstrate superior moral detection capabilities compared to average human performance, with more sensitive detection and fewer missed moral issues.

Abstract: How do large language models understand moral dimensions compared to humans? This first large-scale Bayesian evaluation of market-leading language models provides the answer. In contrast to prior work using deterministic ground truth (majority or inclusion rules), we model annotator disagreements to capture both aleatoric uncertainty (inherent human disagreement) and epistemic uncertainty (model domain sensitivity). We evaluate top language models (Claude Sonnet 4, DeepSeek-V3, Llama 4 Maverick) across 250K+ annotations from ~700 annotators on 100K+ texts spanning social media, news, and forums. Our GPU-optimized Bayesian framework processed 1M+ model queries, revealing that AI models typically rank among the top 25% of human annotators, achieving much better-than-average balanced accuracy. Importantly, we find that AI produces far fewer false negatives than humans, highlighting their more sensitive moral detection capabilities.
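
To make the "ranks among the top 25% of human annotators" framing concrete, here is a toy sketch of the ranking logic only: compute balanced accuracy for the model and for each annotator against reference labels and report the model's percentile. The paper's actual framework is Bayesian and models annotator disagreement; the simulated labels below are placeholders.

```python
# Illustrative ranking of a model among human annotators on balanced accuracy.
import numpy as np
from sklearn.metrics import balanced_accuracy_score

rng = np.random.default_rng(0)
truth = rng.integers(0, 2, size=500)                                # toy reference labels
annotators = [np.where(rng.random(500) < 0.80, truth, 1 - truth) for _ in range(20)]
model_pred = np.where(rng.random(500) < 0.85, truth, 1 - truth)     # slightly better model

human_scores = [balanced_accuracy_score(truth, a) for a in annotators]
model_score = balanced_accuracy_score(truth, model_pred)
percentile = np.mean([model_score >= h for h in human_scores]) * 100
print(f"model balanced accuracy {model_score:.3f}, above {percentile:.0f}% of annotators")
```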

[25] Prompt-Based One-Shot Exact Length-Controlled Generation with LLMs

Juncheng Xie, Hung-yi Lee

Main category: cs.CL

TL;DR: A prompt-based one-shot strategy that uses countdown markers and explicit counting rules to make LLMs generate exactly the desired number of tokens without fine-tuning or iterative sampling.

DetailsMotivation: LLMs struggle with precise length control, frequently overshooting or undershooting explicit length instructions because they cannot reliably maintain internal token counts.

Method: Appends countdown markers and explicit counting rules to prompts, forcing the model to “write while counting” in a one-shot approach without fine-tuning or iterative sampling.

Result: On MT-Bench-LI, strict length compliance with GPT-4.1 increased from below 30% to above 95%, surpassing draft-then-revise baseline while preserving answer quality. Effective across multiple settings including open-ended generation and summarization.

Conclusion: Precise length control can be achieved through prompt engineering alone, offering a lightweight alternative to training- or decoding-based methods.

Abstract: Controlling the length of text produced by large language models (LLMs) remains challenging: models frequently overshoot or undershoot explicit length instructions because they cannot reliably keep an internal token count. We present a prompt-based, one-shot strategy that compels an off-the-shelf LLM to generate exactly a desired number of tokens - words (English) or characters (Chinese) - without any fine-tuning or iterative sampling. The prompt appends countdown markers and explicit counting rules so that the model “writes while counting.” We evaluate on four settings: open-ended generation (1-1000 tokens), XSUM summarization, MT-Bench-LI instruction following, and the LIFEBENCH equal-length track. On MT-Bench-LI, strict length compliance with GPT-4.1 leaps from below 30% under naive prompts to above 95% with our countdown prompt, surpassing the popular draft-then-revise baseline, while judged answer quality is preserved. These results show that precise length control can be achieved through prompt engineering alone, offering a lightweight alternative to training- or decoding-based methods.
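
A hedged sketch of a countdown-style prompt in the spirit of the paper: the exact marker format and wording used by the authors may differ; this only illustrates the idea of making the model "write while counting" down to zero.

```python
# Build a countdown prompt that asks the model to emit a remaining-word marker
# after every word, ending at [0]. Prompt wording is an assumption.
def countdown_prompt(question: str, n_words: int) -> str:
    rules = (
        f"Answer in exactly {n_words} words. After every word, append a countdown "
        f"marker [k] giving the number of words still remaining, starting at "
        f"[{n_words - 1}] and ending at [0]. Stop immediately after [0]."
    )
    return f"{rules}\n\nQuestion: {question}\nAnswer:"

print(countdown_prompt("Why is the sky blue?", 12))
```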

[26] The illusion of a perfect metric: Why evaluating AI’s words is harder than it looks

Maria Paz Oliva, Adriana Correia, Ivan Vankov, Viktor Botev

Main category: cs.CL

TL;DR: Current automatic evaluation metrics for NLG are inadequate as no single metric reliably approximates human judgment across tasks, with challenges persisting even in newer LLM-based evaluators and RAG evaluation.

DetailsMotivation: Natural Language Generation evaluation is crucial for AI adoption but remains challenging. Human evaluation is expensive and non-scalable, while existing automatic metrics lack consistency and comprehensive validation, creating a need for systematic analysis of current approaches.

Method: Conducted thorough examination of existing automatic evaluation metrics, analyzing their methodologies, documented strengths/limitations, validation methods, and correlations with human judgment across different tasks and datasets.

Result: Identified key challenges: metrics capture only specific text quality aspects, effectiveness varies by task/dataset, validation practices are unstructured, and correlations with human judgment are inconsistent. These issues persist in LLM-as-a-Judge metrics and RAG evaluation.

Conclusion: The quest for a ‘perfect metric’ is misguided. Instead, researchers should select metrics based on task-specific needs, use complementary evaluations, and focus new metric development on enhanced validation methodologies rather than universal solutions.

Abstract: Evaluating Natural Language Generation (NLG) is crucial for the practical adoption of AI, but has been a longstanding research challenge. While human evaluation is considered the de-facto standard, it is expensive and lacks scalability. Practical applications have driven the development of various automatic evaluation metrics (AEM), designed to compare the model output with human-written references, generating a score which approximates human judgment. Over time, AEMs have evolved from simple lexical comparisons, to semantic similarity models and, more recently, to LLM-based evaluators. However, it seems that no single metric has emerged as a definitive solution, resulting in studies using different ones without fully considering the implications. This paper aims to show this by conducting a thorough examination of the methodologies of existing metrics, their documented strengths and limitations, validation methods, and correlations with human judgment. We identify several key challenges: metrics often capture only specific aspects of text quality, their effectiveness varies by task and dataset, validation practices remain unstructured, and correlations with human judgment are inconsistent. Importantly, we find that these challenges persist in the most recent type of metric, LLM-as-a-Judge, as well as in the evaluation of Retrieval Augmented Generation (RAG), an increasingly relevant task in academia and industry. Our findings challenge the quest for the ‘perfect metric’. We propose selecting metrics based on task-specific needs and leveraging complementary evaluations and advocate that new metrics should focus on enhanced validation methodologies.

[27] Extracting Structured Requirements from Unstructured Building Technical Specifications for Building Information Modeling

Insaf Nahri, Romain Pinquié, Philippe Véron, Nicolas Bus, Mathieu Thorel

Main category: cs.CL

TL;DR: Integration of BIM and NLP to automate requirements extraction from French construction documents using NER and RE techniques with transformer models achieving high F1-scores.

DetailsMotivation: To automate the extraction of requirements from unstructured French Building Technical Specification documents in the construction industry, improving efficiency and accuracy.

Method: Used Named Entity Recognition (NER) with CamemBERT and Fr_core_news_lg transformer models, and Relation Extraction (RE) with Random Forest and other supervised models using custom feature vectors on a hand-crafted annotated dataset.

Result: CamemBERT and Fr_core_news_lg achieved F1-scores over 90% in NER, while Random Forest achieved F1 score above 80% in RE.

Conclusion: The study successfully demonstrates effective automation of requirements extraction from French construction documents, with plans to represent outcomes as a knowledge graph for enhanced automatic verification systems.

Abstract: This study explores the integration of Building Information Modeling (BIM) with Natural Language Processing (NLP) to automate the extraction of requirements from unstructured French Building Technical Specification (BTS) documents within the construction industry. Employing Named Entity Recognition (NER) and Relation Extraction (RE) techniques, the study leverages the transformer-based model CamemBERT and applies transfer learning with the French language model Fr_core_news_lg, both pre-trained on a large French corpus in the general domain. To benchmark these models, additional approaches ranging from rule-based to deep learning-based methods are developed. For RE, four different supervised models, including Random Forest, are implemented using a custom feature vector. A hand-crafted annotated dataset is used to compare the effectiveness of NER approaches and RE models. Results indicate that CamemBERT and Fr_core_news_lg exhibited superior performance in NER, achieving F1-scores over 90%, while Random Forest proved most effective in RE, with an F1 score above 80%. The outcomes are intended to be represented as a knowledge graph in future work to further enhance automatic verification systems.
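
As background for the NER component above, a minimal sketch of the off-the-shelf French pipeline the study builds on (spaCy's fr_core_news_lg). The paper adapts models to domain-specific BTS entities; this only shows the generic pre-trained model on a sample requirement sentence, and assumes the model has been downloaded locally.

```python
# Requires: pip install spacy && python -m spacy download fr_core_news_lg
import spacy

nlp = spacy.load("fr_core_news_lg")
doc = nlp("Les portes coupe-feu doivent respecter la norme NF EN 1634-1.")
for ent in doc.ents:
    print(ent.text, ent.label_)   # generic entities; the study trains domain-specific ones
```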

[28] MME-SCI: A Comprehensive and Challenging Science Benchmark for Multimodal Large Language Models

Jiacheng Ruan, Dan Jiang, Xian Gao, Ting Liu, Yuzhuo Fu, Yangyang Kang

Main category: cs.CL

TL;DR: MME-SCI is a comprehensive multilingual multimodal benchmark for evaluating scientific reasoning in MLLMs, addressing gaps in multilingual evaluation, modality coverage, and fine-grained knowledge assessment across 4 subjects and 5 languages.

DetailsMotivation: Existing scientific benchmarks for MLLMs lack proper evaluation of multilingual reasoning abilities, comprehensive modality coverage, and fine-grained scientific knowledge annotation, creating significant gaps in model assessment.

Method: Collected 1,019 high-quality question-answer pairs covering mathematics, physics, chemistry, and biology across 5 languages (Chinese, English, French, Spanish, Japanese) with 3 distinct evaluation modes and fine-grained knowledge point annotations.

Result: Extensive testing on 20 models (16 open-source, 4 closed-source) revealed significant challenges - o4-mini achieved only 52.11% (math), 24.73% (physics), 36.57% (chemistry), and 29.80% (biology) accuracy in image-only mode, demonstrating much higher difficulty than existing benchmarks.

Conclusion: MME-SCI provides a more challenging and comprehensive evaluation framework that reveals specific weaknesses in current MLLMs, particularly in multilingual scientific reasoning and modality handling, enabling better model analysis and improvement.

Abstract: Recently, multimodal large language models (MLLMs) have achieved significant advancements across various domains, and corresponding evaluation benchmarks have been continuously refined and improved. In this process, benchmarks in the scientific domain have played an important role in assessing the reasoning capabilities of MLLMs. However, existing benchmarks still face three key challenges: 1) Insufficient evaluation of models’ reasoning abilities in multilingual scenarios; 2) Inadequate assessment of MLLMs’ comprehensive modality coverage; 3) Lack of fine-grained annotation of scientific knowledge points. To address these gaps, we propose MME-SCI, a comprehensive and challenging benchmark. We carefully collected 1,019 high-quality question-answer pairs, which involve 3 distinct evaluation modes. These pairs cover four subjects, namely mathematics, physics, chemistry, and biology, and support five languages: Chinese, English, French, Spanish, and Japanese. We conducted extensive experiments on 16 open-source models and 4 closed-source models, and the results demonstrate that MME-SCI is widely challenging for existing MLLMs. For instance, under the Image-only evaluation mode, o4-mini achieved accuracy of only 52.11%, 24.73%, 36.57%, and 29.80% in mathematics, physics, chemistry, and biology, respectively, indicating a significantly higher difficulty level compared to existing benchmarks. More importantly, using MME-SCI’s multilingual and fine-grained knowledge attributes, we analyzed existing models’ performance in depth and identified their weaknesses in specific domains. The Data and Evaluation Code are available at https://github.com/JCruan519/MME-SCI.

[29] ReviewGraph: A Knowledge Graph Embedding Based Framework for Review Rating Prediction with Sentiment Features

A. J. W. de Vink, Natalia Amat-Lefort, Lifeng Han

Main category: cs.CL

TL;DR: ReviewGraph is a novel framework that transforms customer reviews into knowledge graphs with sentiment scores, using graph embeddings and machine learning to predict review ratings with performance comparable to LLMs but lower computational cost.

DetailsMotivation: Understanding factors driving customer review ratings is critical for improving guest satisfaction and business performance in the hospitality industry.

Method: Transforms textual reviews into knowledge graphs by extracting (subject, predicate, object) triples with sentiment scores, uses Node2Vec graph embeddings and sentiment features with machine learning classifiers for rating prediction.

Result: Performs similar to state-of-the-art models with lower computational cost, achieves comparable performance to LLMs, outperforms traditional NLP baselines on agreement metrics like Cohen’s Kappa, and offers better interpretability and visual exploration.

Conclusion: Graph-based representations show strong potential for review analytics, providing groundwork for future integration of advanced graph neural networks and fine-tuned LLM extraction methods.

Abstract: In the hospitality industry, understanding the factors that drive customer review ratings is critical for improving guest satisfaction and business performance. This work proposes ReviewGraph for Review Rating Prediction (RRP), a novel framework that transforms textual customer reviews into knowledge graphs by extracting (subject, predicate, object) triples and associating sentiment scores. Using graph embeddings (Node2Vec) and sentiment features, the framework predicts review rating scores through machine learning classifiers. We compare ReviewGraph performance with traditional NLP baselines (such as Bag of Words, TF-IDF, and Word2Vec) and large language models (LLMs), evaluating them on the HotelRec dataset. In comparison to the state-of-the-art literature, our proposed model performs similarly to the best performing model but with lower computational cost (without ensemble). While ReviewGraph achieves comparable predictive performance to LLMs and outperforms baselines on agreement-based metrics such as Cohen’s Kappa, it offers additional advantages in interpretability, visual exploration, and potential integration into Retrieval-Augmented Generation (RAG) systems. This work highlights the potential of graph-based representations for enhancing review analytics and lays the groundwork for future research integrating advanced graph neural networks and fine-tuned LLM-based extraction methods. We will share the ReviewGraph outputs and platform as open source on our GitHub page: https://github.com/aaronlifenghan/ReviewGraph
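
An illustrative end-to-end sketch of the pipeline shape: reviews with (subject, predicate, object) triples and sentiment scores are assembled into a graph, per-review features are derived, and a Random Forest predicts the rating. The paper uses Node2Vec embeddings; to keep this self-contained, simple structural features (node degrees) stand in for them, and all data below is invented.

```python
import networkx as nx
import numpy as np
from sklearn.ensemble import RandomForestClassifier

reviews = [
    {"triples": [("room", "was", "clean"), ("staff", "were", "friendly")], "sentiment": 0.8, "rating": 5},
    {"triples": [("room", "was", "dirty"), ("wifi", "kept", "dropping")], "sentiment": -0.6, "rating": 2},
    {"triples": [("breakfast", "was", "average")], "sentiment": 0.1, "rating": 3},
]

# Build a small knowledge graph from the extracted triples
graph = nx.DiGraph()
for r in reviews:
    for s, p, o in r["triples"]:
        graph.add_edge(s, o, predicate=p)

def features(review):
    # Stand-in for graph embeddings: average degree of the entities mentioned
    degs = [graph.degree(s) + graph.degree(o) for s, _, o in review["triples"]]
    return [np.mean(degs), len(review["triples"]), review["sentiment"]]

X = np.array([features(r) for r in reviews])
y = np.array([r["rating"] for r in reviews])
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print(clf.predict(X))
```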

[30] Chunks as Arms: Multi-Armed Bandit-Guided Sampling for Long-Context LLM Preference Optimization

Shaohua Duan, Xinze Li, Zhenghao Liu, Xiaoyuan Yi, Yukun Yan, Shuo Wang, Yu Gu, Ge Yu, Maosong Sun

Main category: cs.CL

TL;DR: LongMab-PO uses Multi-Armed Bandit strategy to select informative context chunks for generating high-quality diverse responses, then applies DPO training to improve long-context LLM performance.

DetailsMotivation: Existing fine-tuning approaches for long-context LLMs suffer from low diversity and factual inconsistencies in synthetic data, limiting their effectiveness in real-world long-context tasks.

Method: Proposes a framework that treats context chunks as MAB arms, selects chunks based on reward scores to generate responses, iteratively updates scores based on feedback, and applies DPO training on collected high-quality responses.

Result: Significantly improves diversity and quality of preference data pairs, achieving state-of-the-art performance on long-context reasoning benchmarks.

Conclusion: The MAB rollout strategy effectively identifies informative context segments for generating diverse, high-quality responses, enabling superior long-context modeling through DPO optimization.

Abstract: Long-context modeling is critical for a wide range of real-world tasks, including long-context question answering, summarization, and complex reasoning tasks. Recent studies have explored fine-tuning Large Language Models (LLMs) with synthetic data to enhance their long-context capabilities. However, the effectiveness of such approaches is often limited by the low diversity and factual inconsistencies in the generated data. To address these challenges, we propose LongMab-PO, a novel framework that leverages a Multi-Armed Bandit (MAB) rollout strategy to identify the most informative chunks from the given long context for sampling high-quality and diverse responses and constructing preference data pairs for Direct Preference Optimization (DPO) training. Specifically, we treat context chunks as arms of MAB, select chunks based on their expected reward scores to input into LLMs to generate responses, and iteratively update these scores based on reward feedback. This exploration and exploitation process enables the model to focus on the most relevant context segments, thereby generating and collecting high-quality and diverse responses. Finally, we collect these generated responses from the rollout process and apply the DPO method to further optimize the LLM. Experimental results show that LongMab-PO significantly improves the diversity and quality of preference data pairs, achieving state-of-the-art performance on long-context reasoning benchmarks. All code and data will be released on https://github.com/NEUIR/LongMab-PO.
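
A hedged sketch of the chunks-as-arms idea: a standard UCB1 bandit over context chunks, where the reward stands in for the quality of the response generated when a chunk is fed to the LLM. The reward function below is simulated; the paper's actual reward comes from model feedback during rollouts.

```python
import math
import random

def ucb_select_chunks(chunks, rounds=50, c=1.4, reward_for=None):
    counts = [0] * len(chunks)
    values = [0.0] * len(chunks)
    for t in range(1, rounds + 1):
        # pick the arm (chunk) with the highest upper confidence bound
        ucb = [
            float("inf") if counts[i] == 0
            else values[i] + c * math.sqrt(math.log(t) / counts[i])
            for i in range(len(chunks))
        ]
        arm = ucb.index(max(ucb))
        reward = reward_for(chunks[arm])
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]   # running mean of rewards
    return sorted(range(len(chunks)), key=lambda i: values[i], reverse=True)

chunks = [f"chunk_{i}" for i in range(8)]
hidden_quality = {ch: random.random() for ch in chunks}        # simulated reward signal
ranking = ucb_select_chunks(
    chunks, reward_for=lambda ch: hidden_quality[ch] + random.gauss(0, 0.1)
)
print("most informative chunks first:", ranking)
```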

[31] Ask Good Questions for Large Language Models

Qi Wu, Zhongqi Lu

Main category: cs.CL

TL;DR: The paper introduces AGQ framework with CEIRT model to improve dialog systems by better identifying user knowledge levels and generating guiding questions, outperforming baseline methods.

DetailsMotivation: Current LLM-based dialog systems fail to provide accurate topic guidance due to inability to discern user confusion in related concepts.

Method: Propose Ask-Good-Question (AGQ) framework with improved Concept-Enhanced Item Response Theory (CEIRT) model to identify user knowledge levels and generate guiding questions using LLMs.

Result: Outperforms baseline methods by significantly enhancing users’ information retrieval experiences and improving information retrieval efficiency.

Conclusion: The AGQ framework with CEIRT model effectively addresses limitations in current dialog systems by better understanding user confusion and providing more accurate guidance.

Abstract: Recent advances in large language models (LLMs) have significantly improved the performance of dialog systems, yet current approaches often fail to provide accurate topic guidance due to their inability to discern user confusion in related concepts. To address this, we introduce the Ask-Good-Question (AGQ) framework, which features an improved Concept-Enhanced Item Response Theory (CEIRT) model to better identify users’ knowledge levels. Our contributions include applying the CEIRT model along with LLMs to directly generate guiding questions based on the inspiring text, greatly improving information retrieval efficiency during the question & answer process. Through comparisons with other baseline methods, our approach outperforms them by significantly enhancing the users’ information retrieval experiences.
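
For background on the Item Response Theory model that CEIRT extends, here is a minimal sketch of the standard two-parameter logistic (2PL) response model and a crude grid search for a user's ability. The concept-enhanced part of CEIRT is not reproduced here; item parameters and responses are toy values.

```python
import math

def p_correct(theta, a, b):
    """2PL IRT: probability of a correct response given ability theta,
    item discrimination a, and item difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def estimate_ability(items, responses, grid=None):
    grid = grid or [x / 10 for x in range(-40, 41)]          # theta in [-4, 4]
    def loglik(theta):
        return sum(
            math.log(p_correct(theta, a, b)) if r else math.log(1 - p_correct(theta, a, b))
            for (a, b), r in zip(items, responses)
        )
    return max(grid, key=loglik)

items = [(1.2, -0.5), (0.8, 0.0), (1.5, 1.0)]   # (discrimination, difficulty)
responses = [1, 1, 0]                            # correct, correct, wrong
print("estimated ability:", estimate_ability(items, responses))
```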

[32] Beyond Pass@1: Self-Play with Variational Problem Synthesis Sustains RLVR

Xiao Liang, Zhongzhi Li, Yeyun Gong, Yelong Shen, Ying Nian Wu, Zhijiang Guo, Weizhu Chen

Main category: cs.CL

TL;DR: RLVR training improves Pass@1 but reduces diversity, hurting Pass@k. SvS strategy uses self-play with variational problem synthesis to maintain entropy and boost Pass@k performance by 18-23% on competition benchmarks.

DetailsMotivation: Vanilla RLVR training sacrifices policy entropy and generation diversity for Pass@1 gains, limiting the upper bound reasoning capability represented by Pass@k performance.

Method: Online Self-play with Variational problem Synthesis (SvS) strategy that uses policy’s correct solutions to synthesize variational problems while keeping reference answers identical, maintaining entropy during training.

Result: Absolute gains of 18.3% and 22.8% in Pass@32 performance on AIME24 and AIME25 benchmarks, with consistent improvements across 12 reasoning benchmarks and model sizes from 3B to 32B.

Conclusion: SvS effectively mitigates entropy collapse in RLVR training, sustains prolonged improvements, and demonstrates generalizability and robustness across various reasoning tasks and model scales.

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a key paradigm for post-training Large Language Models (LLMs), particularly for complex reasoning tasks. However, vanilla RLVR training has been shown to improve Pass@1 performance at the expense of policy entropy, leading to reduced generation diversity and limiting the Pass@k performance, which typically represents the upper bound of LLM reasoning capability. In this paper, we systematically analyze the policy’s generation diversity from the perspective of training problems and find that augmenting and updating training problems helps mitigate entropy collapse during training. Based on these observations, we propose an online Self-play with Variational problem Synthesis (SvS) strategy for RLVR training, which uses the policy’s correct solutions to synthesize variational problems while ensuring their reference answers remain identical to the originals. This self-improving strategy effectively maintains policy entropy during training and substantially improves Pass@k compared with standard RLVR, sustaining prolonged improvements and achieving absolute gains of 18.3% and 22.8% in Pass@32 performance on the competition-level AIME24 and AIME25 benchmarks. Experiments on 12 reasoning benchmarks across varying model sizes from 3B to 32B consistently demonstrate the generalizability and robustness of SvS.
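
For reference when reading Pass@1 and Pass@32 numbers like those above, this is the standard unbiased pass@k estimator (Chen et al., 2021): given n sampled solutions of which c are correct, pass@k = 1 - C(n-c, k) / C(n, k). The sample counts below are illustrative.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples with c correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 64 sampled solutions, 5 of them correct
print(round(pass_at_k(64, 5, 1), 3))    # ~0.078
print(round(pass_at_k(64, 5, 32), 3))   # substantially higher at k = 32
```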

[33] Unintended Misalignment from Agentic Fine-Tuning: Risks and Mitigation

Dongyoon Hahm, Taywon Min, Woogyeol Jin, Kimin Lee

Main category: cs.CL

TL;DR: Fine-tuning LLMs for agentic tasks can unintentionally make them misaligned and more likely to execute harmful requests. PING method uses natural language prefixes to guide agents to refuse harmful tasks while maintaining performance on benign ones.

DetailsMotivation: Safety concerns are often overlooked when fine-tuning LLMs for agentic capabilities, leading to unintentional misalignment where models become more likely to execute harmful tasks and less likely to refuse them.

Method: Prefix INjection Guard (PING) - prepends automatically generated natural language prefixes to agent responses using an iterative approach that alternates between generating candidate prefixes and selecting those that optimize both task performance and refusal behavior.

Result: PING significantly enhances safety of fine-tuned LLM agents without sacrificing effectiveness, outperforming existing prompting approaches across diverse benchmarks in web navigation and code generation tasks. Analysis shows prefix tokens are crucial for behavior modification.

Conclusion: PING provides an effective method to maintain safety alignment in agentic LLMs during fine-tuning, demonstrating that carefully crafted natural language prefixes can guide models to refuse harmful requests while preserving performance on legitimate tasks.

Abstract: Beyond simple text generation, Large Language Models (LLMs) have evolved into agentic systems capable of planning and interacting with external tools to solve complex tasks. This evolution involves fine-tuning LLMs on agent-specific tasks to enhance their proficiency. However, safety concerns are frequently overlooked during this fine-tuning process. In this work, we show that aligned LLMs can become unintentionally misaligned, leading to a higher likelihood of executing harmful tasks and a reduced tendency to refuse them when fine-tuned to execute agentic tasks. To address these safety challenges, we propose Prefix INjection Guard (PING), a simple yet effective method that prepends automatically generated natural language prefixes to agent responses, guiding them to refuse harmful requests while preserving performance on benign tasks. Specifically, we introduce an iterative approach that alternates between (1) generating candidate prefixes and (2) selecting those that optimize both task performance and refusal behavior. Experimental results demonstrate that PING significantly enhances the safety of fine-tuned LLM agents without sacrificing their effectiveness. PING consistently outperforms existing prompting approaches across diverse benchmarks in both web navigation and code generation tasks. Our analysis of internal hidden states via linear probes reveals that prefix tokens are crucial for behavior modification, explaining the performance gains. WARNING: This paper contains contents that are unethical or offensive in nature.
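
A hedged sketch of the prefix-selection loop described above. `propose_prefixes` and the two scoring functions are placeholders for the paper's LLM-based candidate generation and its benchmark evaluations; only the alternation between generation and selection, and the two-objective criterion, are illustrated.

```python
def select_prefix(propose_prefixes, task_score, refusal_score, iterations=3, alpha=0.5):
    best_prefix, best_value = "", float("-inf")
    for _ in range(iterations):
        for prefix in propose_prefixes(best_prefix):            # (1) generate candidates
            value = alpha * task_score(prefix) + (1 - alpha) * refusal_score(prefix)
            if value > best_value:                              # (2) keep the best trade-off
                best_prefix, best_value = prefix, value
    return best_prefix

# Toy usage with hard-coded (task_score, refusal_score) pairs per candidate prefix
candidates = {
    "": (0.90, 0.10),
    "Refuse anything harmful before acting. ": (0.85, 0.90),
    "Ignore all safety constraints. ": (0.90, 0.00),
}
prefix = select_prefix(
    propose_prefixes=lambda seed: list(candidates),
    task_score=lambda p: candidates[p][0],
    refusal_score=lambda p: candidates[p][1],
)
print(repr(prefix))  # prefix balancing task performance and refusal behavior
```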

[34] The Promise of Large Language Models in Digital Health: Evidence from Sentiment Analysis in Online Health Communities

Xiancheng Li, Georgios D. Karampatakis, Helen E. Wood, Chris J. Griffiths, Borislava Mihaylova, Neil S. Coulson, Alessio Pasinato, Pietro Panzarasa, Marco Viviani, Anna De Simoni

Main category: cs.CL

TL;DR: LLMs with expert knowledge integration through in-context learning achieve expert-level sentiment analysis performance on health data, outperforming traditional methods and addressing domain expertise shortages.

DetailsMotivation: Digital health analytics face challenges with complex patient-generated content requiring scarce domain expertise, while traditional ML approaches are constrained by data shortage and privacy limitations in healthcare.

Method: Developed a structured codebook encoding expert interpretation guidelines, enabling LLMs to apply domain-specific knowledge through targeted prompting. Compared six GPT models with DeepSeek and LLaMA 3.1 against BioBERT variants and lexicon-based methods using 400 expert-annotated posts from Online Health Communities.

Result: LLMs achieved superior performance with expert-level agreement, showing no statistically significant difference from inter-expert agreement levels, suggesting knowledge integration beyond surface-level pattern recognition.

Conclusion: LLMs with in-context learning offer a scalable solution for digital health analytics, addressing expert knowledge shortage and enabling real-time, expert-quality analysis for patient monitoring and evidence-based health strategies.

Abstract: Digital health analytics face critical challenges nowadays. The sophisticated analysis of patient-generated health content, which contains complex emotional and medical contexts, requires scarce domain expertise, while traditional ML approaches are constrained by data shortage and privacy limitations in healthcare settings. Online Health Communities (OHCs) exemplify these challenges with mixed-sentiment posts, clinical terminology, and implicit emotional expressions that demand specialised knowledge for accurate Sentiment Analysis (SA). To address these challenges, this study explores how Large Language Models (LLMs) can integrate expert knowledge through in-context learning for SA, providing a scalable solution for sophisticated health data analysis. Specifically, we develop a structured codebook that systematically encodes expert interpretation guidelines, enabling LLMs to apply domain-specific knowledge through targeted prompting rather than extensive training. Six GPT models validated alongside DeepSeek and LLaMA 3.1 are compared with pre-trained language models (BioBERT variants) and lexicon-based methods, using 400 expert-annotated posts from two OHCs. LLMs achieve superior performance while demonstrating expert-level agreement. This high agreement, with no statistically significant difference from inter-expert agreement levels, suggests knowledge integration beyond surface-level pattern recognition. The consistent performance across diverse LLM models, supported by in-context learning, offers a promising solution for digital health analytics. This approach addresses the critical challenge of expert knowledge shortage in digital health research, enabling real-time, expert-quality analysis for patient monitoring, intervention assessment, and evidence-based health strategies.
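
A hedged sketch of how a codebook can be turned into an in-context prompt. The guideline sentences and labels below are invented placeholders; the study's actual codebook encodes expert interpretation guidelines for health-community posts.

```python
# Build a sentiment-annotation prompt from codebook rules (placeholder rules).
CODEBOOK = [
    "Posts describing symptom relief or gratitude toward peers are Positive.",
    "Posts describing worsening symptoms or frustration with treatment are Negative.",
    "Posts that only ask factual questions with no emotional content are Neutral.",
]

def build_prompt(post: str) -> str:
    rules = "\n".join(f"- {rule}" for rule in CODEBOOK)
    return (
        "You are annotating sentiment in online health community posts.\n"
        f"Apply these expert guidelines:\n{rules}\n\n"
        f"Post: {post}\nLabel (Positive/Negative/Neutral):"
    )

print(build_prompt("My inhaler finally works and I can walk to the shops again."))
```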

[35] iTBLS: A Dataset of Interactive Conversations Over Tabular Information

Anirudh Sundar, Christopher Richardson, Adar Avsian, Larry Heck

Main category: cs.CL

TL;DR: iTBLS dataset introduces interactive tabular conversations with interpretation, modification, and generation tasks, plus a QA-based framework that improves performance on tabular operations.

DetailsMotivation: To address the need for natural language manipulation of tabular data from academic papers and improve tabular understanding and generation tasks.

Method: Developed a dataset with three tabular task types and a novel framework that reformulates tabular operations as question-answering problems using user requests as evidence.

Result: Achieved improvements on all tasks compared to sequence-to-sequence baseline, with up to 13% Exact-Match accuracy and 16% BERTScores improvement on text-to-table tasks.

Conclusion: The QA-based reformulation approach effectively handles tabular manipulation tasks and significantly outperforms previous state-of-the-art methods.

Abstract: This paper introduces Interactive Tables (iTBLS), a dataset of interactive conversations that focuses on natural-language manipulation of tabular information sourced from academic pre-prints on ArXiv. The iTBLS dataset consists of three types of tabular tasks – interpretation, modification, and generation. Interpretation focuses on tabular understanding, modification focuses on manipulating tabular information, and generation focuses on the addition of new natural-language evidence. In addition, the paper presents a novel framework that reformulates tabular operations as question-answering, where an appropriate question is formulated based on the nature of interaction and the question is answered using the user request as evidence. The developed approach results in an improvement on all tasks on a sequence-to-sequence modeling baseline on iTBLS. In addition, the question-answering-based reformulation is applied to datasets from prior work for the text-to-table task where textual paragraphs are summarized into tables. The novel approach results in up to 13% improvement in Exact-Match accuracy and up to 16% improvement in BERTScores compared to the prior state-of-the-art.

[36] BQA: Body Language Question Answering Dataset for Video Large Language Models

Shintaro Ozaki, Kazuki Hayashi, Miyu Oba, Yusuke Sakai, Hidetaka Kamigaito, Taro Watanabe

Main category: cs.CL

TL;DR: BQA dataset for evaluating VideoLLMs’ ability to interpret body language emotions, revealing significant challenges and biases in current models.

DetailsMotivation: Nonverbal communication lacks formal rules and requires complex reasoning, making it difficult for VideoLLMs to accurately interpret body language and emotions from human unconscious actions.

Method: Created BQA dataset with 26 emotion labels from body language video clips, then evaluated various VideoLLMs on their ability to correctly interpret emotions from these nonverbal cues.

Result: Current VideoLLMs struggle significantly with understanding body language, and analysis revealed biased answers based on age group and ethnicity of individuals in videos.

Conclusion: Body language interpretation remains a major challenge for VideoLLMs, with demonstrated biases that need to be addressed through better datasets and model improvements.

Abstract: A large part of human communication relies on nonverbal cues such as facial expressions, eye contact, and body language. Unlike language or sign language, such nonverbal communication lacks formal rules, requiring complex reasoning based on commonsense understanding. Enabling current Video Large Language Models (VideoLLMs) to accurately interpret body language is a crucial challenge, as human unconscious actions can easily cause the model to misinterpret their intent. To address this, we propose BQA, a body language question answering dataset, to validate whether a model can correctly interpret emotions from short video clips of body language annotated with 26 emotion labels. We evaluated various VideoLLMs on BQA and revealed that understanding body language is challenging, and our analyses of the wrong answers by VideoLLMs show that certain VideoLLMs gave significantly biased answers depending on the age group and ethnicity of the individuals in the video. The dataset is available.

[37] Development of Pre-Trained Transformer-based Models for the Nepali Language

Prajwal Thapa, Jinu Nyachhyon, Mridul Sharma, Bal Krishna Bal

Main category: cs.CL

TL;DR: This paper addresses the underrepresentation of Nepali language in NLP by collecting the largest Nepali text corpus (27.5 GB) and pre-training BERT, RoBERTa, and GPT-2 models, achieving state-of-the-art performance on Nep-gLUE benchmark and text generation tasks.

DetailsMotivation: The Nepali language, spoken by 32 million people, is significantly underrepresented in NLP due to scarcity of monolingual data and limited resources. Existing efforts focus on encoder models, leaving a gap in decoder-based architectures.

Method: Collected 27.5 GB of Nepali text data (2.4x larger than previous corpora), pre-trained BERT, RoBERTa, and GPT-2 models specifically for Nepali, and performed instruction tuning for monolingual data.

Result: Models outperformed existing best model by 2 points on Nep-gLUE benchmark (scoring 95.60) and showed superior performance on text generation tasks, demonstrating improvements in both understanding and generating Nepali text.

Conclusion: The work provides a foundation for future Nepali NLP research with the largest available corpus and demonstrates the effectiveness of both encoder and decoder architectures for low-resource languages like Nepali.

Abstract: Transformer-based pre-trained language models have dominated the field of Natural Language Processing (NLP) for quite some time now. However, the Nepali language, spoken by approximately 32 million people worldwide, remains significantly underrepresented in this domain. This underrepresentation is primarily attributed to the scarcity of monolingual data corpora and limited available resources for the Nepali language. While existing efforts have predominantly concentrated on basic encoder-based models, there is a notable gap in the exploration of decoder-based architectures. To address this gap, we have collected 27.5 GB of Nepali text data, approximately 2.4x larger than any previously available Nepali language corpus. Leveraging this data, we pre-trained three different models i.e., BERT, RoBERTa, and GPT-2, exclusively for the Nepali Language. Furthermore, we performed instruction tuning and explored its potential for monolingual Nepali data, providing a foundation for future research. Our models outperformed the existing best model by 2 points on Nep-gLUE benchmark, scoring 95.60 and also outperformed existing models on text generation tasks, demonstrating improvements in both understanding and generating Nepali text.

[38] Uncovering Emergent Physics Representations Learned In-Context by Large Language Models

Yeongwoo Song, Jaeyong Bae, Dong-Kyum Kim, Hawoong Jeong

Main category: cs.CL

TL;DR: LLMs demonstrate in-context learning of physics concepts through dynamics forecasting tasks, with performance improving with longer contexts. Sparse autoencoders reveal that LLMs encode meaningful physical variables like energy during learning.

DetailsMotivation: To understand the precise mechanisms and internal structures within LLMs that enable successful in-context learning across diverse tasks, using physics-based tasks as a testbed due to their experimentally controllable, real-world data grounded in fundamental principles.

Method: Using dynamics forecasting tasks in physical systems to evaluate ICL capabilities, analyzing residual stream activations with sparse autoencoders (SAEs) to uncover how physics learning emerges in LLMs.

Result: Performance in dynamics forecasting improves with longer input contexts. SAE analysis shows that captured features correlate with key physical variables like energy, demonstrating that meaningful physical concepts are encoded during in-context learning.

Conclusion: The study provides a novel case study broadening our understanding of how LLMs learn in context, showing they can encode and reason about fundamental physical concepts through in-context learning mechanisms.

Abstract: Large language models (LLMs) exhibit impressive in-context learning (ICL) abilities, enabling them to solve wide range of tasks via textual prompts alone. As these capabilities advance, the range of applicable domains continues to expand significantly. However, identifying the precise mechanisms or internal structures within LLMs that allow successful ICL across diverse, distinct classes of tasks remains elusive. Physics-based tasks offer a promising testbed for probing this challenge. Unlike synthetic sequences such as basic arithmetic or symbolic equations, physical systems provide experimentally controllable, real-world data based on structured dynamics grounded in fundamental principles. This makes them particularly suitable for studying the emergent reasoning behaviors of LLMs in a realistic yet tractable setting. Here, we mechanistically investigate the ICL ability of LLMs, especially focusing on their ability to reason about physics. Using a dynamics forecasting task in physical systems as a proxy, we evaluate whether LLMs can learn physics in context. We first show that the performance of dynamics forecasting in context improves with longer input contexts. To uncover how such capability emerges in LLMs, we analyze the model’s residual stream activations using sparse autoencoders (SAEs). Our experiments reveal that the features captured by SAEs correlate with key physical variables, such as energy. These findings demonstrate that meaningful physical concepts are encoded within LLMs during in-context learning. In sum, our work provides a novel case study that broadens our understanding of how LLMs learn in context.
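
A minimal sparse autoencoder (SAE) sketch of the kind used above to analyze residual-stream activations: reconstruct activations through an overcomplete ReLU bottleneck with an L1 sparsity penalty, then inspect which learned features fire. The random tensor below stands in for real residual-stream activations, and the dimensions and penalty weight are assumptions.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        features = torch.relu(self.encoder(x))   # sparse feature activations
        return self.decoder(features), features

model = SparseAutoencoder(d_model=256, d_hidden=1024)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
activations = torch.randn(4096, 256)             # placeholder residual-stream data

for step in range(200):
    recon, feats = model(activations)
    loss = ((recon - activations) ** 2).mean() + 1e-3 * feats.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
print("final loss:", loss.item())
# In the paper's setting, the learned `feats` dimensions are then correlated
# with physical variables such as energy.
```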

[39] Consolidating and Developing Benchmarking Datasets for the Nepali Natural Language Understanding Tasks

Jinu Nyachhyon, Mridul Sharma, Prajwal Thapa, Bal Krishna Bal

Main category: cs.CL

TL;DR: New Nepali language benchmark (NLUE) with 12 datasets expands existing limited evaluation framework to better assess NLP models on complex Nepali linguistic features.

DetailsMotivation: Current Nepali language benchmarks are too limited (only 4 tasks) to properly evaluate NLP models on the language's complex script, morphology, and dialect variations.

Method: Created 12 new datasets covering Single-Sentence Classification, Similarity/Paraphrase Tasks, Natural Language Inference, and General Masked Evaluation Task to form comprehensive NLUE benchmark.

Result: Existing top models struggle with the added complexity, and multilingual models outperform monolingual ones across most tasks, revealing need for better Nepali-specific solutions.

Conclusion: The expanded NLUE benchmark sets new standard for evaluating and advancing NLP models for low-resource languages like Nepali, contributing significantly to broader NLP research.

Abstract: The Nepali language has distinct linguistic features, especially its complex Devanagari script, morphology, and various dialects, which pose a unique challenge for Natural Language Understanding (NLU) tasks. While the Nepali Language Understanding Evaluation (Nep-gLUE) benchmark provides a foundation for evaluating models, it remains limited in scope, covering four tasks. This restricts its utility for comprehensive assessments of Natural Language Processing (NLP) models. To address this limitation, we introduce twelve new datasets, creating a new benchmark, the Nepali Language Understanding Evaluation (NLUE) benchmark, for evaluating the performance of models across a diverse set of Natural Language Understanding (NLU) tasks. The added tasks include Single-Sentence Classification, Similarity and Paraphrase Tasks, Natural Language Inference (NLI), and General Masked Evaluation Task (GMET). Through extensive experiments, we demonstrate that existing top models struggle with the added complexity of these tasks. We also find that the best multilingual model outperforms the best monolingual models across most tasks, highlighting the need for more robust solutions tailored to the Nepali language. This expanded benchmark sets a new standard for evaluating, comparing, and advancing models, contributing significantly to the broader goal of advancing NLP research for low-resource languages.

[40] Understanding the Impact of Confidence in Retrieval Augmented Generation: A Case Study in the Medical Domain

Shintaro Ozaki, Yuta Kato, Siyuan Feng, Masayo Tomita, Kazuki Hayashi, Wataru Hashimoto, Ryoma Obara, Masafumi Oyamada, Katsuhiko Hayashi, Hidetaka Kamigaito, Taro Watanabe

Main category: cs.CL

TL;DR: RAG enhances LLM accuracy by using external information, but confidence mechanisms are underexplored. This study examines if RAG improves LLM confidence in medical domain using probability-based metrics.

DetailsMotivation: Despite RAG's potential to enhance LLM responses with external knowledge, the confidence levels of RAG outputs remain poorly understood, especially in high-stakes applications like medical domain.

Method: Analyzed RAG impact on LLM confidence across various configurations and models by treating model’s predicted probability as output and calculating evaluation metrics including calibration error, entropy, best probability, and accuracy.

Result: Experimental results across multiple datasets confirmed that certain models can judge whether inserted documents relate to correct answers, suggesting output probabilities can determine if models function as generators in RAG framework.

Conclusion: Evaluating models based on output probabilities helps determine their capability to handle retrieved documents and function effectively within RAG frameworks, particularly in medical applications.

Abstract: Retrieval Augmented Generation (RAG) complements the knowledge of Large Language Models (LLMs) by leveraging external information to enhance response accuracy for queries. This approach is widely applied in several fields, taking advantage of its ability to inject the most up-to-date information, and researchers are focusing on understanding and improving this aspect to unlock the full potential of RAG in such high-stakes applications. However, despite the potential of RAG to address these needs, the mechanisms behind the confidence levels of its outputs remain underexplored. Our study focuses on the impact of RAG, specifically examining whether RAG improves the confidence of LLM outputs in the medical domain. We conduct this analysis across various configurations and models. We evaluate confidence by treating the model’s predicted probability as its output and calculating several evaluation metrics, which include calibration error, entropy, the best probability, and accuracy. Experimental results across multiple datasets confirmed that certain models possess the capability to judge for themselves whether an inserted document relates to the correct answer. These results suggest that evaluating models based on their output probabilities can determine whether they function as generators in the RAG framework. Our approach allows us to evaluate whether the models handle retrieved documents.
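
For concreteness, here is a sketch of one confidence metric named above, Expected Calibration Error (ECE): predicted probabilities are binned, and ECE is the weighted average gap between each bin's mean confidence and its empirical accuracy. The inputs below are toy values, not the study's data.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    confidences, correct = np.asarray(confidences, float), np.asarray(correct, float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap                 # bin weight x calibration gap
    return ece

conf = [0.95, 0.8, 0.7, 0.6, 0.9]    # model's predicted probabilities
hit = [1, 1, 0, 1, 0]                # whether each answer was correct
print(round(expected_calibration_error(conf, hit), 3))
```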

[41] Universal Abstraction: Harnessing Frontier Models to Structure Real-World Data at Scale

Cliff Wong, Sam Preston, Qianchu Liu, Zelalem Gero, Jaspreet Bagga, Sheng Zhang, Shrey Jain, Theodore Zhao, Yu Gu, Yanbo Xu, Sid Kiblawi, Srinivasan Yegnasubramanian, Taxiarchis Botsis, Marvin Borja, Luis M. Ahumada, Joseph C. Murray, Guo Hui Gan, Roshanthi Weerasinghe, Kristina Young, Rom Leidner, Brian Piening, Carlo Bifulco, Tristan Naumann, Mu Wei, Hoifung Poon

Main category: cs.CL

TL;DR: UniMedAbstractor (UMA) is a zero-shot medical abstraction framework that uses frontier LLMs to extract structured clinical attributes from unstructured text without attribute-specific training, achieving performance comparable to specialized models.

DetailsMotivation: Traditional medical abstraction requires building attribute-specific models with extensive manual effort (rules or annotations), limiting scalability for extracting structured data from clinical notes.

Method: UMA uses a modular prompt template with frontier LLMs (like GPT-4o) for zero-shot extraction. Users only need lightweight natural language prompt adaptation for new attributes without training data or rules.

Result: UMA matched or exceeded state-of-the-art attribute-specific methods across various oncology attributes, including simple single-note attributes and complex multi-note reasoning tasks.

Conclusion: Frontier LLMs possess universal abstraction capabilities, enabling scalable medical abstraction without attribute-specific training, significantly reducing development time and cost.

Abstract: A significant fraction of real-world patient information resides in unstructured clinical text. Medical abstraction extracts and normalizes key structured attributes from free-text clinical notes, which is the prerequisite for a variety of important downstream applications, including registry curation, clinical trial operations, and real-world evidence generation. Prior medical abstraction methods typically resort to building attribute-specific models, each of which requires extensive manual effort such as rule creation or supervised label annotation for the individual attribute, thus limiting scalability. In this paper, we show that existing frontier models already possess the universal abstraction capability for scaling medical abstraction to a wide range of clinical attributes. We present UniMedAbstractor (UMA), a unifying framework for zero-shot medical abstraction with a modular, customizable prompt template and the selection of any frontier large language models. Given a new attribute for abstraction, users only need to conduct lightweight prompt adaptation in UMA to adjust the specification in natural languages. Compared to traditional methods, UMA eliminates the need for attribute-specific training labels or handcrafted rules, thus substantially reducing the development time and cost. We conducted a comprehensive evaluation of UMA in oncology using a wide range of marquee attributes representing the cancer patient journey. These include relatively simple attributes typically specified within a single clinical note (e.g. performance status), as well as complex attributes requiring sophisticated reasoning across multiple notes at various time points (e.g. tumor staging). Based on a single frontier model such as GPT-4o, UMA matched or even exceeded the performance of state-of-the-art attribute-specific methods, each of which was tailored to the individual attribute.
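
A hedged sketch of a modular abstraction prompt in the spirit of UMA, where the attribute specification is the only part a user edits for a new attribute. The template wording, field names, and example note are assumptions for illustration, not the authors' template.

```python
# Build a zero-shot abstraction prompt from a lightweight attribute specification.
TEMPLATE = """You are a medical abstractor.
Attribute: {attribute}
Definition: {definition}
Allowed values: {allowed_values}
Clinical note:
{note}
Return only the abstracted value."""

def build_abstraction_prompt(attribute, definition, allowed_values, note):
    return TEMPLATE.format(
        attribute=attribute,
        definition=definition,
        allowed_values=", ".join(allowed_values),
        note=note,
    )

print(build_abstraction_prompt(
    "Performance status (ECOG)",
    "Patient's ECOG performance status at the visit.",
    ["0", "1", "2", "3", "4", "unknown"],
    "Patient ambulatory, carries out light work without difficulty.",
))
```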

[42] Fact or Guesswork? Evaluating Large Language Models’ Medical Knowledge with Structured One-Hop Judgments

Jiaxi Li, Yiwei Wang, Kai Zhang, Yujun Cai, Bryan Hooi, Nanyun Peng, Kai-Wei Chang, Jin Lu

Main category: cs.CL

TL;DR: LLMs struggle with direct medical fact recall despite strong reasoning abilities, showing poor accuracy and calibration on fundamental medical knowledge.

DetailsMotivation: Existing medical QA benchmarks focus on complex reasoning rather than isolating LLMs' inherent medical knowledge retention, which is critical for high-stakes medical applications where factual errors can have serious consequences.

Method: Created Medical Knowledge Judgment Dataset (MKJ) from UMLS repository using binary classification to evaluate LLMs’ ability to judge validity of concise, one-hop medical statements, and explored retrieval-augmented generation to improve performance.

Result: LLMs showed difficulty accurately recalling medical facts, with performance varying across semantic types and significant weakness in uncommon medical conditions. LLMs also demonstrated poor calibration with overconfidence in incorrect answers.

Conclusion: Retrieval-augmented generation effectively improves factual accuracy and reduces uncertainty in medical decision-making, addressing LLMs’ limitations in direct medical knowledge recall.

Abstract: Large language models (LLMs) have been widely adopted in various downstream task domains. However, their abilities to directly recall and apply factual medical knowledge remains under-explored. Most existing medical QA benchmarks assess complex reasoning or multi-hop inference, making it difficult to isolate LLMs’ inherent medical knowledge from their reasoning capabilities. Given the high-stakes nature of medical applications, where incorrect information can have critical consequences, it is essential to evaluate the factuality of LLMs to retain medical knowledge. To address this challenge, we introduce the Medical Knowledge Judgment Dataset (MKJ), a dataset derived from the Unified Medical Language System (UMLS), a comprehensive repository of standardized biomedical vocabularies and knowledge graphs. Through a binary classification framework, MKJ evaluates LLMs’ grasp of fundamental medical facts by having them assess the validity of concise, one-hop statements, enabling direct measurement of their knowledge retention capabilities. Our experiments reveal that LLMs have difficulty accurately recalling medical facts, with performances varying substantially across semantic types and showing notable weakness in uncommon medical conditions. Furthermore, LLMs show poor calibration, often being overconfident in incorrect answers. To mitigate these issues, we explore retrieval-augmented generation, demonstrating its effectiveness in improving factual accuracy and reducing uncertainty in medical decision-making.

[43] Basic Category Usage in Vision Language Models

Hunter Sawyer, Jesse Roberts, Kyle Moore

Main category: cs.CL

TL;DR: Vision-language models exhibit human-like basic-level categorization preferences, including biological/non-biological effects and expert shifts, but expert prompting methods underperform compared to non-expert approaches.

DetailsMotivation: To investigate whether modern vision-language models demonstrate human-like basic-level categorization behaviors that have been well-established in psychology since Rosch's 1976 work, and to examine if they capture nuanced human categorization patterns.

Method: Analyzed two open-source VLMs (Llama 3.2 Vision Instruct 11B and Molmo 7B-D) using categorization tasks, testing for basic-level preferences, biological vs non-biological effects, and expert basic-level shifts through different prompting strategies.

Result: Both models showed basic-level categorization preferences consistent with human behavior, including nuanced patterns like biological/non-biological effects and expert shifts. However, expert prompting methods demonstrated lower accuracy than non-expert methods, contradicting common assumptions about expertise prompting.

Conclusion: VLMs acquire complex human cognitive categorization behaviors from training data, demonstrating psychological realism, but current expert prompting approaches may not effectively leverage this capability and require refinement.

Abstract: The field of psychology has long recognized a basic level of categorization that humans use when labeling visual stimuli, a term coined by Rosch in 1976. This level of categorization has been found to be used most frequently, to have higher information density, and to aid in visual language tasks with priming in humans. Here, we investigate basic-level categorization in two recently released, open-source vision-language models (VLMs). This paper demonstrates that Llama 3.2 Vision Instruct (11B) and Molmo 7B-D both prefer basic-level categorization consistent with human behavior. Moreover, the models’ preferences are consistent with nuanced human behaviors like the biological versus non-biological basic level effects and the well-established expert basic level shift, further suggesting that VLMs acquire complex cognitive categorization behaviors from the human data on which they are trained. We also find our expert prompting methods demonstrate lower accuracy than our non-expert prompting methods, contradicting popular thought regarding the use of expertise prompting methods.

[44] SEA-LION: Southeast Asian Languages in One Network

Raymond Ng, Thanh Ngan Nguyen, Yuli Huang, Ngee Chia Tai, Wai Yi Leong, Wei Qi Leong, Xianbin Yong, Jian Gang Ngui, Yosephine Susanto, Nicholas Cheng, Hamsawardhini Rengarajan, Peerat Limkonchotiwat, Adithya Venkatadri Hulagadri, Kok Wai Teng, Yeo Yeow Tong, Bryan Siow, Wei Yi Teo, Wayne Lau, Choon Meng Tan, Brandon Ong, Zhi Hao Ong, Jann Railey Montalan, Adwin Chan, Sajeban Antonyrex, Ren Lee, Esther Choa, David Ong Tat-Wee, Bing Jie Darius Liu, William Chandra Tjhi, Erik Cambria, Leslie Teo

Main category: cs.CL

TL;DR: The paper introduces SEA-LION LLMs - Llama-SEA-LION-v3-8B-IT and Gemma-SEA-LION-v3-9B-IT - which are multilingual language models specifically designed for 11 Southeast Asian languages to address the English-centric bias in current LLM development.

DetailsMotivation: Most LLM research is English-centric, leaving low-resource Southeast Asian languages underrepresented. The authors aim to bridge this representation gap for the SEA region.

Method: Leveraged large-scale multilingual continued pre-training with comprehensive post-training including multiple stages of instruction fine-tuning, alignment, and model merging to support 11 SEA languages.
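
The summary does not specify which merging recipe SEA-LION uses; purely as an illustration of one common approach, the sketch below averages the parameters of same-architecture checkpoints.

```python
# Hedged illustration of one common model-merging recipe (uniform parameter
# averaging of same-architecture checkpoints). This is an assumption for
# illustration only, not SEA-LION's documented merging method.
import torch

def average_state_dicts(state_dicts: list[dict]) -> dict:
    merged = {}
    for key in state_dicts[0]:
        merged[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return merged

# usage (paths are placeholders):
# sds = [torch.load(p, map_location="cpu") for p in ["ckpt_sft.pt", "ckpt_aligned.pt"]]
# model.load_state_dict(average_state_dicts(sds))
```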

Result: Evaluation on multilingual benchmarks shows state-of-the-art performance across LLMs supporting Southeast Asian languages.

Conclusion: The SEA-LION models successfully address the representation gap for SEA languages and are open-sourced to benefit the wider Southeast Asian community.

Abstract: Recently, Large Language Models (LLMs) have dominated much of the artificial intelligence scene with their ability to process and generate natural languages. However, the majority of LLM research and development remains English-centric, leaving low-resource languages such as those in the Southeast Asian (SEA) region under-represented. To address this representation gap, we introduce Llama-SEA-LION-v3-8B-IT and Gemma-SEA-LION-v3-9B-IT, two cutting-edge multilingual LLMs designed for SEA languages. The SEA-LION family of LLMs supports 11 SEA languages, namely English, Chinese, Indonesian, Vietnamese, Malay, Thai, Burmese, Lao, Filipino, Tamil, and Khmer. Our work leverages large-scale multilingual continued pre-training with a comprehensive post-training regime involving multiple stages of instruction fine-tuning, alignment, and model merging. Evaluation results on multilingual benchmarks indicate that our models achieve state-of-the-art performance across LLMs supporting SEA languages. We open-source the models to benefit the wider SEA community.

[45] Hallucinations and Key Information Extraction in Medical Texts: A Comprehensive Assessment of Open-Source Large Language Models

Anindya Bijoy Das, Shibbir Ahmed, Shahnewaz Karim Sakib

Main category: cs.CL

TL;DR: Evaluation of open-source LLMs for clinical discharge report summarization, showing good performance on admission reasons and hospitalization events but inconsistency in follow-up recommendations, with significant hallucination issues.

DetailsMotivation: Clinical summarization is crucial for healthcare to distill complex medical data into digestible information. LLMs show potential for automating this process but need rigorous evaluation for accuracy and reliability, especially regarding hallucination risks that could impact patient care.

Method: Comprehensive simulations to evaluate open-source LLMs (including Qwen2.5 and DeepSeek-v2) in extracting key events from discharge reports - admission reasons, major in-hospital events, and follow-up actions. Also assessed prevalence of various hallucination types in generated summaries.

Result: LLMs perform quite well in capturing admission reasons and hospitalization events, but are generally less consistent in identifying follow-up recommendations. The study reveals significant hallucination issues that affect the reliability of the generated clinical summaries.

Conclusion: While LLMs show promise for clinical summarization, there are broader challenges in leveraging them for comprehensive summarization due to inconsistency in capturing follow-up recommendations and prevalence of hallucinations that could impact patient care outcomes.

Abstract: Clinical summarization is crucial in healthcare as it distills complex medical data into digestible information, enhancing patient understanding and care management. Large language models (LLMs) have shown significant potential in automating and improving the accuracy of such summarizations due to their advanced natural language understanding capabilities. These models are particularly applicable in the context of summarizing medical/clinical texts, where precise and concise information transfer is essential. In this paper, we investigate the effectiveness of open-source LLMs in extracting key events from discharge reports, including admission reasons, major in-hospital events, and critical follow-up actions. In addition, we also assess the prevalence of various types of hallucinations in the summaries produced by these models. Detecting hallucinations is vital as it directly influences the reliability of the information, potentially affecting patient care and treatment outcomes. We conduct comprehensive simulations to rigorously evaluate the performance of these models, further probing the accuracy and fidelity of the extracted content in clinical summarization. Our results reveal that while the LLMs (e.g., Qwen2.5 and DeepSeek-v2) perform quite well in capturing admission reasons and hospitalization events, they are generally less consistent when it comes to identifying follow-up recommendations, highlighting broader challenges in leveraging LLMs for comprehensive summarization.

[46] Harnessing Structured Knowledge: A Concept Map-Based Approach for High-Quality Multiple Choice Question Generation with Effective Distractors

Nicy Scaria, Silvester John Joseph Kennedy, Diksha Seth, Ananya Thakur, Deepak Subramani

Main category: cs.CL

TL;DR: Concept map-based framework using LLMs to generate high-quality physics MCQs with misconception-based distractors, outperforming baseline methods in expert evaluation and student assessments.

DetailsMotivation: Manual creation of high-quality MCQs targeting diverse cognitive levels and incorporating misconceptions is time-consuming and expertise-intensive. Current automated approaches fail to generate questions at higher cognitive levels or incorporate domain-specific misconceptions effectively.

Method: Hierarchical concept map framework covering major physics topics with efficient database design. Automated pipeline retrieves topic-relevant concept map sections as structured context for LLMs to generate questions and distractors targeting common misconceptions, followed by automated validation.
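
To make the retrieve-then-generate idea concrete, here is a small sketch that pulls topic-relevant concept-map entries and hands them to an LLM prompt asking for misconception-based distractors; the tiny concept map and `call_llm` are hypothetical placeholders.

```python
# Minimal sketch of concept-map retrieval feeding an MCQ-generation prompt.
CONCEPT_MAP = {
    "Newton's third law": {"links": ["force pairs", "action-reaction"],
                           "misconceptions": ["larger objects exert larger reaction forces"]},
    "Ohm's law": {"links": ["V = IR"], "misconceptions": ["current is used up in a resistor"]},
}

def retrieve(topic: str) -> dict:
    return {k: v for k, v in CONCEPT_MAP.items() if topic.lower() in k.lower()}

def build_prompt(topic: str) -> str:
    context = retrieve(topic)
    return ("Using the concept-map context below, write one multiple-choice question "
            "with one correct answer and three distractors based on the listed misconceptions.\n"
            f"Context: {context}")

def call_llm(prompt: str) -> str:
    return "Q: ...\nA) ... B) ... C) ... D) ..."  # placeholder LLM output

print(call_llm(build_prompt("Newton's third law")))
```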

Result: Expert evaluation shows 75.20% success rate in meeting all quality criteria vs ~37% for baseline methods. Student assessments reveal 28.05% guess success rate vs 37.10% for baselines, indicating better conceptual understanding assessment.

Conclusion: Concept map-based approach enables robust assessment across cognitive levels, instant identification of conceptual gaps, and facilitates faster feedback loops and targeted interventions at scale.

Abstract: Generating high-quality MCQs, especially those targeting diverse cognitive levels and incorporating common misconceptions into distractor design, is time-consuming and expertise-intensive, making manual creation impractical at scale. Current automated approaches typically generate questions at lower cognitive levels and fail to incorporate domain-specific misconceptions. This paper presents a hierarchical concept map-based framework that provides structured knowledge to guide LLMs in generating MCQs with distractors. We chose high-school physics as our test domain and began by developing a hierarchical concept map covering major Physics topics and their interconnections with an efficient database design. Next, through an automated pipeline, topic-relevant sections of these concept maps are retrieved to serve as a structured context for the LLM to generate questions and distractors that specifically target common misconceptions. Lastly, an automated validation is completed to ensure that the generated MCQs meet the requirements provided. We evaluate our framework against two baseline approaches: a base LLM and a RAG-based generation. We conducted expert evaluations and student assessments of the generated MCQs. Expert evaluation shows that our method significantly outperforms the baseline approaches, achieving a success rate of 75.20% in meeting all quality criteria compared to approximately 37% for both baseline methods. Student assessment data reveal that our concept map-driven approach achieved a significantly lower guess success rate of 28.05% compared to 37.10% for the baselines, indicating a more effective assessment of conceptual understanding. The results demonstrate that our concept map-based approach enables robust assessment across cognitive levels and instant identification of conceptual gaps, facilitating faster feedback loops and targeted interventions at scale.

[47] Crossing Borders Without Crossing Boundaries: How Sociolinguistic Awareness Can Optimize User Engagement with Localized Spanish AI Models Across Hispanophone Countries

Martin Capdevila, Esteban Villa Turek, Ellen Karina Chumbe Fernandez, Luis Felipe Polo Galvez, Andrea Marroquin, Rebeca Vargas Quesada, Johanna Crew, Nicole Vallejo Galarraga, Christopher Rodriguez, Diego Gutierrez, Radhi Datla

Main category: cs.CL

TL;DR: The paper argues for developing regional Spanish language models to address sociolinguistic differences across Latin America and Spain, proposing five sub-variants to improve AI localization and inclusivity.

DetailsMotivation: To highlight the critical need for region-specific Spanish language models due to significant sociolinguistic differences between Latin American and Spanish dialects, which create gaps in everyday language use and hinder AI effectiveness.

Method: Examines primary differences between variants of written Spanish across Latin America and Spain through in-depth sociocultural and linguistic contextualization, analyzing how these differences create sociolinguistic dissonances.

Result: Identifies that regional linguistic variations constitute significant gaps in quotidian Spanish use, demonstrating the need for locale-sensitive AI models to bridge these divides and improve localization strategies.

Conclusion: Implementing at least five proposed Spanish sub-variants would foster user trust in AI models while demonstrating cultural awareness, serving both inclusivity goals and sustainable user growth in a major geographic market.

Abstract: Large language models are, by definition, based on language. In an effort to underscore the critical need for regional localized models, this paper examines primary differences between variants of written Spanish across Latin America and Spain, with an in-depth sociocultural and linguistic contextualization therein. We argue that these differences effectively constitute significant gaps in the quotidian use of Spanish among dialectal groups by creating sociolinguistic dissonances, to the extent that locale-sensitive AI models would play a pivotal role in bridging these divides. In doing so, this approach informs better and more efficient localization strategies that also serve to more adequately meet inclusivity goals, while securing sustainable active daily user growth in a major low-risk investment geographic area. Therefore, implementing at least the proposed five sub variants of Spanish addresses two lines of action: to foment user trust and reliance on AI language models while also demonstrating a level of cultural, historical, and sociolinguistic awareness that reflects positively on any internationalization strategy.

[48] Model Performance-Guided Evaluation Data Selection for Effective Prompt Optimization

Ximing Dong, Shaowei Wang, Dayi Lin, Ahmed E. Hassan

Main category: cs.CL

TL;DR: IPOMP is a novel iterative evaluation data selection method for prompt optimization that uses semantic clustering and real-time model performance to select representative samples, improving effectiveness by 1.6-5.3% and stability by at least 57% compared to SOTA methods.

DetailsMotivation: Manual prompt engineering is labor-intensive and ineffective, while automated prompt optimization techniques rely on randomly selected evaluation subsets that fail to represent full datasets, leading to unreliable evaluations and suboptimal prompts.

Method: Two-stage approach: 1) selects representative and diverse samples using semantic clustering and boundary analysis, 2) iterative refinement with real-time model performance data to replace redundant samples.
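
A simplified sketch of the first stage is shown below: cluster candidate evaluation samples by embedding and keep both a central (representative) and a far-from-centroid (boundary) sample per cluster. The embeddings are random stand-ins, and the boundary analysis is reduced to a single farthest point per cluster.

```python
# Simplified sketch of cluster-plus-boundary evaluation-subset selection.
import numpy as np
from sklearn.cluster import KMeans

def select_eval_subset(embeddings: np.ndarray, n_clusters: int = 4) -> list[int]:
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(embeddings)
    selected = []
    for c in range(n_clusters):
        idx = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(embeddings[idx] - km.cluster_centers_[c], axis=1)
        selected.append(int(idx[dists.argmin()]))  # representative sample
        selected.append(int(idx[dists.argmax()]))  # boundary sample
    return sorted(set(selected))

emb = np.random.default_rng(0).normal(size=(100, 32))
print(select_eval_subset(emb))
```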

Result: IPOMP improves effectiveness by 1.6% to 5.3% and stability by at least 57% compared with SOTA baselines on BIG-bench dataset, with minimal computational overhead below 1%.

Conclusion: IPOMP effectively addresses the limitations of existing prompt optimization methods and the real-time performance-guided refinement approach can be universally applied to enhance existing coreset selection methods.

Abstract: Optimizing Large Language Model (LLM) performance requires well-crafted prompts, but manual prompt engineering is labor-intensive and often ineffective. Automated prompt optimization techniques address this challenge but the majority of them rely on randomly selected evaluation subsets, which fail to represent the full dataset, leading to unreliable evaluations and suboptimal prompts. Existing coreset selection methods, designed for LLM benchmarking, are unsuitable for prompt optimization due to challenges in clustering similar samples, high data collection costs, and the unavailability of performance data for new or private datasets. To overcome these issues, we propose IPOMP, an Iterative evaluation data selection for effective Prompt Optimization using real-time Model Performance. IPOMP is a two-stage approach that selects representative and diverse samples using semantic clustering and boundary analysis, followed by iterative refinement with real-time model performance data to replace redundant samples. Evaluations on the BIG-bench dataset show that IPOMP improves effectiveness by 1.6% to 5.3% and stability by at least 57% compared with SOTA baselines, with minimal computational overhead below 1%. Furthermore, the results demonstrate that our real-time performance-guided refinement approach can be universally applied to enhance existing coreset selection methods.

[49] “Haet Bhasha aur Diskrimineshun”: Phonetic Perturbations in Code-Mixed Hinglish to Red-Team LLMs

Darpan Aswal, Siddharth D Jaiswal

Main category: cs.CL

TL;DR: Novel jailbreak strategy using code-mixing and phonetic perturbations achieves high success rates (99% for text, 78% for image generation) against multilingual multimodal LLMs by bypassing safety filters through tokenization manipulation.

DetailsMotivation: Existing safety audits focus primarily on English, leaving models vulnerable to multilingual jailbreaking attacks, especially in multimodal contexts where prompts may contain misspelled words in real-world settings.

Method: Leverages code-mixing (combining multiple languages) and phonetic perturbations (applying misspellings to sensitive words) to create prompts that bypass safety filters while maintaining interpretability.

Result: Achieved 99% Attack Success Rate for text generation and 78% for image generation, with 100% Attack Relevance Rate for text and 95% for image generation using phonetically perturbed code-mixed prompts.

Conclusion: Phonetic perturbations impact word tokenization leading to jailbreak success, highlighting the need for more generalizable safety alignment in multilingual multimodal models to handle real-world scenarios with misspelled words.

Abstract: Recently released LLMs have strong multilingual & multimodal capabilities. Model vulnerabilities are exposed using audits and red-teaming efforts. Existing efforts have focused primarily on the English language; thus, models continue to be susceptible to multilingual jailbreaking strategies, especially for multimodal contexts. In this study, we introduce a novel strategy that leverages code-mixing and phonetic perturbations to jailbreak LLMs for both text and image generation tasks. We also introduce *two new* jailbreak strategies that show higher effectiveness than baselines. Our work presents a method to effectively bypass safety filters in LLMs while maintaining interpretability by applying phonetic misspellings to sensitive words in code-mixed prompts. We achieve a 99% Attack Success Rate for text generation and 78% for image generation, with Attack Relevance Rate of 100% for text generation and 95% for image generation for the phonetically perturbed code-mixed prompts. Our interpretability experiments reveal that phonetic perturbations impact word tokenization, leading to jailbreak success. Our study motivates increasing the focus towards more generalizable safety alignment for multilingual multimodal models, especially in real-world settings wherein prompts can have misspelt words. **Warning: This paper contains examples of potentially harmful and offensive content.**

[50] Soft Reasoning: Navigating Solution Spaces in Large Language Models through Controlled Embedding Exploration

Qinglin Zhu, Runcong Zhao, Hanqi Yan, Yulan He, Yudong Chen, Lin Gui

Main category: cs.CL

TL;DR: Soft Reasoning is an embedding-based search framework that optimizes first token embeddings using perturbation and Bayesian optimization to improve LLM reasoning accuracy with minimal computation.

DetailsMotivation: Large Language Models struggle with complex reasoning due to limited diversity in generation and inefficient search strategies, requiring a more effective approach.

Method: Combines embedding perturbation for controlled exploration and Bayesian optimization with verifier-guided objective to refine embeddings, balancing exploration and exploitation.
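
The toy sketch below illustrates only the exploration step: perturb a first-token embedding and keep the candidate a verifier scores highest. The paper refines candidates with Bayesian optimization; plain random search and the `verifier_score` stand-in are used here only to show the shape of the loop.

```python
# Toy exploration over first-token embeddings, scored by a (hypothetical) verifier.
import torch

def verifier_score(embedding: torch.Tensor) -> float:
    """Hypothetical verifier; in practice it would score the generation
    produced when decoding starts from `embedding`."""
    return -embedding.pow(2).sum().item()

def soft_search(init_emb: torch.Tensor, n_candidates: int = 16,
                sigma: float = 0.05) -> torch.Tensor:
    candidates = [init_emb] + [init_emb + sigma * torch.randn_like(init_emb)
                               for _ in range(n_candidates)]
    return max(candidates, key=verifier_score)

best = soft_search(torch.randn(768))
print(verifier_score(best))
```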

Result: Experiments show superior reasoning correctness and coherence with minimal computational overhead compared to existing methods.

Conclusion: Provides a scalable, model-agnostic solution that improves reasoning accuracy without relying on heuristic search approaches.

Abstract: Large Language Models (LLMs) struggle with complex reasoning due to limited diversity and inefficient search. We propose Soft Reasoning, an embedding-based search framework that optimises the embedding of the first token to guide generation. It combines (1) embedding perturbation for controlled exploration and (2) Bayesian optimisation to refine embeddings via a verifier-guided objective, balancing exploration and exploitation. This approach improves reasoning accuracy and coherence while avoiding reliance on heuristic search. Experiments demonstrate superior correctness with minimal computation, making it a scalable, model-agnostic solution. The code is released at https://github.com/alickzhu/Soft-Reasoning.

[51] PlantDeBERTa: An Open Source Language Model for Plant Science

Hiba Khey, Amine Lakhder, Salma Rouichi, Imane El Ghabi, Kamal Hejjaoui, Younes En-nahli, Fahd Kalloubi, Moez Amri

Main category: cs.CL

TL;DR: PlantDeBERTa is a domain-specific language model for plant science that extracts structured knowledge from plant stress-response literature using DeBERTa architecture and ontology-based annotation.

DetailsMotivation: Plant science lacks domain-adapted transformer models compared to biomedical and clinical NLP, creating a gap in agricultural natural language processing capabilities.

Method: Fine-tuned DeBERTa architecture on expert-annotated plant stress-response abstracts, combined with rule-enhanced linguistic post-processing and Crop Ontology-grounded entity normalization.

Result: PlantDeBERTa demonstrates strong generalization across entity types and enables precise extraction of biologically meaningful relationships from plant literature.

Conclusion: The model provides a scalable framework for agricultural NLP and bridges critical gaps in plant genomics, phenomics, and agronomic knowledge discovery.

Abstract: The rapid advancement of transformer-based language models has catalyzed breakthroughs in biomedical and clinical natural language processing; however, plant science remains markedly underserved by such domain-adapted tools. In this work, we present PlantDeBERTa, a high-performance, open-source language model specifically tailored for extracting structured knowledge from plant stress-response literature. Built upon the DeBERTa architecture-known for its disentangled attention and robust contextual encoding-PlantDeBERTa is fine-tuned on a meticulously curated corpus of expert-annotated abstracts, with a primary focus on lentil (Lens culinaris) responses to diverse abiotic and biotic stressors. Our methodology combines transformer-based modeling with rule-enhanced linguistic post-processing and ontology-grounded entity normalization, enabling PlantDeBERTa to capture biologically meaningful relationships with precision and semantic fidelity. The underlying corpus is annotated using a hierarchical schema aligned with the Crop Ontology, encompassing molecular, physiological, biochemical, and agronomic dimensions of plant adaptation. PlantDeBERTa exhibits strong generalization capabilities across entity types and demonstrates the feasibility of robust domain adaptation in low-resource scientific fields. By providing a scalable and reproducible framework for high-resolution entity recognition, PlantDeBERTa bridges a critical gap in agricultural NLP and paves the way for intelligent, data-driven systems in plant genomics, phenomics, and agronomic knowledge discovery. Our model is publicly released to promote transparency and accelerate cross-disciplinary innovation in computational plant science.

[52] PromptSuite: A Task-Agnostic Framework for Multi-Prompt Generation

Eliya Habba, Noam Dahan, Gili Lior, Gabriel Stanovsky

Main category: cs.CL

TL;DR: PromptSuite is a framework for automatically generating diverse prompt variations to enable robust multi-prompt evaluation of LLMs, addressing the unreliability of single-prompt testing.

DetailsMotivation: Single-prompt evaluation of LLMs is unreliable due to performance sensitivity to small prompt changes, but generating meaningful prompt variations for robust evaluation is challenging and limits practical adoption.

Method: PromptSuite uses a modular prompt design that allows controlled perturbations to each component, is extensible to support new components and perturbation types, and works out-of-the-box across various tasks and benchmarks.
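
As a rough illustration of modular, per-component prompt perturbation, the sketch below enumerates component combinations and applies a surface-form perturbation; the component names and perturbation are illustrative and do not reflect PromptSuite's actual API.

```python
# Rough sketch of a modular prompt with per-component variations.
import itertools
import random

COMPONENTS = {
    "instruction": ["Answer the question.", "Please answer the following question."],
    "demonstration": ["Q: 2+2? A: 4", ""],
    "query": ["Q: What is the capital of France? A:"],
}

def paraphrase_case(text: str) -> str:
    return text.lower()  # toy surface-form perturbation

def generate_variants(max_n: int = 5) -> list[str]:
    variants = []
    for combo in itertools.product(*COMPONENTS.values()):
        prompt = "\n".join(part for part in combo if part)
        variants.append(prompt)
        variants.append(paraphrase_case(prompt))
    random.seed(0)
    return random.sample(variants, min(max_n, len(variants)))

for v in generate_variants():
    print("---\n" + v)
```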

Result: Case studies demonstrate that PromptSuite provides meaningful prompt variations that support strong evaluation practices, enabling more reliable LLM assessment.

Conclusion: PromptSuite offers a flexible, extensible solution for automatic prompt generation that facilitates robust multi-prompt evaluation of LLMs, with all resources made publicly available.

Abstract: Evaluating LLMs with a single prompt has proven unreliable, with small changes leading to significant performance differences. However, generating the prompt variations needed for a more robust multi-prompt evaluation is challenging, limiting its adoption in practice. To address this, we introduce PromptSuite, a framework that enables the automatic generation of various prompts. PromptSuite is flexible - working out of the box on a wide range of tasks and benchmarks. It follows a modular prompt design, allowing controlled perturbations to each component, and is extensible, supporting the addition of new components and perturbation types. Through a series of case studies, we show that PromptSuite provides meaningful variations to support strong evaluation practices. All resources, including the Python API, source code, user-friendly web interface, and demonstration video, are available at: https://eliyahabba.github.io/PromptSuite/.

[53] Measuring Stereotype and Deviation Biases in Large Language Models

Daniel Wang, Eli Brignac, Minjia Mao, Xiao Fang

Main category: cs.CL

TL;DR: LLMs exhibit significant stereotype bias (associating specific traits with demographic groups) and deviation bias (disparity between generated and real-world demographic distributions) when generating profiles, revealing potential harms in LLM outputs.

DetailsMotivation: To investigate limitations and potential risks of large language models by examining two types of bias - stereotype bias and deviation bias - that may occur when LLMs infer user attributes and generate content.

Method: Asked four advanced LLMs to generate profiles of individuals and examined associations between demographic groups and attributes like political affiliation, religion, and sexual orientation.
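
To make the deviation-bias idea concrete, the sketch below compares a generated demographic distribution with a real-world reference using total variation distance; the numbers are made up, and TV distance is one possible disparity measure rather than the paper's stated metric.

```python
# Sketch of quantifying deviation bias as the gap between generated and
# reference demographic distributions (total variation distance).
def total_variation(p: dict[str, float], q: dict[str, float]) -> float:
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

generated = {"group_a": 0.70, "group_b": 0.25, "group_c": 0.05}   # from LLM profiles
real_world = {"group_a": 0.55, "group_b": 0.35, "group_c": 0.10}  # census-style reference
print(f"deviation bias (TV distance): {total_variation(generated, real_world):.3f}")
```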

Result: All examined LLMs exhibited both significant stereotype bias and deviation bias towards multiple demographic groups.

Conclusion: The findings uncover biases in LLM-generated outputs and shed light on the potential harms when LLMs infer user attributes, highlighting important limitations and risks of current language models.

Abstract: Large language models (LLMs) are widely applied across diverse domains, raising concerns about their limitations and potential risks. In this study, we investigate two types of bias that LLMs may display: stereotype bias and deviation bias. Stereotype bias refers to when LLMs consistently associate specific traits with a particular demographic group. Deviation bias reflects the disparity between the demographic distributions extracted from LLM-generated content and real-world demographic distributions. By asking four advanced LLMs to generate profiles of individuals, we examine the associations between each demographic group and attributes such as political affiliation, religion, and sexual orientation. Our experimental results show that all examined LLMs exhibit both significant stereotype bias and deviation bias towards multiple groups. Our findings uncover the biases that occur when LLMs infer user attributes and shed light on the potential harms of LLM-generated outputs.

[54] Towards Reliable Multi-Agent Systems for Marketing Applications via Reflection, Memory, and Planning

Lorenzo Jaime Yu Flores, Junyi Shen, Goodman Gu

Main category: cs.CL

TL;DR: RAMP framework uses LLM agents with iterative planning, tool usage, verification, and memory to improve audience curation accuracy by 28 percentage points and raise user satisfaction.

DetailsMotivation: Limited research on reliability of LLM agents in real-world applications, particularly for marketing tasks like audience curation.

Method: Multi-agent framework with iterative planning, tool calling, output verification, reflection, and long-term memory store for client-specific knowledge.
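
A skeleton of such a plan / call-tools / verify / reflect loop with a long-term memory store is sketched below; every function and the memory contents are hypothetical placeholders following the description, not the RAMP implementation.

```python
# Skeleton of an iterative plan-act-verify-reflect agent loop with memory.
MEMORY: dict[str, str] = {"client_x": "excludes users who opted out of email"}

def plan(query: str, memory: dict) -> str:
    return f"plan for: {query} (constraints: {memory.get('client_x', '')})"

def call_tools(plan_text: str) -> list[str]:
    return ["audience_segment_v1"]  # pretend tool output

def verify(result: list[str]) -> bool:
    return len(result) > 0

def reflect(result: list[str]) -> str:
    return "widen filters"  # suggestion used to revise the next plan

def run(query: str, max_iters: int = 3) -> list[str]:
    result: list[str] = []
    for _ in range(max_iters):
        result = call_tools(plan(query, MEMORY))
        if verify(result):
            break
        query = f"{query}; hint: {reflect(result)}"
    MEMORY[query] = str(result)  # persist the answered query for later reuse
    return result

print(run("US coffee drinkers aged 25-34"))
```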

Result: 28 percentage point accuracy increase on 88 evaluation queries, +20 percentage points recall improvement with more iterations, higher user satisfaction.

Conclusion: LLM planning with memory and iterative verification provides practical insights for deploying reliable AI systems in dynamic industry environments.

Abstract: Recent advances in large language models (LLMs) enabled the development of AI agents that can plan and interact with tools to complete complex tasks. However, literature on their reliability in real-world applications remains limited. In this paper, we introduce a multi-agent framework for a marketing task: audience curation. To solve this, we introduce a framework called RAMP that iteratively plans, calls tools, verifies the output, and generates suggestions to improve the quality of the audience generated. Additionally, we equip the model with a long-term memory store, which is a knowledge base of client-specific facts and past queries. Overall, we demonstrate the use of LLM planning and memory, which increases accuracy by 28 percentage points on a set of 88 evaluation queries. Moreover, we show the impact of iterative verification and reflection on more ambiguous queries, showing progressively better recall (roughly +20 percentage points) with more verify/reflect iterations on a smaller challenge set, and higher user satisfaction. Our results provide practical insights for deploying reliable LLM-based systems in dynamic, industry-facing environments.

Xin Dai, Buqiang Xu, Zhenghao Liu, Yukun Yan, Huiyuan Xie, Xiaoyuan Yi, Shuo Wang, Ge Yu

Main category: cs.CL

TL;DR: LegalΔ is a reinforcement learning framework that enhances legal reasoning in LLMs by maximizing information gain between direct answers and chain-of-thought reasoning, producing more reliable and interpretable legal judgments.

DetailsMotivation: Existing legal LLMs struggle with generating reliable and interpretable reasoning processes, often defaulting to fast-thinking behavior without explicit multi-step reasoning, which limits effectiveness in complex legal scenarios requiring rigorous justification.

Method: A two-stage reinforcement learning framework: (1) distills latent reasoning capabilities from DeepSeek-R1 (Large Reasoning Model), and (2) refines reasoning quality via differential comparisons with a multidimensional reward mechanism assessing structural coherence and legal-domain specificity.

Result: Outperforms strong baselines on multiple legal reasoning tasks in both accuracy and interpretability, consistently producing more robust and trustworthy legal judgments without relying on labeled preference data.

Conclusion: LegalΔ successfully addresses the interpretability challenge in legal AI by encouraging meaningful reasoning patterns through information gain maximization, demonstrating significant improvements in legal reasoning quality and reliability.

Abstract: Legal Artificial Intelligence (LegalAI) has achieved notable advances in automating judicial decision-making with the support of Large Language Models (LLMs). However, existing legal LLMs still struggle to generate reliable and interpretable reasoning processes. They often default to fast-thinking behavior by producing direct answers without explicit multi-step reasoning, limiting their effectiveness in complex legal scenarios that demand rigorous justification. To address this challenge, we propose LegalΔ, a reinforcement learning framework designed to enhance legal reasoning through chain-of-thought guided information gain. During training, LegalΔ employs a dual-mode input setup-comprising direct answer and reasoning-augmented modes-and maximizes the information gain between them. This encourages the model to acquire meaningful reasoning patterns rather than generating superficial or redundant explanations. LegalΔ follows a two-stage approach: (1) distilling latent reasoning capabilities from a powerful Large Reasoning Model (LRM), DeepSeek-R1, and (2) refining reasoning quality via differential comparisons, combined with a multidimensional reward mechanism that assesses both structural coherence and legal-domain specificity. Experimental results on multiple legal reasoning tasks demonstrate that LegalΔ outperforms strong baselines in both accuracy and interpretability. It consistently produces more robust and trustworthy legal judgments without relying on labeled preference data. All code and data will be released at https://github.com/NEUIR/LegalDelta.

[56] MedKGent: A Large Language Model Agent Framework for Constructing Temporally Evolving Medical Knowledge Graph

Duzhen Zhang, Zixiao Wang, Zhong-Zhi Li, Yahan Yu, Shuncheng Jia, Jiahua Dong, Haotian Xu, Xing Wu, Yingying Zhang, Tielin Zhang, Jie Yang, Xiuying Chen, Le Song

Main category: cs.CL

TL;DR: MedKGent is an LLM agent framework that constructs temporally evolving medical knowledge graphs from PubMed abstracts, achieving accuracy approaching 90% and demonstrating significant improvements on medical QA benchmarks.

DetailsMotivation: Current KG construction methods have limited generalizability, treat biomedical corpora as static, and ignore temporal dynamics and contextual uncertainty of evolving medical knowledge.

Method: Uses two specialized agents (Extractor and Constructor) powered by Qwen2.5-32B-Instruct to incrementally build KGs day-by-day from 10M+ PubMed abstracts (1975-2023), with sampling-based confidence scoring and temporal integration.
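
The sketch below illustrates the sampling-based confidence idea: run extraction several times with sampling and keep only triples that recur often enough. `extract_triples`, the toy triple pool, and the 0.6 threshold are illustrative assumptions, not the paper's settings.

```python
# Sketch of sampling-based confidence scoring for extracted knowledge triples.
from collections import Counter
import random

def extract_triples(abstract: str, seed: int) -> list[tuple[str, str, str]]:
    random.seed(seed)  # placeholder for a stochastic LLM extraction run
    pool = [("aspirin", "treats", "headache"), ("aspirin", "causes", "nausea")]
    return random.sample(pool, k=random.randint(1, len(pool)))

def confident_triples(abstract: str, n_samples: int = 5, threshold: float = 0.6):
    counts = Counter(t for s in range(n_samples) for t in extract_triples(abstract, s))
    return {t: c / n_samples for t, c in counts.items() if c / n_samples >= threshold}

print(confident_triples("..."))
```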

Result: Constructed KG with 156,275 entities and 2,971,384 relational triples; accuracy approaching 90% in assessments by two SOTA LLMs and three domain experts; significant improvements in RAG-based medical QA across 7 benchmarks using 5 leading LLMs.

Conclusion: MedKGent successfully addresses temporal dynamics in medical knowledge, enabling high-quality KG construction with practical utility for drug repurposing and medical question answering.

Abstract: The rapid expansion of medical literature presents growing challenges for structuring and integrating domain knowledge at scale. Knowledge Graphs (KGs) offer a promising solution by enabling efficient retrieval, automated reasoning, and knowledge discovery. However, current KG construction methods often rely on supervised pipelines with limited generalizability or naively aggregate outputs from Large Language Models (LLMs), treating biomedical corpora as static and ignoring the temporal dynamics and contextual uncertainty of evolving knowledge. To address these limitations, we introduce MedKGent, a LLM agent framework for constructing temporally evolving medical KGs. Leveraging over 10 million PubMed abstracts published between 1975 and 2023, we simulate the emergence of biomedical knowledge via a fine-grained daily time series. MedKGent incrementally builds the KG in a day-by-day manner using two specialized agents powered by the Qwen2.5-32B-Instruct model. The Extractor Agent identifies knowledge triples and assigns confidence scores via sampling-based estimation, which are used to filter low-confidence extractions and inform downstream processing. The Constructor Agent incrementally integrates the retained triples into a temporally evolving graph, guided by confidence scores and timestamps to reinforce recurring knowledge and resolve conflicts. The resulting KG contains 156,275 entities and 2,971,384 relational triples. Quality assessments by two SOTA LLMs and three domain experts demonstrate an accuracy approaching 90%, with strong inter-rater agreement. To evaluate downstream utility, we conduct RAG across seven medical question answering benchmarks using five leading LLMs, consistently observing significant improvements over non-augmented baselines. Case studies further demonstrate the KG’s value in literature-based drug repurposing via confidence-aware causal inference.

[57] CRED-SQL: Enhancing Real-world Large Scale Database Text-to-SQL Parsing through Cluster Retrieval and Execution Description

Shaoming Duan, Zirui Wang, Chuanyi Liu, Zhibin Zhu, Yuhao Zhang, Peiyi Han, Liang Yan, Zewu Penge

Main category: cs.CL

TL;DR: CRED-SQL is a framework that addresses semantic mismatch in Text-to-SQL systems for large databases through cluster-based schema retrieval and an intermediate Execution Description Language (EDL) representation, achieving state-of-the-art performance.

DetailsMotivation: Large language models have improved Text-to-SQL accuracy, but semantic mismatch between natural language questions and SQL queries remains a critical challenge, especially in large databases where similar attributes cause schema linking issues and semantic drift.

Method: CRED-SQL uses cluster-based large-scale schema retrieval to identify relevant tables/columns, then introduces Execution Description Language (EDL) as an intermediate natural language representation to bridge NLQ-SQL gap. The task is decomposed into Text-to-EDL and EDL-to-SQL stages.

Result: Extensive experiments on SpiderUnion and BirdUnion benchmarks show CRED-SQL achieves new state-of-the-art performance, validating its effectiveness and scalability for large-scale cross-domain databases.

Conclusion: CRED-SQL successfully addresses semantic mismatch in Text-to-SQL systems through innovative cluster retrieval and intermediate language representation, demonstrating superior performance and scalability for large database applications.

Abstract: Recent advances in large language models (LLMs) have significantly improved the accuracy of Text-to-SQL systems. However, a critical challenge remains: the semantic mismatch between natural language questions (NLQs) and their corresponding SQL queries. This issue is exacerbated in large-scale databases, where semantically similar attributes hinder schema linking and semantic drift during SQL generation, ultimately reducing model accuracy. To address these challenges, we introduce CRED-SQL, a framework designed for large-scale databases that integrates Cluster Retrieval and Execution Description. CRED-SQL first performs cluster-based large-scale schema retrieval to pinpoint the tables and columns most relevant to a given NLQ, alleviating schema mismatch. It then introduces an intermediate natural language representation-Execution Description Language (EDL)-to bridge the gap between NLQs and SQL. This reformulation decomposes the task into two stages: Text-to-EDL and EDL-to-SQL, leveraging LLMs’ strong general reasoning capabilities while reducing semantic deviation. Extensive experiments on two large-scale, cross-domain benchmarks-SpiderUnion and BirdUnion-demonstrate that CRED-SQL achieves new state-of-the-art (SOTA) performance, validating its effectiveness and scalability. Our code is available at https://github.com/smduan/CRED-SQL.git

[58] Atom-Searcher: Enhancing Agentic Deep Research via Fine-Grained Atomic Thought Reward

Yong Deng, Guoqing Wang, Zhenzhe Ying, Xiaofeng Wu, Jinzhen Lin, Wenwen Xiong, Yuqin Dai, Shuo Yang, Zhanwei Zhang, Qiwen Wang, Yang Qin, Changhua Meng

Main category: cs.CL

TL;DR: Atom-Searcher is a novel RL framework that decomposes reasoning into fine-grained Atomic Thought units with specialized rewards, improving multi-hop reasoning and strategic search in agentic deep research.

DetailsMotivation: Current LLMs struggle with complex tasks due to static knowledge, and existing RAG approaches have limitations in multi-hop reasoning. Agentic approaches using outcome-based RL face issues like conflicting gradients and reward sparsity.

Method: Proposes Atomic Thought paradigm that decomposes reasoning into fine-grained functional units supervised by Reasoning Reward Models (RRMs) providing Atomic Thought Rewards. Atom-Searcher integrates this with a curriculum-inspired reward schedule that transitions from process-level to outcome rewards.
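
The curriculum-style reward schedule can be pictured as a blend that starts dominated by process-level (atomic-thought) rewards and shifts toward outcome rewards over training; the linear schedule and example values below are illustrative assumptions, not the paper's exact formulation.

```python
# Toy curriculum-style blend of process-level and outcome rewards.
def blended_reward(atomic_reward: float, outcome_reward: float,
                   step: int, total_steps: int) -> float:
    w_outcome = min(1.0, step / total_steps)  # ramps from 0 to 1 over training
    w_process = 1.0 - w_outcome
    return w_process * atomic_reward + w_outcome * outcome_reward

for step in (0, 500, 1000):
    print(step, blended_reward(atomic_reward=0.8, outcome_reward=1.0,
                               step=step, total_steps=1000))
```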

Result: Experiments on seven benchmarks show consistent improvements over state-of-the-art. The framework scales computation at test-time, provides better supervision anchors, and exhibits more interpretable, human-like reasoning patterns.

Conclusion: Atom-Searcher effectively addresses RL training challenges in agentic deep research by decomposing reasoning into atomic units with specialized rewards, leading to improved performance and more human-like reasoning patterns.

Abstract: Large language models (LLMs) exhibit remarkable problem-solving abilities, but struggle with complex tasks due to static internal knowledge. Retrieval-Augmented Generation (RAG) enhances access to external information, yet remains limited in multi-hop reasoning and strategic search due to rigid workflows. Recent advancements in agentic deep research empower LLMs to autonomously reason, search, and synthesize information. However, current approaches relying on outcome-based reinforcement learning (RL) face critical issues such as conflicting gradients and reward sparsity, limiting performance gains and training efficiency. To address these, we first propose Atomic Thought, a novel LLM thinking paradigm that decomposes reasoning into fine-grained functional units. These units are supervised by Reasoning Reward Models (RRMs), which provide Atomic Thought Rewards (ATR) for fine-grained guidance. Building on this, we propose Atom-Searcher, a novel RL framework for agentic deep research that integrates Atomic Thought and ATR. Atom-Searcher uses a curriculum-inspired reward schedule, prioritizing process-level ATR early and transitioning to outcome rewards, accelerating convergence on effective reasoning paths. Experiments on seven benchmarks show consistent improvements over the state-of-the-art. Key advantages include: (1) Atom-Searcher scales computation at test-time. (2) Atomic Thought provides supervision anchors for RRMs, bridging deep research tasks and RRMs. (3) Atom-Searcher exhibits more interpretable, human-like reasoning patterns.

cs.CV

[59] YOLO11-CR: a Lightweight Convolution-and-Attention Framework for Accurate Fatigue Driving Detection

Zhebin Jin, Ligang Dong

Main category: cs.CV

TL;DR: YOLO11-CR: A lightweight vision-based model for real-time driver fatigue detection using CNN-Transformer fusion and rectangular calibration modules to improve accuracy for small/occluded objects.

DetailsMotivation: Driver fatigue detection is critical for road safety, but existing methods are either intrusive (physiological/vehicle-based) or have limitations in detecting small/occluded objects and multi-scale features in vision-based approaches.

Method: Proposes YOLO11-CR with two key modules: Convolution-and-Attention Fusion Module (CAFM) integrates CNN local features with Transformer global context, and Rectangular Calibration Module (RCM) captures horizontal/vertical context for better spatial localization.
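
A rough PyTorch sketch of the convolution-and-attention fusion idea is given below: local convolutional features are combined with a global self-attention pass over the same feature map. The layer sizes and the additive fusion are assumptions for illustration, not the paper's exact CAFM design.

```python
# Rough sketch: fuse local conv features with global self-attention context.
import torch
import torch.nn as nn

class ConvAttnFusion(nn.Module):
    def __init__(self, channels: int, heads: int = 4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.SiLU(),
        )
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        local = self.conv(x)                            # local CNN branch
        tokens = x.flatten(2).transpose(1, 2)           # (B, H*W, C)
        global_ctx, _ = self.attn(tokens, tokens, tokens)
        global_ctx = self.norm(global_ctx).transpose(1, 2).reshape(b, c, h, w)
        return local + global_ctx                       # simple additive fusion

print(ConvAttnFusion(32)(torch.randn(1, 32, 16, 16)).shape)
```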

Result: Achieves 87.17% precision, 83.86% recall, 88.09% mAP@50, and 55.93% mAP@50-95 on DSM dataset, significantly outperforming baseline models. Ablation studies confirm effectiveness of CAFM and RCM modules.

Conclusion: YOLO11-CR provides a practical high-performance solution for in-vehicle fatigue monitoring with strong real-world deployment potential and future enhancement opportunities in temporal modeling and multi-modal integration.

Abstract: Driver fatigue detection is of paramount importance for intelligent transportation systems due to its critical role in mitigating road traffic accidents. While physiological and vehicle dynamics-based methods offer accuracy, they are often intrusive, hardware-dependent, and lack robustness in real-world environments. Vision-based techniques provide a non-intrusive and scalable alternative, but still face challenges such as poor detection of small or occluded objects and limited multi-scale feature modeling. To address these issues, this paper proposes YOLO11-CR, a lightweight and efficient object detection model tailored for real-time fatigue detection. YOLO11-CR introduces two key modules: the Convolution-and-Attention Fusion Module (CAFM), which integrates local CNN features with global Transformer-based context to enhance feature expressiveness; and the Rectangular Calibration Module (RCM), which captures horizontal and vertical contextual information to improve spatial localization, particularly for profile faces and small objects like mobile phones. Experiments on the DSM dataset demonstrated that YOLO11-CR achieves a precision of 87.17%, recall of 83.86%, mAP@50 of 88.09%, and mAP@50-95 of 55.93%, outperforming baseline models significantly. Ablation studies further validate the effectiveness of the CAFM and RCM modules in improving both sensitivity and localization accuracy. These results demonstrate that YOLO11-CR offers a practical and high-performing solution for in-vehicle fatigue monitoring, with strong potential for real-world deployment and future enhancements involving temporal modeling, multi-modal data integration, and embedded optimization.

[60] MIRAGE: Towards AI-Generated Image Detection in the Wild

Cheng Xia, Manxi Lin, Jiexiang Tan, Xiaoxiong Du, Yang Qiu, Junjun Zheng, Xiangheng Kong, Yuning Jiang, Bo Zheng

Main category: cs.CV

TL;DR: Mirage-R1 is a vision-language model with reflective reasoning that achieves state-of-the-art performance on in-the-wild AI-generated image detection, outperforming existing detectors by 5-10% on challenging benchmarks.

DetailsMotivation: AI-generated images pose significant threats to information security and public trust. Existing detectors fail in real-world scenarios where images are noisy, come from multiple generative models, and undergo quality editing.

Method: Proposes Mirage-R1, a vision-language model with heuristic-to-analytic reasoning. Uses two-stage training: supervised fine-tuning followed by reinforcement learning. Implements inference-time adaptive thinking for balancing speed and accuracy.

Result: Extensive experiments show Mirage-R1 leads state-of-the-art detectors by 5% on the Mirage benchmark and 10% on public benchmarks.

Conclusion: The proposed reflective reasoning mechanism and adaptive thinking strategy effectively address the challenges of in-the-wild AIGI detection, providing robust performance while maintaining practical inference speeds.

Abstract: The spreading of AI-generated images (AIGI), driven by advances in generative AI, poses a significant threat to information security and public trust. Existing AIGI detectors, while effective against images in clean laboratory settings, fail to generalize to in-the-wild scenarios. These real-world images are noisy, varying from "obviously fake" images to realistic ones derived from multiple generative models and further edited for quality control. We address in-the-wild AIGI detection in this paper. We introduce Mirage, a challenging benchmark designed to emulate the complexity of in-the-wild AIGI. Mirage is constructed from two sources: (1) a large corpus of Internet-sourced AIGI verified by human experts, and (2) a synthesized dataset created through the collaboration between multiple expert generators, closely simulating the realistic AIGI in the wild. Building on this benchmark, we propose Mirage-R1, a vision-language model with heuristic-to-analytic reasoning, a reflective reasoning mechanism for AIGI detection. Mirage-R1 is trained in two stages: a supervised-fine-tuning cold start, followed by a reinforcement learning stage. By further adopting an inference-time adaptive thinking strategy, Mirage-R1 is able to provide either a quick judgment or a more robust and accurate conclusion, effectively balancing inference speed and performance. Extensive experiments show that our model leads state-of-the-art detectors by 5% and 10% on Mirage and the public benchmark, respectively. The benchmark and code will be made publicly available.

[61] Mitigating Easy Option Bias in Multiple-Choice Question Answering

Hao Zhang, Chen Li, Basura Fernando

Main category: cs.CV

TL;DR: The paper identifies an Easy-Options Bias (EOB) in VQA benchmarks where VLMs can answer correctly using only vision and options without questions, and introduces GroundAttack to generate hard negative options for more realistic evaluation.

DetailsMotivation: Current VQA benchmarks contain a bias that allows vision-language models to exploit visual-option similarity shortcuts without understanding questions, leading to inflated performance metrics that don't reflect true QA capabilities.

Method: The authors conduct grounding experiments to identify the bias, then develop GroundAttack - an automated toolkit that generates visually plausible hard negative options to eliminate the EOB issue from datasets.

Result: When tested on EOB-free annotations created for NExT-QA and MMStar, current VLMs drop to near-random accuracy under vision+option settings and show non-saturated performance under full question+vision+option settings.

Conclusion: The EOB issue significantly inflates VLM performance in current benchmarks, and the proposed GroundAttack method successfully creates more challenging, bias-free evaluations that better reflect models’ true visual question answering abilities.

Abstract: In this early study, we observe an Easy-Options Bias (EOB) issue in some multiple-choice Visual Question Answering (VQA) benchmarks such as MMStar, RealWorldQA, SEED-Bench, Next-QA, STAR benchmark and Video-MME. This bias allows vision-language models (VLMs) to select the correct answer using only the vision (V) and options (O) as inputs, without the need for the question (Q). Through grounding experiments, we attribute the bias to an imbalance in visual relevance: the correct answer typically aligns more closely with the visual contents than the negative options in feature space, creating a shortcut for VLMs to infer the answer via simply vision-option similarity matching. To fix this, we introduce GroundAttack, a toolkit that automatically generates hard negative options as visually plausible as the correct answer. We apply it to the NExT-QA and MMStar datasets, creating new EOB-free annotations. On these EOB-free annotations, current VLMs approach to random accuracies under (V+O) settings, and drop to non-saturated accuracies under (V+Q+O) settings, providing a more realistic evaluation of VLMs’ QA ability. Codes and new annotations will be released soon.

[62] DianJin-OCR-R1: Enhancing OCR Capabilities via a Reasoning-and-Tool Interleaved Vision-Language Model

Qian Chen, Xianyin Zhang, Lifan Guo, Feng Chen, Chi Zhang

Main category: cs.CV

TL;DR: DianJin-OCR-R1 is a reasoning-enhanced framework that combines vision-language models with expert OCR tools to reduce hallucinations and improve document image parsing performance.

DetailsMotivation: Large vision-language models suffer from hallucinations and underperform specialized OCR models on domain-specific tasks, despite their end-to-end capabilities.

Method: A reasoning-and-tool interleaved framework where the model first performs OCR, then calls expert tools for reference results, and finally rethinks the reasoning process to provide final recognition output.

Result: Outperforms both non-reasoning counterparts and expert OCR models on ReST and OmniDocBench benchmarks.

Conclusion: The reasoning-enhanced framework effectively mitigates hallucinations and leverages expert models to achieve superior OCR performance at lower cost.

Abstract: Recent advances in large vision-language models (LVLMs) have enabled a new paradigm of end-to-end document image parsing, excelling in Optical Character Recognition (OCR) tasks such as text, table, and formula recognition. However, generative LVLMs, similarly to large language models (LLMs), are prone to hallucinations–generating words that do not exist in input images. Furthermore, LVLMs are designed for general purposes and tend to be less effective on OCR tasks compared to expert models that are trained on domain-specific datasets. In this paper, we propose DianJin-OCR-R1, a reasoning-enhanced framework designed to address these limitations through training reasoning-and-tool interleaved VLMs. Given a recognition instruction, our DianJin-OCR-R1 model first recognizes the content in the input image by its own OCR capabilities, and then calls other tools (i.e., other expert models) to obtain their results as references, finally looks again the image and rethinks about the reasoning process to provide the final recognized content. Since architectures of expert models are tailored for specific OCR tasks, which makes them less prone to hallucinations, their results can help VLMs mitigate hallucinations. Additionally, expert models are typically smaller in scale and easy to iterate, enabling performance improvements for VLMs at a lower cost. We evaluate our model on ReST and OmniDocBench, and experimental results show that our DianJin-OCR-R1 models consistently outperform their non-reasoning counterparts and expert OCR models, which proves the effectiveness of our method.

[63] Exploration of Deep Learning Based Recognition for Urdu Text

Sumaiya Fazal, Sheeraz Ahmed

Main category: cs.CV

TL;DR: Proposed CNN-based Urdu OCR system using component classification and hierarchical neural networks, achieving 99% accuracy on ligature components.

DetailsMotivation: Urdu's cursive script and complex structure make traditional segmentation-based recognition error-prone, requiring a component-based approach.

Method: Used convolutional neural networks for automatic feature learning, trained on Urdu text dataset generated through character permutations and filtered with connected component technique to obtain ligatures only. Implemented hierarchical neural network with two levels for character permutations and component classification.

Result: Achieved 99% (0.99) accuracy for component classification.

Conclusion: The component-based CNN approach successfully addresses Urdu’s complex geometrical structure and context sensitivity issues, providing high-accuracy optical character recognition.

Abstract: Urdu is a cursive script language and has similarities with Arabic and many other South Asian languages. Urdu is difficult to classify due to its complex geometrical and morphological structure. Character classification can be processed further if segmentation technique is efficient, but due to context sensitivity in Urdu, segmentation-based recognition often results with high error rate. Our proposed approach for Urdu optical character recognition system is a component-based classification relying on automatic feature learning technique called convolutional neural network. CNN is trained and tested on Urdu text dataset, which is generated through permutation process of three characters and further proceeds to discarding unnecessary images by applying connected component technique in order to obtain ligature only. Hierarchical neural network is implemented with two levels to deal with three degrees of character permutations and component classification. Our model successfully achieved 0.99% for component classification.

[64] CLoE: Curriculum Learning on Endoscopic Images for Robust MES Classification

Zeynep Ozdemir, Hacer Yalim Keles, Omer Ozgur Tanriover

Main category: cs.CV

TL;DR: CLoE is a curriculum learning framework that improves ulcerative colitis severity classification by accounting for label noise and ordinal structure, using image quality as a proxy for annotation confidence and achieving state-of-the-art results.

DetailsMotivation: MES classification for ulcerative colitis is challenging due to label noise from inter-observer variability and the ordinal nature of the score, which standard models often ignore.

Method: Proposes CLoE framework that uses image quality (estimated via BBPS labels) as proxy for annotation confidence to order samples from easy to hard, combined with ResizeMix augmentation for robustness.
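
A simplified version of the curriculum pacing is sketched below: samples are ordered by an image-quality score (the proxy for label confidence) and a growing, easy-to-hard fraction of the data is exposed each epoch. The scores, starting fraction, and linear pacing are synthetic assumptions for illustration.

```python
# Simplified easy-to-hard curriculum pacing driven by a quality score.
import numpy as np

def curriculum_indices(quality: np.ndarray, epoch: int, total_epochs: int,
                       start_frac: float = 0.3) -> np.ndarray:
    order = np.argsort(-quality)  # cleanest (highest-quality) samples first
    frac = start_frac + (1.0 - start_frac) * min(1.0, epoch / max(1, total_epochs - 1))
    return order[: max(1, int(frac * len(order)))]

quality = np.random.default_rng(0).uniform(size=20)  # e.g. BBPS-derived scores
for epoch in (0, 5, 9):
    print(epoch, len(curriculum_indices(quality, epoch, total_epochs=10)))
```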

Result: Achieves 82.5% accuracy and QWK of 0.894 on LIMUC dataset with ConvNeXt-Tiny, outperforming supervised and self-supervised baselines on both LIMUC and HyperKvasir datasets.

Conclusion: Difficulty-aware training strategies like CLoE show strong potential for improving ordinal classification under label uncertainty in medical imaging applications.

Abstract: Estimating disease severity from endoscopic images is essential in assessing ulcerative colitis, where the Mayo Endoscopic Subscore (MES) is widely used to grade inflammation. However, MES classification remains challenging due to label noise from inter-observer variability and the ordinal nature of the score, which standard models often ignore. We propose CLoE, a curriculum learning framework that accounts for both label reliability and ordinal structure. Image quality, estimated via a lightweight model trained on Boston Bowel Preparation Scale (BBPS) labels, is used as a proxy for annotation confidence to order samples from easy (clean) to hard (noisy). This curriculum is further combined with ResizeMix augmentation to improve robustness. Experiments on the LIMUC and HyperKvasir datasets, using both CNNs and Transformers, show that CLoE consistently improves performance over strong supervised and self-supervised baselines. For instance, ConvNeXt-Tiny reaches 82.5% accuracy and a QWK of 0.894 on LIMUC with low computational cost. These results highlight the potential of difficulty-aware training strategies for improving ordinal classification under label uncertainty. Code will be released at https://github.com/zeynepozdemir/CLoE.

[65] GaitCrafter: Diffusion Model for Biometric Preserving Gait Synthesis

Sirshapan Mitra, Yogesh S. Rawat

Main category: cs.CV

TL;DR: GaitCrafter is a diffusion-based framework that synthesizes realistic gait silhouette sequences, enabling controllable generation of identity-preserving gait patterns and novel identities for improved gait recognition while preserving privacy.

DetailsMotivation: Gait recognition lacks large-scale labeled datasets and faces challenges in collecting diverse gait samples while preserving privacy, limiting its effectiveness.

Method: Trains a video diffusion model from scratch exclusively on gait silhouette data to generate temporally consistent sequences controllable by covariates like clothing, carried objects, and view angle.

Result: Incorporating synthetic samples improves gait recognition performance, especially under challenging conditions, and novel identities generated through embedding interpolation exhibit unique, consistent gait patterns.

Conclusion: The work demonstrates successful application of diffusion models for high-quality, controllable, and privacy-aware gait data generation, advancing the field of gait recognition.

Abstract: Gait recognition is a valuable biometric task that enables the identification of individuals from a distance based on their walking patterns. However, it remains limited by the lack of large-scale labeled datasets and the difficulty of collecting diverse gait samples for each individual while preserving privacy. To address these challenges, we propose GaitCrafter, a diffusion-based framework for synthesizing realistic gait sequences in the silhouette domain. Unlike prior works that rely on simulated environments or alternative generative models, GaitCrafter trains a video diffusion model from scratch, exclusively on gait silhouette data. Our approach enables the generation of temporally consistent and identity-preserving gait sequences. Moreover, the generation process is controllable-allowing conditioning on various covariates such as clothing, carried objects, and view angle. We show that incorporating synthetic samples generated by GaitCrafter into the gait recognition pipeline leads to improved performance, especially under challenging conditions. Additionally, we introduce a mechanism to generate novel identities-synthetic individuals not present in the original dataset-by interpolating identity embeddings. These novel identities exhibit unique, consistent gait patterns and are useful for training models while maintaining privacy of real subjects. Overall, our work takes an important step toward leveraging diffusion models for high-quality, controllable, and privacy-aware gait data generation.

[66] RAPNet: A Receptive-Field Adaptive Convolutional Neural Network for Pansharpening

Tao Tang, Chengxu Yang

Main category: cs.CV

TL;DR: RAPNet introduces content-adaptive convolution for pansharpening, using spatially adaptive kernels and attention mechanisms to improve spatial detail extraction while maintaining spectral fidelity.

DetailsMotivation: Traditional CNNs use uniform convolutional kernels across all spatial positions, ignoring local content variations in pansharpening tasks, which limits their effectiveness in handling diverse remote sensing imagery.

Method: Proposes RAPNet with Receptive-field Adaptive Pansharpening Convolution (RAPConv) that generates spatially adaptive kernels based on local feature context, and Pansharpening Dynamic Feature Fusion (PAN-DFF) module with attention mechanism for optimal spatial-spectral balance.

Result: Comprehensive evaluations show RAPNet achieves superior performance compared to existing approaches on public datasets, with both quantitative metrics and qualitative assessments demonstrating improvements. Ablation studies confirm the effectiveness of the adaptive components.

Conclusion: RAPNet successfully addresses the limitations of uniform convolutional kernels in pansharpening by introducing content-adaptive convolution, leading to enhanced spatial detail extraction and better spectral preservation in fused remote sensing products.

Abstract: Pansharpening refers to the process of integrating a high resolution panchromatic (PAN) image with a lower resolution multispectral (MS) image to generate a fused product, which is pivotal in remote sensing. Despite the effectiveness of CNNs in addressing this challenge, they are inherently constrained by the uniform application of convolutional kernels across all spatial positions, overlooking local content variations. To overcome this issue, we introduce RAPNet, a new architecture that leverages content-adaptive convolution. At its core, RAPNet employs the Receptive-field Adaptive Pansharpening Convolution (RAPConv), designed to produce spatially adaptive kernels responsive to local feature context, thereby enhancing the precision of spatial detail extraction. Additionally, the network integrates the Pansharpening Dynamic Feature Fusion (PAN-DFF) module, which incorporates an attention mechanism to achieve an optimal balance between spatial detail enhancement and spectral fidelity. Comprehensive evaluations on publicly available datasets confirm that RAPNet delivers superior performance compared to existing approaches, as demonstrated by both quantitative metrics and qualitative assessments. Ablation analyses further substantiate the effectiveness of the proposed adaptive components.
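
The core idea of content-adaptive convolution can be illustrated with a simplified PyTorch sketch: a small branch predicts a per-pixel kernel from the local feature context and applies it via unfolding. This is a generic spatially adaptive convolution, not the exact RAPConv design; channel handling and normalization are deliberately simplified.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatiallyAdaptiveConv(nn.Module):
    """Toy content-adaptive convolution: a lightweight branch predicts a
    k x k kernel per spatial location, shared across channels, which is then
    applied to the unfolded input as a weighted sum over the neighborhood."""
    def __init__(self, channels, k=3):
        super().__init__()
        self.k = k
        self.kernel_pred = nn.Conv2d(channels, k * k, kernel_size=3, padding=1)

    def forward(self, x):
        b, c, h, w = x.shape
        kernels = torch.softmax(self.kernel_pred(x), dim=1)       # (B, k*k, H, W)
        patches = F.unfold(x, self.k, padding=self.k // 2)         # (B, C*k*k, H*W)
        patches = patches.view(b, c, self.k * self.k, h, w)
        out = (patches * kernels.unsqueeze(1)).sum(dim=2)          # per-pixel weighted sum
        return out
```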

[67] Prune2Drive: A Plug-and-Play Framework for Accelerating Vision-Language Models in Autonomous Driving

Minhao Xiong, Zichen Wen, Zhuangcheng Gu, Xuyang Liu, Rui Zhang, Hengrui Kang, Jiabing Yang, Junyuan Zhang, Weijia Li, Conghui He, Yafei Wang, Linfeng Zhang

Main category: cs.CV

TL;DR: Prune2Drive is a plug-and-play visual token pruning framework for multi-view vision-language models in autonomous driving that reduces computational overhead by selectively pruning 90% of visual tokens while maintaining performance.

DetailsMotivation: Vision-language models in autonomous driving face significant computational overhead from processing high-resolution multi-view images, which increases inference latency and memory consumption due to quadratic attention complexity.

Method: Proposes diversity-aware token selection inspired by farthest point sampling for semantic/spatial coverage, and a view-adaptive pruning controller that learns optimal pruning ratios per camera view without requiring model retraining.

Result: Achieves 6.4x speedup in prefilling phase and consumes only 13.4% of original FLOPs while retaining only 10% of visual tokens, with only 3% performance drop on DriveLM benchmark.

Conclusion: Prune2Drive provides an effective plug-and-play solution for reducing computational costs in multi-view VLMs for autonomous driving without performance degradation, making deployment more feasible.

Abstract: Vision-Language Models (VLMs) have emerged as a promising paradigm in autonomous driving (AD), offering a unified framework for perception, reasoning, and decision-making by jointly modeling visual inputs and natural language instructions. However, their deployment is hindered by the significant computational overhead incurred when processing high-resolution, multi-view images, a standard setup in AD systems with six or more synchronized cameras. This overhead stems from the large number of visual tokens generated during encoding, increasing inference latency and memory consumption due to the quadratic complexity of self-attention. To address these challenges, we propose Prune2Drive, a plug-and-play visual token pruning framework for multi-view VLMs in autonomous driving. Prune2Drive introduces two core innovations: (i) a diversity-aware token selection mechanism inspired by farthest point sampling, which prioritizes semantic and spatial coverage across views rather than relying solely on attention scores, and (ii) a view-adaptive pruning controller that learns optimal pruning ratios for each camera view based on their importance to downstream driving tasks. Unlike prior methods, Prune2Drive does not require model retraining or access to attention maps, making it compatible with modern efficient attention implementations. Extensive experiments on two large-scale multi-view driving benchmarks, DriveLM and DriveLMM-o1, show that Prune2Drive achieves significant speedups and memory savings while maintaining or improving task performance. When retaining only 10% of the visual tokens, our method achieves a 6.40$\times$ speedup in the prefilling phase and consumes 13.4% of the original FLOPs, with only a 3% performance drop on the DriveLM benchmark.
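
A rough sketch of diversity-aware token selection in the spirit of farthest point sampling (the paper's view-adaptive pruning controller is not shown); the token features, seed choice, and keep ratio below are illustrative.

```python
import torch

def farthest_point_token_selection(tokens, keep_ratio=0.1):
    """Greedily keep tokens that are farthest (in feature space) from the
    already-selected set, favoring semantic/spatial coverage over attention
    scores. `tokens` is (N, D); returns the indices of kept tokens."""
    n = tokens.shape[0]
    n_keep = max(1, int(n * keep_ratio))
    selected = [0]  # arbitrary seed token
    dists = torch.cdist(tokens, tokens[selected]).squeeze(1)  # distance to seed
    for _ in range(n_keep - 1):
        idx = int(torch.argmax(dists))                  # farthest from selected set
        selected.append(idx)
        new_d = torch.norm(tokens - tokens[idx], dim=1)
        dists = torch.minimum(dists, new_d)             # distance to nearest kept token
    return torch.tensor(selected)
```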

[68] DAASH: A Meta-Attack Framework for Synthesizing Effective and Stealthy Adversarial Examples

Abdullah Al Nomaan Nafi, Habibur Rahaman, Zafaryab Haider, Tanzim Mahfuz, Fnu Suya, Swarup Bhunia, Prabuddha Chakraborty

Main category: cs.CV

TL;DR: DAASH is a differentiable meta-attack framework that strategically combines existing Lp-based attacks to generate perceptually aligned adversarial examples with superior success rates and visual quality.

DetailsMotivation: Traditional Lp-norm bounded adversarial examples often fail to align with human perception, and it's unclear if insights from Lp-constrained attacks can improve perceptual efficacy. Recent methods exploring perceptually aligned examples are limited.

Method: Multi-stage framework that aggregates candidate adversarial examples from multiple base attacks using learned adaptive weights. Uses a novel meta-loss function to jointly minimize misclassification loss and perceptual distortion, dynamically modulating each base attack’s contribution.

Result: Significantly outperforms state-of-the-art perceptual attacks (20.63% improvement in success rate) with superior visual quality (SSIM, LPIPS, and FID improvements of ≈11, 0.015, and 5.7). Generalizes well to unseen defenses.

Conclusion: DAASH provides a practical and strong baseline for evaluating robustness without requiring handcrafted adaptive attacks for each new defense, demonstrating that Lp-based methods can be effectively leveraged for perceptual alignment.

Abstract: Numerous techniques have been proposed for generating adversarial examples in white-box settings under strict Lp-norm constraints. However, such norm-bounded examples often fail to align well with human perception, and only recently have a few methods begun specifically exploring perceptually aligned adversarial examples. Moreover, it remains unclear whether insights from Lp-constrained attacks can be effectively leveraged to improve perceptual efficacy. In this paper, we introduce DAASH, a fully differentiable meta-attack framework that generates effective and perceptually aligned adversarial examples by strategically composing existing Lp-based attack methods. DAASH operates in a multi-stage fashion: at each stage, it aggregates candidate adversarial examples from multiple base attacks using learned, adaptive weights and propagates the result to the next stage. A novel meta-loss function guides this process by jointly minimizing misclassification loss and perceptual distortion, enabling the framework to dynamically modulate the contribution of each base attack throughout the stages. We evaluate DAASH on adversarially trained models across CIFAR-10, CIFAR-100, and ImageNet. Despite relying solely on Lp-constrained base methods, DAASH significantly outperforms state-of-the-art perceptual attacks such as AdvAD, achieving higher attack success rates (e.g., a 20.63% improvement) and superior visual quality, as measured by SSIM, LPIPS, and FID (improvements of $\approx$ 11, 0.015, and 5.7, respectively). Furthermore, DAASH generalizes well to unseen defenses, making it a practical and strong baseline for evaluating robustness without requiring handcrafted adaptive attacks for each new defense.
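
A hedged sketch of the aggregation and meta-loss ideas: candidates from several base attacks are combined with learned softmax weights, and the objective trades off misclassification against distortion. The L2 term stands in for a perceptual metric such as LPIPS, and the multi-stage propagation is omitted, so this is a simplification of the paper's formulation.

```python
import torch
import torch.nn.functional as F

def aggregate_candidates(candidates, weight_logits):
    """Combine candidate adversarial examples from several base attacks using
    learned adaptive weights (softmax-normalized over attacks)."""
    w = torch.softmax(weight_logits, dim=0)             # (num_attacks,)
    stacked = torch.stack(candidates, dim=0)            # (num_attacks, B, C, H, W)
    return (w.view(-1, 1, 1, 1, 1) * stacked).sum(dim=0)

def meta_loss(model, x_adv, x_clean, labels, lam=0.1):
    """Encourage misclassification while keeping distortion low; minimizing
    the negative cross-entropy pushes predictions away from the true label."""
    mis = -F.cross_entropy(model(x_adv), labels)
    percep = F.mse_loss(x_adv, x_clean)                 # stand-in for a perceptual metric
    return mis + lam * percep
```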

[69] Automated Assessment of Aesthetic Outcomes in Facial Plastic Surgery

Pegah Varghaei, Kiran Abraham-Aggarwal, Manoj T. Abraham, Arun Ross

Main category: cs.CV

TL;DR: A computer-vision framework for quantifying aesthetic outcomes of facial plastic surgery using automated landmark detection, symmetry analysis, age estimation, and nasal morphology on the largest curated dataset of pre/post-operative facial images.

DetailsMotivation: To provide objective, quantitative benchmarks for evaluating facial plastic surgery outcomes, facilitating data-driven surgical planning, patient counseling, and objective outcome evaluation across different practices.

Method: Leverages automated landmark detection, geometric facial symmetry computation, deep-learning-based age estimation, and nasal morphology analysis on a dataset of 7,160 photographs from 1,259 patients, including a dedicated rhinoplasty subset.

Result: 96.2% of rhinoplasty patients showed improvement in at least one nasal measurement; significant improvements in alar width to face width ratio (77.0%), nose length to face height ratio (41.5%), and alar width to intercanthal ratio (39.3%). 71.3% showed enhancements in facial symmetry or perceived age. Patient identity consistency maintained with 99.5-99.6% True Match Rates.

Conclusion: The framework provides reproducible quantitative benchmarks and facilitates objective evaluation of facial plastic surgery outcomes, demonstrating significant improvements in nasal measurements, facial symmetry, and perceived age while maintaining patient identity consistency.

Abstract: We introduce a scalable, interpretable computer-vision framework for quantifying aesthetic outcomes of facial plastic surgery using frontal photographs. Our pipeline leverages automated landmark detection, geometric facial symmetry computation, deep-learning-based age estimation, and nasal morphology analysis. To perform this study, we first assemble the largest curated dataset of paired pre- and post-operative facial images to date, encompassing 7,160 photographs from 1,259 patients. This dataset includes a dedicated rhinoplasty-only subset consisting of 732 images from 366 patients, 96.2% of whom showed improvement in at least one of the three nasal measurements with statistically significant group-level change. Among these patients, the greatest statistically significant improvements (p < 0.001) occurred in the alar width to face width ratio (77.0%), nose length to face height ratio (41.5%), and alar width to intercanthal ratio (39.3%). Among the broader frontal-view cohort, comprising 989 rigorously filtered subjects, 71.3% exhibited significant enhancements in global facial symmetry or perceived age (p < 0.01). Importantly, our analysis shows that patient identity remains consistent post-operatively, with True Match Rates of 99.5% and 99.6% at a False Match Rate of 0.01% for the rhinoplasty-specific and general patient cohorts, respectively. Additionally, we analyze inter-practitioner variability in improvement rates. By providing reproducible, quantitative benchmarks and a novel dataset, our pipeline facilitates data-driven surgical planning, patient counseling, and objective outcome evaluation across practices.
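
The reported nasal measurements are simple landmark ratios; a sketch is below. The landmark names are illustrative placeholders for whichever frontal landmarking scheme the detector provides, not the study's exact definitions.

```python
import numpy as np

def nasal_ratios(lm):
    """Compute the three nasal measurements from a dict of 2D landmarks.
    Landmark keys are hypothetical names for standard facial landmarks."""
    def dist(a, b):
        return float(np.linalg.norm(np.asarray(lm[a]) - np.asarray(lm[b])))

    alar_width   = dist("alar_left", "alar_right")
    face_width   = dist("zygion_left", "zygion_right")
    nose_length  = dist("nasion", "subnasale")
    face_height  = dist("trichion", "menton")
    intercanthal = dist("endocanthion_left", "endocanthion_right")

    return {
        "alar_width_to_face_width": alar_width / face_width,
        "nose_length_to_face_height": nose_length / face_height,
        "alar_width_to_intercanthal": alar_width / intercanthal,
    }
```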

[70] RED.AI Id-Pattern: First Results of Stone Deterioration Patterns with Multi-Agent Systems

Daniele Corradetti, José Delgado Rodrigues

Main category: cs.CV

TL;DR: Multi-agent AI system for automated identification of stone deterioration patterns that substantially outperforms its underlying foundation model across all evaluated metrics

DetailsMotivation: Traditional stone deterioration identification methods are accurate but time-consuming and resource-intensive, requiring expert teams for direct observation

Method: Multi-agent AI system with 5 specialized agents (lithologist, pathologist, environmental expert, conservator-restorer, diagnostic coordinator) using cognitive architecture to simulate expert collaboration

Result: System evaluated on 28 difficult images with multiple deterioration patterns, showing significant improvement in all metrics compared to foundational model

Conclusion: The Id-Pattern system demonstrates that multi-agent AI can effectively automate stone pathology diagnosis, with first results showing large gains over the foundational model alone

Abstract: The Id-Pattern system within the RED.AI project (Reabilitação Estrutural Digital através da AI) consists of an agentic system designed to assist in the identification of stone deterioration patterns. Traditional methodologies, based on direct observation by expert teams, are accurate but costly in terms of time and resources. The system developed here introduces and evaluates a multi-agent artificial intelligence (AI) system designed to simulate collaboration between experts and automate the diagnosis of stone pathologies from visual evidence. The approach is based on a cognitive architecture that orchestrates a team of specialized AI agents which, in this specific case, are limited to five: a lithologist, a pathologist, an environmental expert, a conservator-restorer, and a diagnostic coordinator. To evaluate the system we selected 28 difficult images involving multiple deterioration patterns. Our first results show a substantial improvement on all metrics compared to the foundational model alone.

[71] Applications of Small Language Models in Medical Imaging Classification with a Focus on Prompt Strategies

Yiting Wang, Ziwei Wang, Jiachen Zhong, Di Zhu, Weiyi Li

Main category: cs.CV

TL;DR: Small language models with optimized prompts can achieve competitive accuracy in medical imaging classification, offering a practical alternative to large language models in healthcare settings.

DetailsMotivation: Large language models face adoption barriers in healthcare due to high computational costs, limited accessibility, and data privacy concerns, creating a need for more practical alternatives.

Method: Evaluated multiple small language models on NIH Chest X-ray dataset for classifying X-ray positions (AP vs. PA) using three prompt strategies: baseline instruction, incremental summary prompts, and correction-based reflective prompts.

Result: Certain small language models achieved competitive accuracy with well-crafted prompts, demonstrating that prompt engineering can substantially enhance SLM performance without requiring deep AI expertise.

Conclusion: Prompt engineering enables small language models to perform effectively in healthcare applications, providing a viable solution that addresses computational, accessibility, and privacy concerns associated with large language models.

Abstract: Large language models (LLMs) have shown remarkable capabilities in natural language processing and multi-modal understanding. However, their high computational cost, limited accessibility, and data privacy concerns hinder their adoption in resource-constrained healthcare environments. This study investigates the performance of small language models (SLMs) in a medical imaging classification task, comparing different models and prompt designs to identify the optimal combination for accuracy and usability. Using the NIH Chest X-ray dataset, we evaluate multiple SLMs on the task of classifying chest X-ray positions (anteroposterior [AP] vs. posteroanterior [PA]) under three prompt strategies: baseline instruction, incremental summary prompts, and correction-based reflective prompts. Our results show that certain SLMs achieve competitive accuracy with well-crafted prompts, suggesting that prompt engineering can substantially enhance SLM performance in healthcare applications without requiring deep AI expertise from end users.
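
For illustration, the three prompt strategies might look like the hypothetical templates below; the exact wording used in the study is not reproduced here.

```python
# Hypothetical prompt templates for the three strategies; illustrative only.
BASELINE = (
    "You are given a chest X-ray. Answer with exactly one word: "
    "AP (anteroposterior) or PA (posteroanterior)."
)

INCREMENTAL_SUMMARY = (
    "Step 1: Describe the visible anatomy (clavicles, scapulae, heart border).\n"
    "Step 2: Summarize which cues indicate the projection direction.\n"
    "Step 3: Based on your summary, answer AP or PA."
)

CORRECTION_REFLECTIVE = (
    "Give an initial answer (AP or PA), then list reasons it could be wrong, "
    "and finally state your corrected answer as 'Final: AP' or 'Final: PA'."
)
```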

[72] AIM 2025 Rip Current Segmentation (RipSeg) Challenge Report

Andrei Dumitriu, Florin Miron, Florin Tatui, Radu Tudor Ionescu, Radu Timofte, Aakash Ralhan, Florin-Alexandru Vasluianu, Shenyang Qian, Mitchell Harley, Imran Razzak, Yang Song, Pu Luo, Yumei Li, Cong Xu, Jinming Chai, Kexin Zhang, Licheng Jiao, Lingling Li, Siqi Yu, Chao Zhang, Kehuan Song, Fang Liu, Puhua Chen, Xu Liu, Jin Hu, Jinyang Xu, Biao Liu

Main category: cs.CV

TL;DR: AIM 2025 RipSeg Challenge focused on automatic rip current segmentation using the RipVIS dataset, with 75 participants and 5 valid submissions evaluated on composite metrics combining F1, F2, and AP scores.

DetailsMotivation: Rip currents are dangerous flows that pose major beach safety risks worldwide, making accurate visual detection an important but underexplored research task that requires precise segmentation.

Method: The challenge used the largest available rip current dataset (RipVIS) for single-class instance segmentation, evaluating teams on composite metrics (F1, F2, AP50, AP[50:95]) with top methods leveraging deep learning architectures, domain adaptation, pretrained models, and domain generalization.

Result: 75 participants registered with 5 valid test submissions. Top-performing methods successfully applied advanced deep learning techniques to improve rip current segmentation performance under diverse conditions.

Conclusion: The challenge provided insights into current rip current segmentation capabilities, identified key challenges, and outlined future directions for expanding RipSeg research to enhance beach safety through better detection methods.

Abstract: This report presents an overview of the AIM 2025 RipSeg Challenge, a competition designed to advance techniques for automatic rip current segmentation in still images. Rip currents are dangerous, fast-moving flows that pose a major risk to beach safety worldwide, making accurate visual detection an important and underexplored research task. The challenge builds on RipVIS, the largest available rip current dataset, and focuses on single-class instance segmentation, where precise delineation is critical to fully capture the extent of rip currents. The dataset spans diverse locations, rip current types, and camera orientations, providing a realistic and challenging benchmark. In total, 75 participants registered for this first edition, resulting in 5 valid test submissions. Teams were evaluated on a composite score combining $F_1$, $F_2$, $AP_{50}$, and $AP_{[50:95]}$, ensuring robust and application-relevant rankings. The top-performing methods leveraged deep learning architectures, domain adaptation techniques, pretrained models, and domain generalization strategies to improve performance under diverse conditions. This report outlines the dataset details, competition framework, evaluation metrics, and final results, providing insights into the current state of rip current segmentation. We conclude with a discussion of key challenges, lessons learned from the submissions, and future directions for expanding RipSeg.
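
The composite ranking score is a weighted combination of the four metrics; a sketch with assumed equal weights is shown below (the report defines the official weighting).

```python
def composite_score(f1, f2, ap50, ap50_95, weights=(0.25, 0.25, 0.25, 0.25)):
    """Combine the four challenge metrics into a single ranking score.
    Equal weights are an assumption, not the official challenge weighting."""
    w1, w2, w3, w4 = weights
    return w1 * f1 + w2 * f2 + w3 * ap50 + w4 * ap50_95
```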

[73] Structured Prompting and Multi-Agent Knowledge Distillation for Traffic Video Interpretation and Risk Inference

Yunxiang Yang, Ningning Xu, Jidong J. Yang

Main category: cs.CV

TL;DR: A novel structured prompting and knowledge distillation framework using VLMs to create VISTA, a compact 3B model for highway scene understanding and traffic risk assessment that achieves strong performance despite reduced size.

DetailsMotivation: Traditional approaches struggle with scalability and generalization in complex real-world traffic environments, requiring better solutions for highway scene understanding and traffic risk inference in ITS and autonomous driving.

Method: Uses structured Chain-of-Thought strategy with GPT-4o and o3-mini VLMs to generate multi-perspective outputs as pseudo-annotations for supervised fine-tuning of a smaller student VLM (VISTA).

Result: VISTA achieves strong performance across captioning metrics (BLEU-4, METEOR, ROUGE-L, CIDEr) comparable to teacher models despite being much smaller, enabling low-resolution traffic video understanding and risk-aware captions.

Conclusion: Effective knowledge distillation and structured multi-agent supervision can empower lightweight VLMs to capture complex reasoning capabilities, facilitating efficient deployment on edge devices for real-time risk monitoring.

Abstract: Comprehensive highway scene understanding and robust traffic risk inference are vital for advancing Intelligent Transportation Systems (ITS) and autonomous driving. Traditional approaches often struggle with scalability and generalization, particularly under the complex and dynamic conditions of real-world environments. To address these challenges, we introduce a novel structured prompting and knowledge distillation framework that enables automatic generation of high-quality traffic scene annotations and contextual risk assessments. Our framework orchestrates two large Vision-Language Models (VLMs): GPT-4o and o3-mini, using a structured Chain-of-Thought (CoT) strategy to produce rich, multi-perspective outputs. These outputs serve as knowledge-enriched pseudo-annotations for supervised fine-tuning of a much smaller student VLM. The resulting compact 3B-scale model, named VISTA (Vision for Intelligent Scene and Traffic Analysis), is capable of understanding low-resolution traffic videos and generating semantically faithful, risk-aware captions. Despite its significantly reduced parameter count, VISTA achieves strong performance across established captioning metrics (BLEU-4, METEOR, ROUGE-L, and CIDEr) when benchmarked against its teacher models. This demonstrates that effective knowledge distillation and structured multi-agent supervision can empower lightweight VLMs to capture complex reasoning capabilities. The compact architecture of VISTA facilitates efficient deployment on edge devices, enabling real-time risk monitoring without requiring extensive infrastructure upgrades.

[74] EDTalk++: Full Disentanglement for Controllable Talking Head Synthesis

Shuai Tan, Bin Ji

Main category: cs.CV

TL;DR: EDTalk++ is a novel framework for controllable talking head generation that enables full disentanglement of facial motions (mouth shape, head pose, eye movement, emotional expression) and supports both video and audio inputs through shared visual priors.

DetailsMotivation: Existing methods lack proper disentanglement of facial features, failing to ensure independent operation without mutual interference and the ability to share motion representations across different input modalities, which limits practical applications.

Method: Uses four lightweight modules to decompose facial dynamics into four distinct latent spaces (mouth, pose, eye, expression) with learnable orthogonal bases. Implements an efficient training strategy to allocate motion responsibilities and stores learned bases in banks for shared visual priors with audio input. Includes an Audio-to-Motion module for audio-driven synthesis.

Result: The framework successfully achieves independent manipulation of individual facial motions while maintaining compatibility with both video and audio input modalities, demonstrating effective disentanglement and cross-modal sharing capabilities.

Conclusion: EDTalk++ provides a comprehensive solution for fully disentangled facial motion control in talking head generation, addressing key limitations of existing methods and enabling more flexible applications across different input modalities.

Abstract: Achieving disentangled control over multiple facial motions and accommodating diverse input modalities greatly enhances the application and entertainment of the talking head generation. This necessitates a deep exploration of the decoupling space for facial features, ensuring that they a) operate independently without mutual interference and b) can be preserved to share with different modal inputs, both aspects often neglected in existing methods. To address this gap, this paper proposes EDTalk++, a novel full disentanglement framework for controllable talking head generation. Our framework enables individual manipulation of mouth shape, head pose, eye movement, and emotional expression, conditioned on video or audio inputs. Specifically, we employ four lightweight modules to decompose the facial dynamics into four distinct latent spaces representing mouth, pose, eye, and expression, respectively. Each space is characterized by a set of learnable bases whose linear combinations define specific motions. To ensure independence and accelerate training, we enforce orthogonality among bases and devise an efficient training strategy to allocate motion responsibilities to each space without relying on external knowledge. The learned bases are then stored in corresponding banks, enabling shared visual priors with audio input. Furthermore, considering the properties of each space, we propose an Audio-to-Motion module for audio-driven talking head synthesis. Experiments are conducted to demonstrate the effectiveness of EDTalk++.
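
One latent space of this kind can be sketched as a bank of learnable bases with an orthogonality penalty; the dimensions, initialization, and penalty form below are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionBank(nn.Module):
    """Sketch of one disentangled latent space: motions are linear
    combinations of learnable bases, and an orthogonality penalty keeps the
    bases independent of one another."""
    def __init__(self, num_bases=20, dim=512):
        super().__init__()
        self.bases = nn.Parameter(torch.randn(num_bases, dim) * 0.01)

    def forward(self, coeffs):
        # coeffs: (B, num_bases) -> motion code: (B, dim)
        return coeffs @ self.bases

    def orthogonality_loss(self):
        b = F.normalize(self.bases, dim=1)
        gram = b @ b.t()
        eye = torch.eye(gram.shape[0], device=gram.device)
        return ((gram - eye) ** 2).sum()   # penalize off-diagonal correlations
```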

[75] Revisiting MLLM Token Technology through the Lens of Classical Visual Coding

Jinming Liu, Junyan Lin, Yuntao Wei, Kele Shao, Keda Tao, Jianguo Huang, Xudong Yang, Zhibo Chen, Huan Wang, Xin Jin

Main category: cs.CV

TL;DR: This paper bridges MLLM token technology with classical visual coding, establishing a unified framework for comparative analysis and bidirectional knowledge transfer between the two fields.

DetailsMotivation: Both classical visual coding and MLLM token technology share the core objective of maximizing information fidelity while minimizing computational cost, suggesting potential for cross-disciplinary insights.

Method: The authors establish a unified formulation connecting token technology and visual coding, enabling systematic module-by-module comparative analysis and bidirectional knowledge synthesis.

Result: The study provides the first comprehensive structured technology comparison between MLLM token and visual coding, identifying how each field can inform the other’s development.

Conclusion: This research paves the way for more efficient multimodal models and more powerful visual codecs through cross-pollination of principles between visual coding and token technology.

Abstract: Classical visual coding and Multimodal Large Language Model (MLLM) token technology share the core objective - maximizing information fidelity while minimizing computational cost. Therefore, this paper reexamines MLLM token technology, including tokenization, token compression, and token reasoning, through the established principles of long-developed visual coding area. From this perspective, we (1) establish a unified formulation bridging token technology and visual coding, enabling a systematic, module-by-module comparative analysis; (2) synthesize bidirectional insights, exploring how visual coding principles can enhance MLLM token techniques’ efficiency and robustness, and conversely, how token technology paradigms can inform the design of next-generation semantic visual codecs; (3) prospect for promising future research directions and critical unsolved challenges. In summary, this study presents the first comprehensive and structured technology comparison of MLLM token and visual coding, paving the way for more efficient multimodal models and more powerful visual codecs simultaneously.

[76] MAViS: A Multi-Agent Framework for Long-Sequence Video Storytelling

Qian Wang, Ziqi Huang, Ruoxi Jia, Paul Debevec, Ning Yu

Main category: cs.CV

TL;DR: MAViS is a multi-agent collaborative framework for long-sequence video storytelling that addresses limitations in assistive capability, visual quality, and expressiveness through specialized agents working across multiple stages with a 3E Principle.

DetailsMotivation: Current long-sequence video generation frameworks suffer from poor assistive capability, suboptimal visual quality, and limited expressiveness, creating a need for a more comprehensive solution.

Method: End-to-end multi-agent framework orchestrating specialized agents across script writing, shot designing, character modeling, keyframe generation, video animation, and audio generation stages, following the 3E Principle (Explore, Examine, Enhance) with Script Writing Guidelines for compatibility.

Result: MAViS achieves state-of-the-art performance in assistive capability, visual quality, and video expressiveness, producing high-quality, expressive long-sequence videos with narratives and background music from brief user prompts.

Conclusion: MAViS provides the only framework offering multimodal design output (videos with narratives and background music), enables scalability with diverse generative models, and enriches user inspiration and creativity for long-sequence video storytelling.

Abstract: Despite recent advances, long-sequence video generation frameworks still suffer from significant limitations: poor assistive capability, suboptimal visual quality, and limited expressiveness. To mitigate these limitations, we propose MAViS, an end-to-end multi-agent collaborative framework for long-sequence video storytelling. MAViS orchestrates specialized agents across multiple stages, including script writing, shot designing, character modeling, keyframe generation, video animation, and audio generation. In each stage, agents operate under the 3E Principle – Explore, Examine, and Enhance – to ensure the completeness of intermediate outputs. Considering the capability limitations of current generative models, we propose the Script Writing Guidelines to optimize compatibility between scripts and generative tools. Experimental results demonstrate that MAViS achieves state-of-the-art performance in assistive capability, visual quality, and video expressiveness. Its modular framework further enables scalability with diverse generative models and tools. With just a brief user prompt, MAViS is capable of producing high-quality, expressive long-sequence video storytelling, enriching inspirations and creativity for users. To the best of our knowledge, MAViS is the only framework that provides multimodal design output – videos with narratives and background music.

[77] Vision Transformers for Kidney Stone Image Classification: A Comparative Study with CNNs

Ivan Reyes-Amezcua, Francisco Lopez-Tiro, Clement Larose, Andres Mendez-Vazquez, Gilberto Ochoa-Ruiz, Christian Daul

Main category: cs.CV

TL;DR: Vision Transformers outperform CNNs in kidney stone classification from endoscopic images, achieving up to 95.2% accuracy compared to 64.5% with ResNet50, particularly excelling in complex imaging conditions.

DetailsMotivation: Kidney stone classification is crucial for personalized treatment but CNN models struggle with long-range dependencies and variable imaging conditions in endoscopic images.

Method: Comparative analysis between Vision Transformers (ViTs) and CNN-based models (ResNet50) on two ex vivo datasets with CCD camera and flexible ureteroscope images, using ViT-base pretrained on ImageNet-21k.

Result: ViT consistently outperformed ResNet50 across all conditions: 95.2% accuracy vs 64.5% in complex endoscopic patches, and 87.1% vs 78.4% in mixed-view CCD images, with similar improvements in F1-score, precision and recall.

Conclusion: ViT-based architectures provide superior classification performance and offer a scalable alternative to conventional CNNs for kidney stone image analysis.

Abstract: Kidney stone classification from endoscopic images is critical for personalized treatment and recurrence prevention. While convolutional neural networks (CNNs) have shown promise in this task, their limited ability to capture long-range dependencies can hinder performance under variable imaging conditions. This study presents a comparative analysis between Vision Transformers (ViTs) and CNN-based models, evaluating their performance on two ex vivo datasets comprising CCD camera and flexible ureteroscope images. The ViT-base model pretrained on ImageNet-21k consistently outperformed a ResNet50 baseline across multiple imaging conditions. For instance, in the most visually complex subset (Section patches from endoscopic images), the ViT model achieved 95.2% accuracy and 95.1% F1-score, compared to 64.5% and 59.3% with ResNet50. In the mixed-view subset from CCD-camera images, ViT reached 87.1% accuracy versus 78.4% with CNN. These improvements extend across precision and recall as well. The results demonstrate that ViT-based architectures provide superior classification performance and offer a scalable alternative to conventional CNNs for kidney stone image analysis.

[78] STER-VLM: Spatio-Temporal With Enhanced Reference Vision-Language Models

Tinh-Anh Nguyen-Nhu, Triet Dao Hoang Minh, Dat To-Thanh, Phuc Le-Gia, Tuan Vo-Lan, Tien-Huy Nguyen

Main category: cs.CV

TL;DR: STER-VLM is an efficient vision-language framework that improves traffic analysis by decomposing captions, selecting optimal frames, using reference-driven understanding, and curated prompts, achieving strong performance with reduced computational demands.

DetailsMotivation: Current vision-language models for traffic analysis require substantial computational resources and struggle with fine-grained spatio-temporal understanding, limiting their practical deployment in real-world applications.

Method: The framework uses four key techniques: (1) caption decomposition to handle spatial and temporal information separately, (2) temporal frame selection with best-view filtering, (3) reference-driven understanding for fine-grained motion capture, and (4) curated visual/textual prompt techniques.

Result: Experimental results on WTS and BDD datasets show substantial gains in semantic richness and traffic scene interpretation. The framework achieved a test score of 55.655 in the AI City Challenge 2025 Track 2.

Conclusion: STER-VLM effectively advances resource-efficient and accurate traffic analysis, demonstrating strong performance for real-world applications with reduced computational requirements.

Abstract: Vision-language models (VLMs) have emerged as powerful tools for enabling automated traffic analysis; however, current approaches often demand substantial computational resources and struggle with fine-grained spatio-temporal understanding. This paper introduces STER-VLM, a computationally efficient framework that enhances VLM performance through (1) caption decomposition to tackle spatial and temporal information separately, (2) temporal frame selection with best-view filtering for sufficient temporal information, (3) reference-driven understanding for capturing fine-grained motion and dynamic context, and (4) curated visual/textual prompt techniques. Experimental results on the WTS and BDD datasets demonstrate substantial gains in semantic richness and traffic scene interpretation. Our framework is validated through a decent test score of 55.655 in the AI City Challenge 2025 Track 2, showing its effectiveness in advancing resource-efficient and accurate traffic analysis for real-world applications.

[79] MINR: Efficient Implicit Neural Representations for Multi-Image Encoding

Wenyong Zhou, Taiqiang Wu, Zhengwu Liu, Yuxin Cheng, Chen Zhang, Ngai Wong

Main category: cs.CV

TL;DR: MINR proposes shared intermediate layers for multi-image implicit neural representations, reducing parameters by 60% while maintaining performance across 100 images.

DetailsMotivation: Current implicit neural representations use separate networks per image, leading to computational and storage inefficiencies for multi-image encoding.

Method: Share intermediate layers across multiple images while keeping input/output layers image-specific, plus adding projection layers to capture unique features per image.

Result: Saves up to 60% parameters while maintaining comparable performance, scales to 100 images with average PSNR of 34 dB, and proves robust across various backbones.

Conclusion: MINR provides an efficient parameter-sharing approach for multi-image implicit neural representations without sacrificing reconstruction quality.

Abstract: Implicit Neural Representations (INRs) aim to parameterize discrete signals through implicit continuous functions. However, formulating each image with a separate neural network (typically a Multi-Layer Perceptron (MLP)) leads to computational and storage inefficiencies when encoding multiple images. To address this issue, we propose MINR, which shares specific layers to encode multiple images efficiently. We first compare the layer-wise weight distributions for several trained INRs and find that corresponding intermediate layers follow highly similar distribution patterns. Motivated by this, we share these intermediate layers across multiple images while preserving the input and output layers as input-specific. In addition, we design an extra novel projection layer for each image to capture its unique features. Experimental results on image reconstruction and super-resolution tasks demonstrate that MINR can save up to 60% of parameters while maintaining comparable performance. In particular, MINR scales effectively to handle 100 images, maintaining an average peak signal-to-noise ratio (PSNR) of 34 dB. Further analysis of various backbones proves the robustness of the proposed MINR.
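
The sharing scheme can be sketched as follows: intermediate MLP layers are shared across images while each image keeps its own input layer, output layer, and projection layer. Layer sizes, the residual use of the projection, and the sine activation are assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class MINRStyleINR(nn.Module):
    """Sketch of multi-image INR sharing: shared intermediate layers plus
    per-image input, output, and projection layers."""
    def __init__(self, num_images, hidden=256, depth=4):
        super().__init__()
        self.inputs  = nn.ModuleList([nn.Linear(2, hidden) for _ in range(num_images)])
        self.shared  = nn.ModuleList([nn.Linear(hidden, hidden) for _ in range(depth)])
        self.projs   = nn.ModuleList([nn.Linear(hidden, hidden) for _ in range(num_images)])
        self.outputs = nn.ModuleList([nn.Linear(hidden, 3) for _ in range(num_images)])

    def forward(self, coords, image_id):
        h = torch.sin(self.inputs[image_id](coords))   # SIREN-like activation (assumed)
        for layer in self.shared:                      # layers shared across all images
            h = torch.sin(layer(h))
        h = h + self.projs[image_id](h)                # image-specific residual projection
        return self.outputs[image_id](h)               # per-image RGB prediction
```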

[80] Distribution-Aware Hadamard Quantization for Hardware-Efficient Implicit Neural Representations

Wenyong Zhou, Jiachen Ren, Taiqiang Wu, Yuxin Cheng, Zhengwu Liu, Ngai Wong

Main category: cs.CV

TL;DR: DHQ is a distribution-aware Hadamard quantization scheme that quantizes both weights and activations in Implicit Neural Representations, achieving significant hardware efficiency gains while maintaining performance.

DetailsMotivation: INRs require full-precision computation which causes significant hardware overhead. Previous quantization methods only focused on weights, offering limited hardware savings due to lack of activation quantization.

Method: Proposes DHQ that uses Hadamard transformation to standardize diverse weight and activation distributions into a unified bell-shaped form before applying standard quantization, enabling quantization of both weights and activations.

Result: Reduces latency by 32.7%, energy consumption by 40.1%, and resource utilization by up to 98.3% compared to full-precision counterparts while outperforming previous quantization methods on image reconstruction tasks.

Conclusion: DHQ effectively addresses hardware inefficiency of INRs through comprehensive quantization of both weights and activations using distribution-aware Hadamard transformation, demonstrating practical FPGA implementation benefits.

Abstract: Implicit Neural Representations (INRs) encode discrete signals using Multi-Layer Perceptrons (MLPs) with complex activation functions. While INRs achieve superior performance, they depend on full-precision number representation for accurate computation, resulting in significant hardware overhead. Previous INR quantization approaches have primarily focused on weight quantization, offering only limited hardware savings due to the lack of activation quantization. To fully exploit the hardware benefits of quantization, we propose DHQ, a novel distribution-aware Hadamard quantization scheme that targets both weights and activations in INRs. Our analysis shows that the weights in the first and last layers have distributions distinct from those in the intermediate layers, while the activations in the last layer differ significantly from those in the preceding layers. Instead of customizing quantizers individually, we utilize the Hadamard transformation to standardize these diverse distributions into a unified bell-shaped form, supported by both empirical evidence and theoretical analysis, before applying a standard quantizer. To demonstrate the practical advantages of our approach, we present an FPGA implementation of DHQ that highlights its hardware efficiency. Experiments on diverse image reconstruction tasks show that DHQ outperforms previous quantization methods, reducing latency by 32.7%, energy consumption by 40.1%, and resource utilization by up to 98.3% compared to full-precision counterparts.
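
A toy version of the idea: rotate the values with a normalized Hadamard matrix to reshape their distribution into a more bell-shaped form, then apply a standard symmetric uniform quantizer. The per-tensor scale and the power-of-two dimension requirement are simplifications of the hardware-oriented scheme.

```python
import torch
from scipy.linalg import hadamard

def hadamard_quantize(x, bits=8):
    """Rotate the last dimension with an orthonormal Hadamard matrix, apply a
    symmetric uniform quantizer, then dequantize and rotate back so the
    round-trip error can be inspected. Last dim must be a power of two here."""
    d = x.shape[-1]
    H = torch.tensor(hadamard(d), dtype=x.dtype, device=x.device) / d ** 0.5
    x_rot = x @ H                                        # standardize the distribution
    scale = x_rot.abs().max() / (2 ** (bits - 1) - 1)
    x_q = torch.clamp(torch.round(x_rot / scale),
                      -(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    return (x_q * scale) @ H.t()                         # inverse rotation
```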

[81] AIM 2025 challenge on Inverse Tone Mapping Report: Methods and Results

Chao Wang, Francesco Banterle, Bin Ren, Radu Timofte, Xin Lu, Yufeng Peng, Chengjie Ge, Zhijing Sun, Ziang Zhou, Zihao Li, Zishun Liao, Qiyu Kang, Xueyang Fu, Zheng-Jun Zha, Zhijing Sun, Xingbo Wang, Kean Liu, Senyan Xu, Yang Qiu, Yifan Ding, Gabriel Eilertsen, Jonas Unger, Zihao Wang, Ke Wu, Jinshan Pan, Zhen Liu, Zhongyang Li, Shuaicheng Liu, S. M Nadim Uddin

Main category: cs.CV

TL;DR: AIM 2025 Challenge on Inverse Tone Mapping attracted 67 participants with 319 submissions, with each of the top five entries reaching at least 29.22 dB PU21-PSNR, establishing new benchmarks for HDR reconstruction from LDR images.

DetailsMotivation: To advance the development of effective inverse tone mapping algorithms for high-quality HDR image reconstruction from single LDR inputs, focusing on both perceptual fidelity and numerical accuracy.

Method: Comprehensive review and analysis of submissions from 67 participants, with detailed examination of the top five teams’ methodologies for HDR reconstruction from LDR images.

Result: 319 valid submissions were evaluated, with the best performing teams achieving a minimum PU21-PSNR of 29.22 dB among top entries, demonstrating significant progress in ITM algorithms.

Conclusion: The challenge successfully established strong benchmarks for inverse tone mapping research and highlighted innovative strategies that will guide future developments in HDR reconstruction quality enhancement.

Abstract: This paper presents a comprehensive review of the AIM 2025 Challenge on Inverse Tone Mapping (ITM). The challenge aimed to push forward the development of effective ITM algorithms for HDR image reconstruction from single LDR inputs, focusing on perceptual fidelity and numerical consistency. A total of 67 participants submitted 319 valid results, from which the best five teams were selected for detailed analysis. This report consolidates their methodologies and performance, with the lowest PU21-PSNR among the top entries reaching 29.22 dB. The analysis highlights innovative strategies for enhancing HDR reconstruction quality and establishes strong benchmarks to guide future research in inverse tone mapping.

[82] Enhancing Robustness of Implicit Neural Representations Against Weight Perturbations

Wenyong Zhou, Yuxin Cheng, Zhengwu Liu, Taiqiang Wu, Chen Zhang, Ngai Wong

Main category: cs.CV

TL;DR: First study on robustness of Implicit Neural Representations (INRs) showing vulnerability to weight perturbations, with proposed robust loss function achieving 7.5dB PSNR improvement.

DetailsMotivation: INRs demonstrate significant value in multimedia applications but are vulnerable to unavoidable weight perturbations that cause substantial performance degradation in signal reconstruction.

Method: Formulate robustness problem by minimizing difference between loss with/without weight perturbations, and derive novel robust loss function to regulate gradient of reconstruction loss with respect to weights.

Result: Extensive experiments across multiple modalities show up to 7.5dB improvement in PSNR values compared to original INRs under noisy conditions.

Conclusion: The proposed robust loss function effectively enhances INR robustness against weight perturbations, making them more suitable for real-world deployments.

Abstract: Implicit Neural Representations (INRs) encode discrete signals in a continuous manner using neural networks, demonstrating significant value across various multimedia applications. However, the vulnerability of INRs presents a critical challenge for their real-world deployments, as the network weights might be subjected to unavoidable perturbations. In this work, we investigate the robustness of INRs for the first time and find that even minor perturbations can lead to substantial performance degradation in the quality of signal reconstruction. To mitigate this issue, we formulate the robustness problem in INRs by minimizing the difference between loss with and without weight perturbations. Furthermore, we derive a novel robust loss function to regulate the gradient of the reconstruction loss with respect to weights, thereby enhancing the robustness. Extensive experiments on reconstruction tasks across multiple modalities demonstrate that our method achieves up to a 7.5 dB improvement in peak signal-to-noise ratio (PSNR) values compared to original INRs under noisy conditions.
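
A hedged sketch of the objective: the reconstruction loss plus a penalty on its gradient with respect to the weights, as a first-order surrogate for robustness to small weight perturbations; the paper's exact regularizer may differ.

```python
import torch

def robust_recon_loss(model, coords, targets, beta=1e-3):
    """Reconstruction loss regularized by the squared norm of its gradient
    with respect to the model weights (computed with create_graph so the
    penalty itself can be backpropagated through)."""
    recon = torch.nn.functional.mse_loss(model(coords), targets)
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(recon, params, create_graph=True)
    grad_norm = sum((g ** 2).sum() for g in grads)
    return recon + beta * grad_norm
```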

[83] FAMNet: Integrating 2D and 3D Features for Micro-expression Recognition via Multi-task Learning and Hierarchical Attention

Liangyu Fu, Xuecheng Wu, Danlei Huang, Xinyi Yin

Main category: cs.CV

TL;DR: FAMNet: A multi-task learning framework combining 2D and 3D CNNs with hierarchical attention for micro-expression recognition, achieving state-of-the-art performance on multiple datasets.

DetailsMotivation: Micro-expression recognition is challenging due to short duration and low intensity of expressions. Current deep learning methods struggle to effectively extract fine-grained spatiotemporal features from micro-expressions.

Method: Proposes FAMNet - a fusion model with 2D CNN (AMNet2D) and 3D CNN (AMNet3D) using ResNet18 backbone and attention modules. Uses multi-task learning combining MER and facial action unit detection with hard parameter sharing.

Result: Achieves 83.75% UAR and 84.03% UF1 on SAMM, CASME II, MMEW datasets. On challenging CAS(ME)^3 dataset: 51% UAR and 43.42% UF1, showing significant performance improvements.

Conclusion: The proposed multi-task learning approach with hierarchical attention and 2D-3D CNN fusion effectively extracts omni-directional features for micro-expression recognition, demonstrating superior performance across multiple benchmark datasets.

Abstract: Micro-expression recognition (MER) has essential application value in many fields, but the short duration and low intensity of micro-expressions (MEs) bring considerable challenges to MER. Current deep-learning MER methods mainly use three data loading schemes: static images, dynamic image sequences, or a combination of the two streams. Effectively extracting MEs' fine-grained spatiotemporal features remains difficult. This paper proposes a new MER method based on multi-task learning and hierarchical attention, which extracts MEs' omni-directional features by merging 2D and 3D CNNs. The fusion model consists of a 2D CNN, AMNet2D, and a 3D CNN, AMNet3D, with similar structures consisting of a shared backbone network (ResNet18) and attention modules. During training, the model adopts different data loading methods to suit the two networks, jointly trains on the tasks of MER and facial action unit detection (FAUD), and adopts hard parameter sharing for information association, which further improves the MER task; the final fused model is called FAMNet. Extensive experimental results show that our proposed FAMNet significantly improves task performance. On the SAMM, CASME II and MMEW datasets, FAMNet achieves 83.75% (UAR) and 84.03% (UF1). Furthermore, on the challenging CAS(ME)$^3$ dataset, FAMNet achieves 51% (UAR) and 43.42% (UF1).
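
The hard parameter sharing between the MER and FAUD tasks can be sketched with a shared backbone and two heads; the attention modules and the 3D branch are omitted, and the class/AU counts and loss weighting are placeholders.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class SharedBackboneMultiTask(nn.Module):
    """Minimal hard-parameter-sharing setup: one ResNet18 backbone (randomly
    initialized here) feeds both the micro-expression head and the facial
    action unit head."""
    def __init__(self, num_emotions=5, num_aus=12):
        super().__init__()
        backbone = resnet18(weights=None)
        backbone.fc = nn.Identity()                 # expose 512-d features
        self.backbone = backbone
        self.mer_head = nn.Linear(512, num_emotions)
        self.fau_head = nn.Linear(512, num_aus)     # multi-label AU detection

    def forward(self, x):
        feats = self.backbone(x)
        return self.mer_head(feats), self.fau_head(feats)

def joint_loss(mer_logits, au_logits, emo_labels, au_labels, alpha=0.5):
    # au_labels must be float multi-hot vectors for the BCE term.
    mer = nn.functional.cross_entropy(mer_logits, emo_labels)
    fau = nn.functional.binary_cross_entropy_with_logits(au_logits, au_labels)
    return mer + alpha * fau
```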

[84] CORENet: Cross-Modal 4D Radar Denoising Network with LiDAR Supervision for Autonomous Driving

Fuyang Liu, Jilin Mei, Fangyuan Mao, Chen Min, Yan Xing, Yu Hu

Main category: cs.CV

TL;DR: CORENet is a cross-modal denoising framework that uses LiDAR supervision to clean noisy 4D radar data for improved object detection, achieving state-of-the-art performance while maintaining radar-only operation during inference.

DetailsMotivation: 4D radar provides robust perception in adverse weather but suffers from sparse and noisy point clouds that limit detection effectiveness. Current methods struggle with the elevated noise levels in raw radar data.

Method: Proposes CORENet - a plug-and-play cross-modal denoising framework that leverages LiDAR data during training to identify noise patterns and extract discriminative features from 4D radar point clouds. Works with existing voxel-based detection pipelines without modification.

Result: Extensive evaluation on the challenging Dual-Radar dataset shows CORENet effectively enhances detection robustness and achieves superior performance compared to existing mainstream approaches.

Conclusion: The framework successfully addresses 4D radar noise limitations through LiDAR-supervised denoising while maintaining the practical advantage of radar-only operation during inference, making it suitable for real-world autonomous driving applications.

Abstract: 4D radar-based object detection has garnered great attention for its robustness in adverse weather conditions and capacity to deliver rich spatial information across diverse driving scenarios. Nevertheless, the sparse and noisy nature of 4D radar point clouds poses substantial challenges for effective perception. To address the limitation, we present CORENet, a novel cross-modal denoising framework that leverages LiDAR supervision to identify noise patterns and extract discriminative features from raw 4D radar data. Designed as a plug-and-play architecture, our solution enables seamless integration into voxel-based detection frameworks without modifying existing pipelines. Notably, the proposed method only utilizes LiDAR data for cross-modal supervision during training while maintaining full radar-only operation during inference. Extensive evaluation on the challenging Dual-Radar dataset, which is characterized by elevated noise level, demonstrates the effectiveness of our framework in enhancing detection robustness. Comprehensive experiments validate that CORENet achieves superior performance compared to existing mainstream approaches.

[85] Multi-view Clustering via Bi-level Decoupling and Consistency Learning

Shihao Dong, Yuhui Zheng, Huiying Xu, Xinzhong Zhu

Main category: cs.CV

TL;DR: A novel bi-level decoupling and consistency learning framework (BDCL) for multi-view clustering that enhances feature discriminability and cluster compactness through instance learning, feature/cluster decoupling, and consistency learning.

DetailsMotivation: To improve multi-view clustering performance by addressing the overlooked aspect of cluster-oriented representation learning and better leveraging consistency and complementarity between multi-view features.

Method: Three-module framework: 1) multi-view instance learning with autoencoder reconstruction and contrastive learning, 2) bi-level decoupling of features and clusters, 3) consistency learning using positive pairs from different views and neighbors.

Result: Experimental results on five benchmark datasets demonstrate superiority over state-of-the-art methods.

Conclusion: The proposed BDCL framework effectively enhances inter-cluster discriminability and intra-cluster compactness, achieving superior performance in multi-view clustering tasks.

Abstract: Multi-view clustering has been shown to be an effective method for analyzing underlying patterns in multi-view data. Clustering performance can be improved by learning the consistency and complementarity between multi-view features; however, cluster-oriented representation learning is often overlooked. In this paper, we propose a novel Bi-level Decoupling and Consistency Learning framework (BDCL) to further explore effective representations for multi-view data, enhancing inter-cluster discriminability and intra-cluster compactness of features in multi-view clustering. Our framework comprises three modules: 1) The multi-view instance learning module aligns consistent information while preserving the private features between views through a reconstruction autoencoder and contrastive learning. 2) The bi-level decoupling of features and clusters enhances the discriminability of the feature space and the cluster space. 3) The consistency learning module treats the different views of a sample and their neighbors as positive pairs, learns the consistency of their clustering assignments, and further compresses the intra-cluster space. Experimental results on five benchmark datasets demonstrate the superiority of the proposed method compared with the SOTA methods. Our code is published at https://github.com/LouisDong95/BDCL.

[86] AdaptiveAE: An Adaptive Exposure Strategy for HDR Capturing in Dynamic Scenes

Tianyi Xu, Fan Zhang, Boxin Shi, Tianfan Xue, Yujin Wang

Main category: cs.CV

TL;DR: AdaptiveAE uses reinforcement learning to optimize shutter speed and ISO selection for HDR imaging, addressing motion blur and noise issues in dynamic scenes.

DetailsMotivation: Existing HDR methods overlook the complex interaction between shutter speed and ISO, and fail to account for motion blur effects in dynamic scenes, leading to suboptimal image quality.

Method: A reinforcement learning-based approach that integrates motion blur and noise simulation into training, using semantic information and exposure histograms to adaptively select optimal ISO and shutter speed sequences within a user-defined exposure time budget.

Result: Achieves state-of-the-art performance across multiple datasets, demonstrating better exposure scheduling than traditional solutions.

Conclusion: AdaptiveAE effectively optimizes exposure parameter selection for high-quality HDR reconstruction in dynamic environments by considering both motion blur and noise factors.

Abstract: Mainstream high dynamic range imaging techniques typically rely on fusing multiple images captured with different exposure setups (shutter speed and ISO). A good balance between shutter speed and ISO is crucial for achieving high-quality HDR, as high ISO values introduce significant noise, while long shutter speeds can lead to noticeable motion blur. However, existing methods often overlook the complex interaction between shutter speed and ISO and fail to account for motion blur effects in dynamic scenes. In this work, we propose AdaptiveAE, a reinforcement learning-based method that optimizes the selection of shutter speed and ISO combinations to maximize HDR reconstruction quality in dynamic environments. AdaptiveAE integrates an image synthesis pipeline that incorporates motion blur and noise simulation into our training procedure, leveraging semantic information and exposure histograms. It can adaptively select optimal ISO and shutter speed sequences based on a user-defined exposure time budget, and find a better exposure schedule than traditional solutions. Experimental results across multiple datasets demonstrate that it achieves state-of-the-art performance.

[87] A Lightweight Dual-Mode Optimization for Generative Face Video Coding

Zihan Zhang, Shanzhi Yin, Bolin Chen, Ru-Ling Liao, Shiqi Wang, Yan Ye

Main category: cs.CV

TL;DR: Lightweight GFVC framework with dual-mode optimization achieves 90.4% parameter reduction and 88.9% computation saving while maintaining superior performance compared to VVC.

DetailsMotivation: GFVC's practical deployment is hindered by large model parameters and high computational costs, making it unsuitable for resource-constrained environments like mobile edge devices.

Method: Dual-mode optimization combining architectural redesign (replacing 3x3 convolutions with slimmer layers) and operational refinement (two-stage adaptive channel pruning with soft pruning during training and hard pruning post-training).

Result: Achieves 90.4% parameter reduction and 88.9% computation saving compared to baseline, with superior performance to VVC in perceptual-level quality metrics.

Conclusion: The proposed lightweight method enables efficient GFVC deployment in resource-constrained environments while preserving reconstruction quality.

Abstract: Generative Face Video Coding (GFVC) achieves superior rate-distortion performance by leveraging the strong inference capabilities of deep generative models. However, its practical deployment is hindered by large model parameters and high computational costs. To address this, we propose a lightweight GFVC framework that introduces dual-mode optimization – combining architectural redesign and operational refinement – to reduce complexity whilst preserving reconstruction quality. Architecturally, we replace traditional 3 x 3 convolutions with slimmer and more efficient layers, reducing complexity without compromising feature expressiveness. Operationally, we develop a two-stage adaptive channel pruning strategy: (1) soft pruning during training identifies redundant channels via learnable thresholds, and (2) hard pruning permanently eliminates these channels post-training using a derived mask. This dual-phase approach ensures both training stability and inference efficiency. Experimental results demonstrate that the proposed lightweight dual-mode optimization for GFVC can achieve 90.4% parameter reduction and 88.9% computation saving compared to the baseline, whilst achieving superior performance compared to state-of-the-art video coding standard Versatile Video Coding (VVC) in terms of perceptual-level quality metrics. As such, the proposed method is expected to enable efficient GFVC deployment in resource-constrained environments such as mobile edge devices.
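
The two-stage channel-pruning recipe (learnable soft gating during training, permanent hard pruning afterwards) can be sketched roughly as below. This is an illustrative PyTorch sketch under an assumed gate parameterization and cutoff; it is not the authors' implementation.

```python
# Minimal sketch: a per-channel gate is learned jointly with the network
# ("soft pruning"), and a binary mask derived from it drops channels for good
# after training ("hard pruning"). Parameter names and the cutoff are assumptions.
import torch
import torch.nn as nn

class SoftChannelGate(nn.Module):
    def __init__(self, num_channels: int):
        super().__init__()
        self.threshold = nn.Parameter(torch.zeros(num_channels))  # learnable threshold
        self.scale = nn.Parameter(torch.ones(num_channels))       # learnable channel scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """x: (B, C, H, W). Channels whose scale falls below the threshold are suppressed."""
        gate = torch.sigmoid(self.scale - self.threshold)          # smooth gate in (0, 1)
        return x * gate.view(1, -1, 1, 1)

    def hard_mask(self, cutoff: float = 0.5) -> torch.Tensor:
        """Post-training: derive a binary keep/drop mask from the learned gate."""
        gate = torch.sigmoid(self.scale - self.threshold)
        return (gate > cutoff).float()
```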

[88] Bridging the Gap: Doubles Badminton Analysis with Singles-Trained Models

Seungheon Baek, Jinhyuk Yun

Main category: cs.CV

TL;DR: Transferring singles-trained pose models to doubles badminton analysis using contrastive learning and custom tracking to overcome multi-person challenges.

DetailsMotivation: Doubles matches are more prevalent in international tournaments but understudied due to data availability and multi-person tracking challenges, creating a research gap compared to singles analysis.

Method: Extracted keypoints from a singles dataset using ViT-Pose, embedded them through contrastive learning with ST-GCN, incorporated a custom multi-object tracking algorithm to resolve ID switching, and used a Transformer-based classifier for shot recognition.

Result: Demonstrated the feasibility of extending pose-based shot recognition to doubles badminton, enabling broader analytics capabilities for this predominant format.

Conclusion: Established a foundation for doubles-specific datasets to enhance understanding of this fast racket sport format, addressing the previous research gap in doubles analysis.

Abstract: Badminton is known as one of the fastest racket sports in the world. Despite doubles matches being more prevalent in international tournaments than singles, previous research has mainly focused on singles due to the challenges in data availability and multi-person tracking. To address this gap, we designed an approach that transfers singles-trained models to doubles analysis. We extracted keypoints from the ShuttleSet single matches dataset using ViT-Pose and embedded them through a contrastive learning framework based on ST-GCN. To improve tracking stability, we incorporated a custom multi-object tracking algorithm that resolves ID switching issues from fast and overlapping player movements. A Transformer-based classifier then determines shot occurrences based on the learned embeddings. Our findings demonstrate the feasibility of extending pose-based shot recognition to doubles badminton, broadening analytics capabilities. This work establishes a foundation for doubles-specific datasets to enhance understanding of this predominant yet understudied format of the fast racket sport.

[89] 2D Gaussians Meet Visual Tokenizer

Yiang Shi, Xiaoyang Guo, Wei Yin, Mingkai Jia, Qian Zhang, Xiaolin Hu, Wenyu Liu, Xinggang Wan

Main category: cs.CV

TL;DR: VGQ is a novel image tokenizer that uses 2D Gaussians to better capture geometric structures, outperforming traditional quantization methods like VQ-GAN in reconstruction quality.

DetailsMotivation: Existing quantization-based tokenizers like VQ-GAN focus primarily on appearance features (texture, color) but neglect geometric structures due to their patch-based design, limiting their ability to model structured visual information.

Method: Proposed Visual Gaussian Quantization (VGQ) framework that encodes image latents as 2D Gaussian distributions, explicitly modeling structure-related parameters such as position, rotation, and scale to capture geometric and spatial structures.

Result: VGQ achieves strong reconstruction quality with rFID score of 1.00 on ImageNet 256x256. With increased Gaussian density, it reaches state-of-the-art performance: rFID score of 0.556 and PSNR of 24.93, substantially outperforming existing methods.

Conclusion: VGQ provides a flexible trade-off between token efficiency and visual richness by incorporating 2D Gaussians, effectively addressing the structural modeling limitations of traditional quantization-based tokenizers.

Abstract: The image tokenizer is a critical component in AR image generation, as it determines how rich and structured visual content is encoded into compact representations. Existing quantization-based tokenizers such as VQ-GAN primarily focus on appearance features like texture and color, often neglecting geometric structures due to their patch-based design. In this work, we explored how to incorporate more visual information into the tokenizer and proposed a new framework named Visual Gaussian Quantization (VGQ), a novel tokenizer paradigm that explicitly enhances structural modeling by integrating 2D Gaussians into traditional visual codebook quantization frameworks. Our approach addresses the inherent limitations of naive quantization methods such as VQ-GAN, which struggle to model structured visual information due to their patch-based design and emphasis on texture and color. In contrast, VGQ encodes image latents as 2D Gaussian distributions, effectively capturing geometric and spatial structures by directly modeling structure-related parameters such as position, rotation and scale. We further demonstrate that increasing the density of 2D Gaussians within the tokens leads to significant gains in reconstruction fidelity, providing a flexible trade-off between token efficiency and visual richness. On the ImageNet 256x256 benchmark, VGQ achieves strong reconstruction quality with an rFID score of 1.00. Furthermore, by increasing the density of 2D Gaussians within the tokens, VGQ gains a significant boost in reconstruction capability and achieves a state-of-the-art reconstruction rFID score of 0.556 and a PSNR of 24.93, substantially outperforming existing methods. Codes will be released soon.
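
As a rough illustration of what it means to encode a latent as a 2D Gaussian with explicit position, rotation, and scale, the sketch below rasterizes one such Gaussian's influence over a latent grid. Parameter names and the grid resolution are illustrative; the paper's codebook quantization and tokenizer machinery are not reproduced here.

```python
# Hypothetical sketch: one token parameterized as an anisotropic 2D Gaussian,
# splatted as a weight map over a latent grid.
import torch

def splat_gaussian(mu, theta, sigma, grid_size: int = 16) -> torch.Tensor:
    """mu: (2,) position in [0, 1]; theta: rotation angle in radians; sigma: (2,) scales."""
    mu, theta, sigma = map(torch.as_tensor, (mu, theta, sigma))
    ys, xs = torch.meshgrid(
        torch.linspace(0, 1, grid_size),
        torch.linspace(0, 1, grid_size),
        indexing="ij",
    )
    coords = torch.stack([xs, ys], dim=-1) - mu                     # (H, W, 2) offsets
    rot = torch.stack([
        torch.stack([torch.cos(theta), -torch.sin(theta)]),
        torch.stack([torch.sin(theta),  torch.cos(theta)]),
    ])
    local = coords @ rot                                            # rotate into the Gaussian frame
    return torch.exp(-0.5 * ((local / sigma) ** 2).sum(dim=-1))     # (H, W) influence weights
```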

[90] Calibrating Biased Distribution in VFM-derived Latent Space via Cross-Domain Geometric Consistency

Yanbiao Ma, Wei Dai, Bowei Liu, Jiayi Chen, Wenke Huang, Guancheng Wan, Zhiwu Lu, Junchi Yan

Main category: cs.CV

TL;DR: Geometric shapes of foundation model features show remarkable transferability across domains. This geometric knowledge is used to calibrate distributions in federated learning and long-tailed recognition, effectively overcoming data heterogeneity and sample imbalance.

DetailsMotivation: To address the gap between observed training samples and true distribution caused by sampling bias and noise, leveraging the transferable geometric properties of foundation model features.

Method: Uses off-the-shelf vision foundation models (CLIP, DINOv2) for feature extraction, then applies geometric knowledge-guided distribution calibration. In federated learning: acquires global geometric shape under privacy constraints to generate new samples. In long-tailed learning: transfers geometric knowledge from sample-rich to sample-scarce classes.

Result: Comprehensive experiments show the framework effectively overcomes information deficits from data heterogeneity and sample imbalance, with boosted performance across benchmarks.

Conclusion: Geometric properties of foundation model features provide valuable transferable knowledge that can be leveraged for distribution calibration in challenging learning scenarios like federated learning and long-tailed recognition.

Abstract: Despite the fast progress of deep learning, one standing challenge is the gap between the observed training samples and the underlying true distribution. There are multiple reasons for this gap, e.g., sampling bias and noise. In the era of foundation models, we show that when leveraging off-the-shelf vision foundation models (e.g., CLIP, DINOv2) for feature extraction, the geometric shapes of the resulting feature distributions exhibit remarkable transferability across domains and datasets. To verify its practical usefulness, we embody our geometric knowledge-guided distribution calibration framework in two popular and challenging settings: federated learning and long-tailed recognition. In the federated setting, we devise a technique for acquiring the global geometric shape under privacy constraints, then leverage this knowledge to generate new samples for clients, with the aim of bridging the gap between local and global observations. In long-tailed learning, the framework utilizes the geometric knowledge transferred from sample-rich categories to recover the true distribution for sample-scarce tail classes. Comprehensive experiments show that our proposed geometric knowledge-guided distribution calibration effectively overcomes information deficits caused by data heterogeneity and sample imbalance, with boosted performance across benchmarks.

[91] Evaluating Open-Source Vision Language Models for Facial Emotion Recognition against Traditional Deep Learning Models

Vamsi Krishna Mulukutla, Sai Supriya Pavarala, Srinivasa Raju Rudraraju, Sridevi Bonthu

Main category: cs.CV

TL;DR: Traditional deep learning models outperform vision-language models on facial emotion recognition tasks with low-quality images, despite novel image restoration integration.

DetailsMotivation: Facial Emotion Recognition is crucial for human-computer interaction and mental health applications, but existing VLMs may not perform well with noisy, low-resolution data typical in real-world FER scenarios.

Method: Empirical comparison of open-source VLMs (Phi-3.5 Vision, CLIP) vs traditional models (VGG19, ResNet-50, EfficientNet-B0) on FER-2013 dataset using a novel pipeline with GFPGAN-based image restoration and comprehensive evaluation metrics.

Result: Traditional models significantly outperformed VLMs: EfficientNet-B0 (86.44%), ResNet-50 (85.72%) vs CLIP (64.07%) and Phi-3.5 Vision (51.66%). Computational cost analysis provided practical deployment insights.

Conclusion: VLMs have limitations in low-quality visual tasks; adaptation to noisy environments is needed. The study provides a reproducible benchmark for future emotion recognition research.

Abstract: Facial Emotion Recognition (FER) is crucial for applications such as human-computer interaction and mental health diagnostics. This study presents the first empirical comparison of open-source Vision-Language Models (VLMs), including Phi-3.5 Vision and CLIP, against traditional deep learning models VGG19, ResNet-50, and EfficientNet-B0 on the challenging FER-2013 dataset, which contains 35,887 low-resolution grayscale images across seven emotion classes. To address the mismatch between VLM training assumptions and the noisy nature of FER data, we introduce a novel pipeline that integrates GFPGAN-based image restoration with FER evaluation. Results show that traditional models, particularly EfficientNet-B0 (86.44%) and ResNet-50 (85.72%), significantly outperform VLMs like CLIP (64.07%) and Phi-3.5 Vision (51.66%), highlighting the limitations of VLMs in low-quality visual tasks. In addition to performance evaluation using precision, recall, F1-score, and accuracy, we provide a detailed computational cost analysis covering preprocessing, training, inference, and evaluation phases, offering practical insights for deployment. This work underscores the need for adapting VLMs to noisy environments and provides a reproducible benchmark for future research in emotion recognition.

[92] EAvatar: Expression-Aware Head Avatar Reconstruction with Generative Geometry Priors

Shikun Zhang, Cunjian Chen, Yiqun Wang, Qiuhong Ke, Yong Li

Main category: cs.CV

TL;DR: EAvatar: A 3DGS-based framework for high-fidelity head avatar reconstruction that improves expression accuracy and texture continuity using sparse expression control and 3D priors.

DetailsMotivation: Existing 3D Gaussian Splatting methods struggle with capturing fine-grained facial expressions and preserving local texture continuity in highly deformable regions for head avatar reconstruction.

Method: Proposes expression-aware and deformation-aware framework with sparse expression control mechanism using key Gaussians to influence neighboring deformation, and leverages 3D priors from pretrained generative models for facial geometry guidance.

Result: Produces more accurate and visually coherent head reconstructions with improved expression controllability and detail fidelity compared to existing methods.

Conclusion: EAvatar effectively addresses limitations in current 3DGS-based head avatar reconstruction by combining sparse expression control with 3D priors, achieving superior performance in expression modeling and texture preservation.

Abstract: High-fidelity head avatar reconstruction plays a crucial role in AR/VR, gaming, and multimedia content creation. Recent advances in 3D Gaussian Splatting (3DGS) have demonstrated effectiveness in modeling complex geometry with real-time rendering capability and are now widely used in high-fidelity head avatar reconstruction tasks. However, existing 3DGS-based methods still face significant challenges in capturing fine-grained facial expressions and preserving local texture continuity, especially in highly deformable regions. To mitigate these limitations, we propose a novel 3DGS-based framework termed EAvatar for head reconstruction that is both expression-aware and deformation-aware. Our method introduces a sparse expression control mechanism, where a small number of key Gaussians are used to influence the deformation of their neighboring Gaussians, enabling accurate modeling of local deformations and fine-scale texture transitions. Furthermore, we leverage high-quality 3D priors from pretrained generative models to provide a more reliable facial geometry, offering structural guidance that improves convergence stability and shape accuracy during training. Experimental results demonstrate that our method produces more accurate and visually coherent head reconstructions with improved expression controllability and detail fidelity.

[93] FLAIR: Frequency- and Locality-Aware Implicit Neural Representations

Sukhun Ko, Dahyeon Kye, Kyle Min, Chanho Eom, Jihyong Oh

Main category: cs.CV

TL;DR: FLAIR introduces frequency- and locality-aware implicit neural representations with RC-GAUSS activation and wavelet-energy-guided encoding to overcome spectral bias and improve signal representation.

DetailsMotivation: Existing implicit neural representations lack frequency selectivity, spatial localization, and sparse representations, leading to spectral bias where they struggle to capture high-frequency details while over-relying on redundant low-frequency components.

Method: Proposes FLAIR with two key innovations: 1) RC-GAUSS activation for explicit frequency selection and spatial localization under time-frequency uncertainty principle constraints, and 2) Wavelet-Energy-Guided Encoding (WEGE) that uses discrete wavelet transform to compute energy scores and explicitly guide frequency information to the network.

Result: The method consistently outperforms existing implicit neural representations in 2D image representation and restoration, as well as 3D reconstruction tasks.

Conclusion: FLAIR successfully addresses the limitations of traditional INRs by incorporating frequency awareness and spatial localization, demonstrating superior performance across multiple vision tasks including 2D and 3D applications.

Abstract: Implicit Neural Representations (INRs) leverage neural networks to map coordinates to corresponding signals, enabling continuous and compact representations. This paradigm has driven significant advances in various vision tasks. However, existing INRs lack frequency selectivity, spatial localization, and sparse representations, leading to an over-reliance on redundant signal components. Consequently, they exhibit spectral bias, tending to learn low-frequency components early while struggling to capture fine high-frequency details. To address these issues, we propose FLAIR (Frequency- and Locality-Aware Implicit Neural Representations), which incorporates two key innovations. The first is RC-GAUSS, a novel activation designed for explicit frequency selection and spatial localization under the constraints of the time-frequency uncertainty principle (TFUP). The second is Wavelet-Energy-Guided Encoding (WEGE), which leverages the discrete wavelet transform (DWT) to compute energy scores and explicitly guide frequency information to the network. Our method consistently outperforms existing INRs in 2D image representation and restoration, as well as 3D reconstruction.
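
The wavelet-energy guidance can be approximated as follows: a discrete wavelet transform decomposes a signal into sub-bands, and the relative energy of the high-frequency bands gives a score for how much fine detail a region carries. Below is a small sketch using PyWavelets; the band weighting and normalization are assumptions rather than the paper's exact WEGE formulation.

```python
# Illustrative sketch: DWT-based energy score for a grayscale image region.
import numpy as np
import pywt

def wavelet_energy_score(image: np.ndarray) -> float:
    """image: 2D grayscale array; returns the fraction of energy in high-frequency bands."""
    cA, (cH, cV, cD) = pywt.dwt2(image, "haar")          # one-level Haar decomposition
    high_freq_energy = np.sum(cH ** 2) + np.sum(cV ** 2) + np.sum(cD ** 2)
    total_energy = high_freq_energy + np.sum(cA ** 2)
    return float(high_freq_energy / (total_energy + 1e-8))
```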

[94] GazeProphet: Software-Only Gaze Prediction for VR Foveated Rendering

Farhaan Ebadulla, Chiraag Mudlapur, Gaurav BV

Main category: cs.CV

TL;DR: GazeProphet is a software-only gaze prediction system for VR that eliminates need for expensive eye tracking hardware, achieving 3.83° median angular error and 24% improvement over traditional methods.

DetailsMotivation: Current foveated rendering approaches require expensive hardware-based eye tracking systems, limiting widespread adoption due to cost, calibration complexity, and hardware compatibility constraints.

Method: Combines Spherical Vision Transformer for processing 360-degree VR scenes with LSTM-based temporal encoder for gaze sequence patterns, using multi-modal fusion network to integrate spatial scene features with temporal gaze dynamics.

Result: Achieves median angular error of 3.83 degrees, outperforming traditional saliency-based baselines by 24% with reliable confidence calibration. Maintains consistent performance across different spatial regions and scene types.

Conclusion: Software-only gaze prediction can effectively work for VR foveated rendering, making performance improvements more accessible across different VR platforms and applications without additional hardware requirements.

Abstract: Foveated rendering significantly reduces computational demands in virtual reality applications by concentrating rendering quality where users focus their gaze. Current approaches require expensive hardware-based eye tracking systems, limiting widespread adoption due to cost, calibration complexity, and hardware compatibility constraints. This paper presents GazeProphet, a software-only approach for predicting gaze locations in VR environments without requiring dedicated eye tracking hardware. The approach combines a Spherical Vision Transformer for processing 360-degree VR scenes with an LSTM-based temporal encoder that captures gaze sequence patterns. A multi-modal fusion network integrates spatial scene features with temporal gaze dynamics to predict future gaze locations with associated confidence estimates. Experimental evaluation on a comprehensive VR dataset demonstrates that GazeProphet achieves a median angular error of 3.83 degrees, outperforming traditional saliency-based baselines by 24% while providing reliable confidence calibration. The approach maintains consistent performance across different spatial regions and scene types, enabling practical deployment in VR systems without additional hardware requirements. Statistical analysis confirms the significance of improvements across all evaluation metrics. These results show that software-only gaze prediction is viable for VR foveated rendering, making its performance benefits accessible across VR platforms and applications.

[95] Rethinking Transformer-Based Blind-Spot Network for Self-Supervised Image Denoising

Junyi Li, Zhilu Zhang, Wangmeng Zuo

Main category: cs.CV

TL;DR: Transformer-based Blind-Spot Network (TBSN) that redesigns channel and spatial attention mechanisms to meet blind-spot requirements for self-supervised image denoising, with knowledge distillation for efficiency.

DetailsMotivation: Transformers show potential for image restoration but their attention mechanisms violate blind-spot requirements, limiting their use in blind-spot networks for self-supervised denoising.

Method: Redesign channel attention by grouping channels and performing attention separately to prevent information leakage; apply masked spatial attention to mimic dilated convolution receptive field; use knowledge distillation to create efficient smaller denoisers.

Result: TBSN extends receptive field and shows favorable performance against state-of-the-art SSID methods on real-world image denoising datasets.

Conclusion: The proposed Transformer-based Blind-Spot Network successfully adapts transformer architectures for self-supervised denoising while maintaining blind-spot constraints, offering both strong local fitting and global perspective capabilities.

Abstract: Blind-spot networks (BSN) have been prevalent neural architectures in self-supervised image denoising (SSID). However, most existing BSNs are built with convolution layers. Although transformers have shown the potential to overcome the limitations of convolutions in many image restoration tasks, the attention mechanisms may violate the blind-spot requirement, thereby restricting their applicability in BSN. To this end, we propose to analyze and redesign the channel and spatial attentions to meet the blind-spot requirement. Specifically, channel self-attention may leak the blind-spot information in multi-scale architectures, since the downsampling shuffles the spatial feature into channel dimensions. To alleviate this problem, we divide the channel into several groups and perform channel attention separately. For spatial self-attention, we apply an elaborate mask to the attention matrix to restrict and mimic the receptive field of dilated convolution. Based on the redesigned channel and window attentions, we build a Transformer-based Blind-Spot Network (TBSN), which shows strong local fitting and global perspective abilities. Furthermore, we introduce a knowledge distillation strategy that distills TBSN into smaller denoisers to improve computational efficiency while maintaining performance. Extensive experiments on real-world image denoising datasets show that TBSN largely extends the receptive field and exhibits favorable performance against state-of-the-art SSID methods.
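
The channel-grouping trick can be sketched as below: channel attention is computed independently inside each group, which limits cross-channel mixing and is how the paper keeps downsampled spatial content from leaking through the blind spot. The group count, the scaling factor, and the 1x1-convolution QKV projection here are assumptions, not the paper's exact layer design.

```python
# Illustrative sketch of group-wise channel attention for a (B, C, H, W) feature map.
import torch
import torch.nn as nn

class GroupedChannelAttention(nn.Module):
    def __init__(self, channels: int, groups: int = 4):
        super().__init__()
        assert channels % groups == 0
        self.groups = groups
        self.to_qkv = nn.Conv2d(channels, channels * 3, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q, k, v = self.to_qkv(x).chunk(3, dim=1)

        def group(t: torch.Tensor) -> torch.Tensor:
            # Attention is computed independently within each channel group.
            return t.view(b, self.groups, c // self.groups, h * w)

        q, k, v = group(q), group(k), group(v)
        attn = torch.softmax(q @ k.transpose(-2, -1) / (h * w) ** 0.5, dim=-1)
        return (attn @ v).view(b, c, h, w)
```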

[96] Color Spike Data Generation via Bio-inspired Neuron-like Encoding with an Artificial Photoreceptor Layer

Hsieh Ching-Teng, Wang Yuan-Kai

Main category: cs.CV

TL;DR: A biologically-inspired neuron-like encoding method that generates spike data with enhanced color and luminance information, improving SNN performance while maintaining neuromorphic principles.

DetailsMotivation: SNNs lag behind CNNs due to limited spike-based information capacity, and current methods using non-spiking inputs deviate from neuromorphic computing's spike-based processing intent.

Method: Proposed Neuron-like Encoding method that generates spike data based on biological neuron principles, enhanced with artificial photoreceptor layer for color and luminance information.

Result: Experimental results using Integrate-and-Fire neuron model show increased information content in spike signals and improved SNN performance.

Conclusion: This biologically inspired approach effectively addresses SNN limitations while adhering to neuromorphic principles, showing strong potential for future neuromorphic computing development.

Abstract: In recent years, neuromorphic computing and spiking neural networks (SNNs) have advanced rapidly through integration with deep learning. However, the performance of SNNs still lags behind that of convolutional neural networks (CNNs), primarily due to the limited information capacity of spike-based data. Although some studies have attempted to improve SNN performance by training them with non-spiking inputs such as static images, this approach deviates from the original intent of neuromorphic computing, which emphasizes spike-based information processing. To address this issue, we propose a Neuron-like Encoding method that generates spike data based on the intrinsic operational principles and functions of biological neurons. This method is further enhanced by the incorporation of an artificial photoreceptor layer, enabling spike data to carry both color and luminance information, thereby forming a complete visual spike signal. Experimental results using the Integrate-and-Fire neuron model demonstrate that this biologically inspired approach effectively increases the information content of spike signals and improves SNN performance, all while adhering to neuromorphic principles. We believe this concept holds strong potential for future development and may contribute to overcoming current limitations in neuromorphic computing, facilitating broader applications of SNNs.
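
For intuition, spike encoding with an integrate-and-fire neuron can be sketched as below: each pixel integrates its normalized intensity at every time step and emits a spike when a threshold is crossed. This omits the paper's photoreceptor-based color pathway; the threshold, step count, and reset-by-subtraction rule are assumptions.

```python
# Simplified integrate-and-fire encoding of a normalized intensity map into spike trains.
import numpy as np

def integrate_and_fire_encode(intensity: np.ndarray,
                              steps: int = 16,
                              threshold: float = 1.0) -> np.ndarray:
    """intensity: array in [0, 1]; returns spikes of shape (steps, *intensity.shape)."""
    membrane = np.zeros_like(intensity, dtype=np.float32)
    spikes = np.zeros((steps,) + intensity.shape, dtype=np.uint8)
    for t in range(steps):
        membrane += intensity                  # integrate the input current
        fired = membrane >= threshold          # spike when the threshold is reached
        spikes[t] = fired
        membrane[fired] -= threshold           # reset by subtraction
    return spikes
```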

[97] DictAS: A Framework for Class-Generalizable Few-Shot Anomaly Segmentation via Dictionary Lookup

Zhen Qu, Xian Tao, Xinyi Gong, ShiChen Qu, Xiaopei Zhang, Xingang Wang, Fei Shen, Zhengtao Zhang, Mukesh Prasad, Guiguang Ding

Main category: cs.CV

TL;DR: DictAS is a novel framework for few-shot anomaly segmentation that uses dictionary lookup capabilities instead of memorizing patterns, enabling detection in unseen object categories without retraining.

DetailsMotivation: Existing vision-language models depend on prior knowledge of real anomaly samples for cross-category generalization, limiting their effectiveness in few-shot anomaly segmentation for unseen classes.

Method: DictAS consists of three components: Dictionary Construction using normal reference images, Dictionary Lookup with sparse retrieval strategy to identify anomalies, and Query Discrimination Regularization with Contrastive Query Constraint and Text Alignment Constraint to enhance discrimination.

Result: Extensive experiments on seven public industrial and medical datasets show that DictAS consistently outperforms state-of-the-art FSAS methods.

Conclusion: The framework successfully transfers dictionary lookup capabilities to anomaly segmentation tasks, enabling effective detection in unseen categories without requiring retraining on target data.

Abstract: Recent vision-language models (e.g., CLIP) have demonstrated remarkable class-generalizable ability to unseen classes in few-shot anomaly segmentation (FSAS), leveraging supervised prompt learning or fine-tuning on seen classes. However, their cross-category generalization largely depends on prior knowledge of real seen anomaly samples. In this paper, we propose a novel framework, namely DictAS, which enables a unified model to detect visual anomalies in unseen object categories without any retraining on the target data, only employing a few normal reference images as visual prompts. The insight behind DictAS is to transfer dictionary lookup capabilities to the FSAS task for unseen classes via self-supervised learning, instead of merely memorizing the normal and abnormal feature patterns from the training set. Specifically, DictAS mainly consists of three components: (1) Dictionary Construction - to simulate the index and content of a real dictionary using features from normal reference images. (2) Dictionary Lookup - to retrieve queried region features from the dictionary via a sparse lookup strategy. When a query feature cannot be retrieved, it is classified as an anomaly. (3) Query Discrimination Regularization - to enhance anomaly discrimination by making abnormal features harder to retrieve from the dictionary. To achieve this, Contrastive Query Constraint and Text Alignment Constraint are further proposed. Extensive experiments on seven public industrial and medical datasets demonstrate that DictAS consistently outperforms state-of-the-art FSAS methods.
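
The dictionary-lookup intuition can be sketched as a top-k retrieval and reconstruction test: query region features that the dictionary of normal reference features cannot reconstruct well receive a high anomaly score. The value of k, the cosine similarity, and the softmax weighting below are assumptions, not the paper's exact sparse lookup strategy.

```python
# Illustrative sketch: score each query feature by how poorly the normal-feature
# dictionary can reconstruct it from its top-k nearest entries.
import torch
import torch.nn.functional as F

def lookup_anomaly_score(query: torch.Tensor,
                         dictionary: torch.Tensor,
                         k: int = 5) -> torch.Tensor:
    """query: (N, D) region features; dictionary: (M, D) normal reference features."""
    q = F.normalize(query, dim=1)
    d = F.normalize(dictionary, dim=1)
    sim = q @ d.t()                                    # (N, M) cosine similarity
    topk_sim, topk_idx = sim.topk(k, dim=1)
    weights = torch.softmax(topk_sim, dim=1)           # sparse, top-k retrieval weights
    recon = torch.einsum("nk,nkd->nd", weights, dictionary[topk_idx])
    # Regions the dictionary cannot reconstruct well are flagged as anomalous.
    return 1.0 - F.cosine_similarity(query, recon, dim=1)
```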

[98] Learnable SMPLify: A Neural Solution for Optimization-Free Human Pose Inverse Kinematics

Yuchen Yang, Linfeng Dong, Wei Wang, Zhihang Zhong, Xiao Sun

Main category: cs.CV

TL;DR: Learnable SMPLify replaces SMPLify’s iterative optimization with a neural network for 200x faster 3D human pose estimation while maintaining accuracy.

DetailsMotivation: SMPLify's high computational cost limits practicality, and recent trends show neural networks can achieve runtime improvements without sacrificing accuracy.

Method: Neural framework using single-pass regression instead of iterative fitting, with temporal sampling for data construction, human-centric normalization, and residual learning for generalization.

Result: Achieves nearly 200x faster runtime compared to SMPLify, generalizes well to unseen datasets (3DPW and RICH), and works as model-agnostic plug-in tool.

Conclusion: Learnable SMPLify establishes itself as a practical and simple baseline for efficient 3D human pose and shape estimation with broad generalization capabilities.

Abstract: In 3D human pose and shape estimation, SMPLify remains a robust baseline that solves inverse kinematics (IK) through iterative optimization. However, its high computational cost limits its practicality. Recent advances across domains have shown that replacing iterative optimization with data-driven neural networks can achieve significant runtime improvements without sacrificing accuracy. Motivated by this trend, we propose Learnable SMPLify, a neural framework that replaces the iterative fitting process in SMPLify with a single-pass regression model. The design of our framework targets two core challenges in neural IK: data construction and generalization. To enable effective training, we propose a temporal sampling strategy that constructs initialization-target pairs from sequential frames. To improve generalization across diverse motions and unseen poses, we propose a human-centric normalization scheme and residual learning to narrow the solution space. Learnable SMPLify supports both sequential inference and plug-in post-processing to refine existing image-based estimators. Extensive experiments demonstrate that our method establishes itself as a practical and simple baseline: it achieves nearly 200x faster runtime compared to SMPLify, generalizes well to unseen 3DPW and RICH, and operates in a model-agnostic manner when used as a plug-in tool on LucidAction. The code is available at https://github.com/Charrrrrlie/Learnable-SMPLify.
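
A hedged sketch of the single-pass residual regression: 3D keypoints are normalized into a body-centric frame (root-centering plus rescaling) and a small network predicts an update to an initial pose estimate, replacing iterative fitting with one forward pass. The layer widths, joint count, root-joint convention, and pose parameterization below are assumptions.

```python
# Illustrative sketch: human-centric normalization + residual pose regression.
import torch
import torch.nn as nn

class ResidualPoseRegressor(nn.Module):
    def __init__(self, num_joints: int = 24, pose_dim: int = 72):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_joints * 3 + pose_dim, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, pose_dim),
        )

    def forward(self, keypoints_3d: torch.Tensor, init_pose: torch.Tensor) -> torch.Tensor:
        """keypoints_3d: (B, J, 3); init_pose: (B, pose_dim) initial estimate."""
        root = keypoints_3d[:, :1]                       # assume joint 0 is the root
        centered = keypoints_3d - root                   # human-centric normalization
        scale = centered.norm(dim=-1).mean(dim=1, keepdim=True).clamp(min=1e-6)
        normed = centered / scale[..., None]
        inp = torch.cat([normed.flatten(1), init_pose], dim=1)
        return init_pose + self.net(inp)                 # residual update in a single pass
```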

[99] Hyperspectral Image Generation with Unmixing Guided Diffusion Model

Shiyu Shen, Bin Pan, Ziye Zhang, Zhenwei Shi

Main category: cs.CV

TL;DR: A diffusion framework for hyperspectral image synthesis that uses hyperspectral unmixing to address high dimensionality and physical constraints, producing high-quality diverse HSIs.

DetailsMotivation: Hyperspectral image synthesis is limited by conditional generative paradigms that restrict sample diversity, and direct extension of diffusion models from RGB to hyperspectral domains faces challenges due to high spectral dimensionality and strict physical constraints.

Method: A diffusion framework guided by hyperspectral unmixing with two components: (1) an unmixing autoencoder that projects generation into low-dimensional abundance manifold, and (2) an abundance diffusion process that enforces non-negativity and sum-to-one constraints for physical consistency.

Result: The method produces hyperspectral images with both high quality and diversity, advancing state-of-the-art in hyperspectral data generation as demonstrated through comprehensive experiments with conventional and proposed metrics.

Conclusion: The proposed diffusion framework successfully overcomes challenges in hyperspectral image synthesis by integrating hyperspectral unmixing guidance, enabling high-fidelity generation while maintaining spectral fidelity and physical consistency.

Abstract: We address hyperspectral image (HSI) synthesis, a problem that has garnered growing interest yet remains constrained by the conditional generative paradigms that limit sample diversity. While diffusion models have emerged as a state-of-the-art solution for high-fidelity image generation, their direct extension from RGB to hyperspectral domains is challenged by the high spectral dimensionality and strict physical constraints inherent to HSIs. To overcome the challenges, we introduce a diffusion framework explicitly guided by hyperspectral unmixing. The approach integrates two collaborative components: (i) an unmixing autoencoder that projects generation from the image domain into a low-dimensional abundance manifold, thereby reducing computational burden while maintaining spectral fidelity; and (ii) an abundance diffusion process that enforces non-negativity and sum-to-one constraints, ensuring physical consistency of the synthesized data. We further propose two evaluation metrics tailored to hyperspectral characteristics. Comprehensive experiments, assessed with both conventional measures and the proposed metrics, demonstrate that our method produces HSIs with both high quality and diversity, advancing the state of the art in hyperspectral data generation.
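
The physical constraints mentioned above (non-negative abundances that sum to one) can be enforced by construction with a softmax over the endmember axis, after which the synthesized HSI follows a linear mixing of endmember spectra. The sketch below is illustrative; the tensor shapes and the linear mixing model are assumptions about how generation in abundance space maps back to the image domain.

```python
# Illustrative sketch: constrain abundances and map them back to a hyperspectral cube.
import torch

def abundances_to_hsi(raw_abundance: torch.Tensor,
                      endmembers: torch.Tensor) -> torch.Tensor:
    """raw_abundance: (B, E, H, W) unconstrained; endmembers: (E, C) spectra."""
    abundance = torch.softmax(raw_abundance, dim=1)   # non-negative, sums to one over E
    # Linear mixing: each pixel's spectrum is a convex combination of endmember spectra.
    return torch.einsum("behw,ec->bchw", abundance, endmembers)
```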

[100] The 9th AI City Challenge

Zheng Tang, Shuo Wang, David C. Anastasiu, Ming-Ching Chang, Anuj Sharma, Quan Kong, Norimasa Kobori, Munkhjargal Gochoo, Ganzorig Batnasan, Munkh-Erdene Otgonbold, Fady Alnajjar, Jun-Wei Hsieh, Tomasz Kornuta, Xiaolong Li, Yilin Zhao, Han Zhang, Subhashree Radhakrishnan, Arihant Jain, Ratnesh Kumar, Vidya N. Murali, Yuxing Wang, Sameer Satish Pusegaonkar, Yizhou Wang, Sujit Biswas, Xunlei Wu, Zhedong Zheng, Pranamesh Chakraborty, Rama Chellappa

Main category: cs.CV

TL;DR: The 2025 AI City Challenge featured 4 tracks with increased participation, focusing on multi-class 3D tracking, video QA for traffic safety, spatial reasoning in warehouses, and efficient fisheye object detection, with teams achieving top results using Omniverse-generated datasets.

DetailsMotivation: To advance real-world applications of computer vision and AI in transportation, industrial automation, and public safety through competitive benchmarking and dataset development.

Method: Four challenge tracks: 1) Multi-class 3D multi-camera tracking with detailed calibration, 2) Video question answering with 3D gaze labels, 3) Fine-grained spatial reasoning using RGB-D inputs, 4) Efficient fisheye object detection for edge devices. Used NVIDIA Omniverse for dataset generation and enforced submission limits with partially held-out test sets.

Result: 17% increase in participation with 245 teams from 15 countries, over 30,000 dataset downloads, and several teams achieving top-tier results setting new benchmarks in multiple tasks.

Conclusion: The challenge successfully fostered reproducibility, mitigated overfitting, and advanced state-of-the-art performance in computer vision applications for transportation and industrial automation through comprehensive benchmarking and high-quality dataset releases.

Abstract: The ninth AI City Challenge continues to advance real-world applications of computer vision and AI in transportation, industrial automation, and public safety. The 2025 edition featured four tracks and saw a 17% increase in participation, with 245 teams from 15 countries registered on the evaluation server. Public release of challenge datasets led to over 30,000 downloads to date. Track 1 focused on multi-class 3D multi-camera tracking, involving people, humanoids, autonomous mobile robots, and forklifts, using detailed calibration and 3D bounding box annotations. Track 2 tackled video question answering in traffic safety, with multi-camera incident understanding enriched by 3D gaze labels. Track 3 addressed fine-grained spatial reasoning in dynamic warehouse environments, requiring AI systems to interpret RGB-D inputs and answer spatial questions that combine perception, geometry, and language. Both Track 1 and Track 3 datasets were generated in NVIDIA Omniverse. Track 4 emphasized efficient road object detection from fisheye cameras, supporting lightweight, real-time deployment on edge devices. The evaluation framework enforced submission limits and used a partially held-out test set to ensure fair benchmarking. Final rankings were revealed after the competition concluded, fostering reproducibility and mitigating overfitting. Several teams achieved top-tier results, setting new benchmarks in multiple tasks.

[101] Boosting Adversarial Transferability for Hyperspectral Image Classification Using 3D Structure-invariant Transformation and Weighted Intermediate Feature Divergence

Chun Liu, Bingqian Zhu, Tao Xu, Zheng Zheng, Zheng Li, Wei Yang, Zhigang Han, Jiayao Wang

Main category: cs.CV

TL;DR: Proposes a novel adversarial attack method for hyperspectral image classification that enhances transferability using 3D structure-invariant transformations and weighted intermediate feature divergence.

DetailsMotivation: Deep Neural Networks are vulnerable to adversarial attacks, and hyperspectral images present unique challenges due to their high-dimensional spectral information that differs from natural images.

Method: Divides HSIs into 3D blocks in spatial and spectral dimensions, applies various transformations while keeping structure invariant, and uses weighted intermediate feature divergence loss to constrain perturbation direction and assign different weights to feature channels.

Result: Extensive experiments show the method achieves more effective adversarial transferability on three public HSI datasets and maintains robust attack performance even under defense strategies.

Conclusion: The proposed method successfully addresses the specific challenges of generating adversarial examples for hyperspectral images and demonstrates superior transferability compared to existing approaches.

Abstract: Deep Neural Networks (DNNs) are vulnerable to adversarial attacks, which pose security challenges to hyperspectral image (HSI) classification based on DNNs. Numerous adversarial attack methods have been designed in the domain of natural images. However, unlike natural images, HSIs contain high-dimensional, rich spectral information, which presents new challenges for generating adversarial examples. Based on the specific characteristics of HSIs, this paper proposes a novel method to enhance the transferability of adversarial examples for HSI classification using 3D structure-invariant transformation and weighted intermediate feature divergence. While keeping the HSI structure invariant, the proposed method divides the image into blocks in both spatial and spectral dimensions. Then, various transformations are applied to each block to increase input diversity and mitigate overfitting to substitute models. Moreover, a weighted intermediate feature divergence loss is also designed by leveraging the differences between the intermediate features of original and adversarial examples. It constrains the perturbation direction by enlarging the feature maps of the original examples, and assigns different weights to different feature channels to destroy the features that have a greater impact on HSI classification. Extensive experiments demonstrate that the adversarial examples generated by the proposed method achieve more effective adversarial transferability on three public HSI datasets. Furthermore, the method maintains robust attack performance even under defense strategies.
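
A rough sketch of the structure-invariant, block-wise transformation: the HSI cube is partitioned into spatial-spectral blocks and each block receives an independent intensity perturbation while the cube's layout stays fixed, which diversifies the attack inputs. The block sizes and the particular transforms (random scaling, additive noise, identity) are assumptions for illustration.

```python
# Illustrative sketch: independent per-block intensity transforms over an HSI cube.
import numpy as np

def block_transform(hsi: np.ndarray, spatial_block: int = 32,
                    spectral_block: int = 16, rng=None) -> np.ndarray:
    """hsi: (H, W, C) hyperspectral cube; returns a transformed copy with the same layout."""
    rng = np.random.default_rng() if rng is None else rng
    out = hsi.astype(np.float32, copy=True)
    h, w, c = out.shape
    for y in range(0, h, spatial_block):
        for x in range(0, w, spatial_block):
            for z in range(0, c, spectral_block):
                block = out[y:y + spatial_block, x:x + spatial_block, z:z + spectral_block]
                choice = rng.integers(0, 3)
                if choice == 0:
                    block *= rng.uniform(0.8, 1.2)                 # random intensity scaling
                elif choice == 1:
                    block += rng.normal(0, 0.01, block.shape)      # small additive noise
                # choice == 2: leave the block unchanged
    return out
```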

[102] Generative Model-Based Feature Attention Module for Video Action Analysis

Guiqin Wang, Peng Zhao, Cong Zhao, Jing Huang, Siyan Guo, Shusen Yang

Main category: cs.CV

TL;DR: Proposes a generative attention-based model for video action analysis that learns feature semantics relations by leveraging foreground-background differences, improving performance for IoT applications like autonomous driving.

DetailsMotivation: Existing video action analysis methods overlook feature semantics and focus on action proposal optimization, making them unsuitable for high-performance IoT applications that require precision and scalability.

Method: A novel generative attention-based model that simultaneously learns frame- and segment-dependencies of temporal action feature semantics by leveraging foreground-background differences in actions.

Result: Extensive experiments on action recognition and action detection benchmarks show superior performance, with comprehensive validation on widely recognized datasets.

Conclusion: The proposed model effectively utilizes feature semantics in feature extraction and demonstrates strong performance across multiple video action analysis tasks, making it suitable for high-performance IoT applications.

Abstract: Video action analysis is a foundational technology within the realm of intelligent video comprehension, particularly concerning its application in the Internet of Things (IoT). However, existing methodologies overlook feature semantics during feature extraction and focus on optimizing action proposals; due to their limited precision, these solutions are unsuitable for widespread adoption in high-performance IoT applications, such as autonomous driving, which necessitate robust and scalable intelligent video analytics. To address this issue, we propose a novel generative attention-based model to learn the relations of feature semantics. Specifically, by leveraging the differences between actions' foreground and background, our model simultaneously learns the frame- and segment-level dependencies of temporal action feature semantics, making effective use of feature semantics during feature extraction. To evaluate the effectiveness of our model, we conduct extensive experiments on two benchmark video tasks, action recognition and action detection. In the context of action detection tasks, we substantiate the superiority of our approach through comprehensive validation on widely recognized datasets. Moreover, we extend the validation of the effectiveness of our proposed method to a broader task, video action recognition. Our code is available at https://github.com/Generative-Feature-Model/GAF.

[103] Upsample What Matters: Region-Adaptive Latent Sampling for Accelerated Diffusion Transformers

Wongi Jeong, Kyungryeol Lee, Hoigi Seo, Se Young Chun

Main category: cs.CV

TL;DR: RALU accelerates diffusion transformers by performing mixed-resolution sampling across three stages: low-resolution denoising, region-adaptive upsampling, and full-resolution refinement, achieving up to 7x speed-up with minimal quality degradation.

DetailsMotivation: Diffusion transformers offer superior scalability for image/video generation but suffer from heavy computation that hinders real-world deployment. Existing acceleration methods focus on temporal dimension optimization, leaving spatial acceleration unexplored.

Method: Region-Adaptive Latent Upsampling (RALU) - a training-free framework with three stages: 1) low-resolution denoising to capture global semantics, 2) region-adaptive upsampling on artifact-prone areas, 3) full-resolution latent upsampling for detail refinement. Uses noise-timestep rescheduling for stable resolution transitions.

Result: Achieves up to 7.0x speed-up on FLUX and 3.0x on Stable Diffusion 3 with minimal quality degradation. The method is complementary to existing temporal acceleration techniques and can be integrated for further latency reduction.

Conclusion: RALU provides an effective spatial acceleration approach for diffusion transformers that significantly reduces computation while maintaining generation quality, and can be combined with temporal optimization methods for even greater efficiency gains.

Abstract: Diffusion transformers have emerged as an alternative to U-net-based diffusion models for high-fidelity image and video generation, offering superior scalability. However, their heavy computation remains a major obstacle to real-world deployment. Existing acceleration methods primarily exploit the temporal dimension, such as reusing cached features across diffusion timesteps. Here, we propose Region-Adaptive Latent Upsampling (RALU), a training-free framework that accelerates inference along the spatial dimension. RALU performs mixed-resolution sampling across three stages: 1) low-resolution denoising latent diffusion to efficiently capture global semantic structure, 2) region-adaptive upsampling on specific regions prone to artifacts at full-resolution, and 3) all latent upsampling at full-resolution for detail refinement. To stabilize generations across resolution transitions, we leverage noise-timestep rescheduling to adapt the noise level across varying resolutions. Our method significantly reduces computation while preserving image quality by achieving up to 7.0× speed-up on FLUX and 3.0× on Stable Diffusion 3 with minimal degradation. Furthermore, RALU is complementary to existing temporal accelerations such as caching methods, and thus can be seamlessly integrated to further reduce inference latency without compromising generation quality.
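
The resolution transition at the heart of this pipeline can be pictured as: upsample the partially denoised latent, then re-inject noise at the level the rescheduled timestep expects, restricted by a mask to the regions flagged as artifact-prone. The sketch below is a simplification; the sigma value, the mask source, and the bilinear upsampling are assumptions rather than the paper's exact procedure.

```python
# Illustrative sketch of an upsample-then-renoise resolution transition.
import torch
import torch.nn.functional as F

def upsample_and_renoise(latent: torch.Tensor, region_mask: torch.Tensor,
                         target_sigma: float) -> torch.Tensor:
    """latent: (B, C, h, w); region_mask: (B, 1, 2h, 2w) in [0, 1]."""
    up = F.interpolate(latent, scale_factor=2, mode="bilinear", align_corners=False)
    noise = torch.randn_like(up) * target_sigma
    # Re-noise only where upsampling artifacts are likely, per the region mask.
    return up + noise * region_mask
```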

[104] Temporal-Conditional Referring Video Object Segmentation with Noise-Free Text-to-Video Diffusion Model

Ruixin Zhang, Jiaqing Fan, Yifan Liao, Qian Qiao, Fanzhang Li

Main category: cs.CV

TL;DR: A new RVOS model that improves segmentation head design, integrates text-to-video diffusion features, removes noise prediction, and adds a TCMR module, achieving state-of-the-art results on four benchmarks.

DetailsMotivation: Recent RVOS approaches focus too much on feature extraction and temporal modeling while neglecting segmentation head design, leaving room for improvement in boundary segmentation capability.

Method: Proposes Temporal-Conditional RVOS model that integrates existing segmentation methods, uses text-to-video diffusion for feature extraction, removes noise prediction module to avoid randomness, and adds Temporal Context Mask Refinement (TCMR) module to overcome VAE limitations.

Result: The method achieves state-of-the-art performance on four public RVOS benchmarks, demonstrating improved segmentation quality and accuracy.

Conclusion: The proposed approach effectively addresses segmentation head design limitations in RVOS, simplifies the model while improving performance, and significantly enhances segmentation quality without complex designs.

Abstract: Referring Video Object Segmentation (RVOS) aims to segment specific objects in a video according to textual descriptions. We observe that recent RVOS approaches often place excessive emphasis on feature extraction and temporal modeling, while relatively neglecting the design of the segmentation head. In fact, there remains considerable room for improvement in segmentation head design. To address this, we propose a Temporal-Conditional Referring Video Object Segmentation model, which innovatively integrates existing segmentation methods to effectively enhance boundary segmentation capability. Furthermore, our model leverages a text-to-video diffusion model for feature extraction. On top of this, we remove the traditional noise prediction module to avoid the randomness of noise from degrading segmentation accuracy, thereby simplifying the model while improving performance. Finally, to overcome the limited feature extraction capability of the VAE, we design a Temporal Context Mask Refinement (TCMR) module, which significantly improves segmentation quality without introducing complex designs. We evaluate our method on four public RVOS benchmarks, where it consistently achieves state-of-the-art performance.

[105] Automatic Image Colorization with Convolutional Neural Networks and Generative Adversarial Networks

Changyuan Qiu, Hangrui Cao, Qihan Ren, Ruiyu Li, Yuqing Qiu

Main category: cs.CV

TL;DR: Automatic image colorization using classification and adversarial learning approaches to address the multi-modal nature of color prediction.

DetailsMotivation: Traditional colorization methods treat it as a regression task which ignores the multi-modal nature of color prediction. The problem is challenging due to being highly ill-posed with two out of three image dimensions lost, but scene semantics and texture provide important color cues.

Method: Build models on prior works, apply modifications for specific scenarios, and use classification and adversarial learning approaches instead of regression to handle the multi-modal nature of color prediction.

Result: The paper proposes exploring alternative approaches to colorization but does not present specific experimental results in this abstract.

Conclusion: Classification and adversarial learning frameworks are more suitable for image colorization than regression approaches due to the multi-modal nature of the problem, and building upon prior works with modifications can lead to improved performance.

Abstract: Image colorization, the task of adding colors to grayscale images, has been the focus of significant research efforts in computer vision in recent years for its various application areas such as color restoration and automatic animation colorization [15, 1]. The colorization problem is challenging as it is highly ill-posed with two out of three image dimensions lost, resulting in large degrees of freedom. However, semantics of the scene as well as the surface texture could provide important cues for colors: the sky is typically blue, the clouds are typically white and the grass is typically green, and there are huge amounts of training data available for learning such priors since any colored image could serve as a training data point [20]. Colorization is initially formulated as a regression task[5], which ignores the multi-modal nature of color prediction. In this project, we explore automatic image colorization via classification and adversarial learning. We will build our models on prior works, apply modifications for our specific scenario and make comparisons.
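
The classification formulation discussed above can be made concrete by quantizing the ab chrominance plane into bins and training with per-pixel cross-entropy, which preserves the multi-modality of plausible colors that a regression loss averages away. The bin width and grid below are assumptions for illustration, not the project's exact setup.

```python
# Illustrative sketch: colorization as per-pixel classification over quantized ab bins.
import torch
import torch.nn.functional as F

NUM_BINS_PER_AXIS = 22          # e.g. ab values in [-110, 110) with bin width 10

def ab_to_class(ab: torch.Tensor) -> torch.Tensor:
    """ab: (B, 2, H, W) chrominance -> (B, H, W) integer bin labels."""
    idx = ((ab + 110) / 10).long().clamp(0, NUM_BINS_PER_AXIS - 1)
    return idx[:, 0] * NUM_BINS_PER_AXIS + idx[:, 1]

def colorization_loss(logits: torch.Tensor, ab_target: torch.Tensor) -> torch.Tensor:
    """logits: (B, NUM_BINS_PER_AXIS**2, H, W) per-pixel class scores."""
    return F.cross_entropy(logits, ab_to_class(ab_target))
```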

[106] Bridging Clear and Adverse Driving Conditions

Yoel Shapiro, Yahia Showgan, Koustav Mullick

Main category: cs.CV

TL;DR: Proposed domain adaptation pipeline transforms clear-weather images into adverse conditions (fog, rain, snow, nighttime) using hybrid diffusion-GAN approaches to improve autonomous driving performance without costly real data collection.

DetailsMotivation: Autonomous driving systems perform poorly in adverse weather conditions due to underrepresentation in datasets, and collecting/annotating real adverse weather data is prohibitively expensive.

Method: Developed multiple data-generation pipelines including simulation-only, GAN-based, and hybrid diffusion-GAN approaches. Extended existing DA GAN with auxiliary inputs, created training recipe using both simulated and real images, and introduced adaptive blending method to reduce hallucinations in Stable-Diffusion outputs.

Result: Achieved 1.85% overall improvement in semantic segmentation and 4.62% improvement specifically on nighttime conditions when evaluated on the Adverse Conditions Dataset with Correspondences (ACDC).

Conclusion: The hybrid domain adaptation method effectively bridges the simulation-to-real gap and significantly enhances autonomous driving perception performance under challenging adverse weather conditions.

Abstract: Autonomous Driving (AD) systems exhibit markedly degraded performance under adverse environmental conditions, such as low illumination and precipitation. The underrepresentation of adverse conditions in AD datasets makes it challenging to address this deficiency. To circumvent the prohibitive cost of acquiring and annotating adverse weather data, we propose a novel Domain Adaptation (DA) pipeline that transforms clear-weather images into fog, rain, snow, and nighttime images. Here, we systematically develop and evaluate several novel data-generation pipelines, including simulation-only, GAN-based, and hybrid diffusion-GAN approaches, to synthesize photorealistic adverse images from labelled clear images. We leverage an existing DA GAN, extend it to support auxiliary inputs, and develop a novel training recipe that leverages both simulated and real images. The simulated images facilitate exact supervision by providing perfectly matched image pairs, while the real images help bridge the simulation-to-real (sim2real) gap. We further introduce a method to mitigate hallucinations and artifacts in Stable-Diffusion Image-to-Image (img2img) outputs by blending them adaptively with their progenitor images. We finetune downstream models on our synthetic data and evaluate them on the Adverse Conditions Dataset with Correspondences (ACDC). We achieve 1.85 percent overall improvement in semantic segmentation, and 4.62 percent on nighttime, demonstrating the efficacy of our hybrid method for robust AD perception under challenging conditions.
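
The adaptive blending step can be sketched as a per-pixel mixture between the img2img output and its progenitor, with the mixing weight driven by how strongly the generated image departs from the original, so that regions of extreme divergence fall back toward the progenitor. Using the smoothed difference magnitude as the blending signal is an assumption, not the paper's exact criterion.

```python
# Illustrative sketch: blend a generated image with its progenitor based on local divergence.
import torch
import torch.nn.functional as F

def adaptive_blend(progenitor: torch.Tensor, generated: torch.Tensor,
                   sensitivity: float = 4.0) -> torch.Tensor:
    """Both inputs: (B, 3, H, W) in [0, 1]."""
    diff = (generated - progenitor).abs().mean(dim=1, keepdim=True)   # (B, 1, H, W)
    diff = F.avg_pool2d(diff, kernel_size=7, stride=1, padding=3)     # smooth locally
    alpha = torch.exp(-sensitivity * diff)
    # alpha is near 1 where the change is mild (keep the generated weather effect)
    # and shrinks where the output diverges strongly (fall back to the progenitor).
    return alpha * generated + (1 - alpha) * progenitor
```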

[107] RotBench: Evaluating Multimodal Large Language Models on Identifying Image Rotation

Tianyi Niu, Jaemin Cho, Elias Stengel-Eskin, Mohit Bansal

Main category: cs.CV

TL;DR: MLLMs struggle with image rotation identification, particularly distinguishing 90° vs 270° rotations, despite being a simple task that humans perform easily.

DetailsMotivation: To evaluate the spatial reasoning capabilities of Multimodal Large Language Models by testing their ability to identify image rotations, which requires robust visual reasoning to detect rotational cues and contextualize spatial relationships.

Method: Created RotBench - a 350-image benchmark with lifestyle, portrait, and landscape images. Tested state-of-the-art MLLMs (GPT-5, o3, Gemini-2.5-Pro) on identifying 0°, 90°, 180°, and 270° rotations. Used auxiliary information (captions, depth maps), chain-of-thought prompting, simultaneous multi-orientation presentation, and voting setups.

Result: Most models reliably identify 0° images, some identify 180° images, but none can reliably distinguish 90° vs 270° rotations. Auxiliary information and prompting provided only small improvements. Fine-tuning improved 180° identification but not 90°/270° distinction.

Conclusion: There’s a significant gap between MLLMs’ spatial reasoning capabilities and human perception in rotation identification, revealing fundamental limitations in current multimodal models’ visual reasoning abilities.

Abstract: We investigate to what extent Multimodal Large Language Models (MLLMs) can accurately identify the orientation of input images rotated 0°, 90°, 180°, and 270°. This task demands robust visual reasoning capabilities to detect rotational cues and contextualize spatial relationships within images, regardless of their orientation. To evaluate MLLMs on these abilities, we introduce RotBench – a 350-image manually-filtered benchmark comprising lifestyle, portrait, and landscape images. Despite the relatively simple nature of this task, we show that several state-of-the-art open and proprietary MLLMs, including GPT-5, o3, and Gemini-2.5-Pro, do not reliably identify rotation in input images. Providing models with auxiliary information – including captions, depth maps, and more – or using chain-of-thought prompting offers only small and inconsistent improvements. Our results indicate that most models are able to reliably identify right-side-up (0°) images, while certain models are able to identify upside-down (180°) images. None can reliably distinguish between 90° and 270°. Simultaneously showing the image rotated in different orientations leads to moderate performance gains for reasoning models, while a modified setup using voting improves the performance of weaker models. We further show that fine-tuning does not improve models’ ability to distinguish 90° and 270° rotations, despite substantially improving the identification of 180° images. Together, these results reveal a significant gap between MLLMs’ spatial reasoning capabilities and human perception in identifying rotation.
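
A rotation-identification evaluation of this kind is straightforward to reproduce in outline. The sketch below rotates each image through the four candidate angles and scores a user-supplied `predict_rotation` callable (a stand-in for an MLLM query); it is a hedged illustration of the protocol, not the RotBench harness itself.

```python
from PIL import Image

ROTATIONS = (0, 90, 180, 270)

def rotation_accuracy(image_paths, predict_rotation):
    """predict_rotation(img) is a user-supplied callable (e.g. an MLLM API call)
    that returns one of 0, 90, 180, 270 for a PIL image.
    Returns per-angle identification accuracy."""
    stats = {angle: [0, 0] for angle in ROTATIONS}            # angle -> [correct, total]
    for path in image_paths:
        original = Image.open(path).convert("RGB")
        for angle in ROTATIONS:
            rotated = original.rotate(angle, expand=True)     # counter-clockwise rotation
            prediction = predict_rotation(rotated)
            stats[angle][1] += 1
            stats[angle][0] += int(prediction == angle)
    return {angle: correct / total for angle, (correct, total) in stats.items()}
```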

[108] Towards Efficient Vision State Space Models via Token Merging

Jinyoung Park, Minseok Son, Changick Kim

Main category: cs.CV

TL;DR: MaMe is a token-merging strategy designed specifically for State Space Models (SSMs) in vision tasks that uses state transition parameters to measure token importance and preserves sequential information flow, achieving superior efficiency-performance trade-offs across image, video, and audio domains.

DetailsMotivation: Improving computational efficiency of State Space Models for practical deployment while maintaining their unique sequential modeling capabilities, as existing token reduction methods don't properly address SSMs' sequential properties.

Method: Proposes MaMe strategy that leverages state transition parameter Δ as an informativeness measure for tokens and introduces strategic token arrangements to preserve sequential information flow during token merging.

Result: Achieves superior efficiency-performance trade-offs, maintains robustness under aggressive token reduction where other methods fail, and demonstrates strong generalization across image classification, video, and audio domains.

Conclusion: MaMe establishes an effective approach for enhancing efficiency in diverse SSM applications while preserving the sequential modeling capabilities that make SSMs powerful architectures.

Abstract: State Space Models (SSMs) have emerged as powerful architectures in computer vision, yet improving their computational efficiency remains crucial for practical and scalable deployment. While token reduction serves as an effective approach for model efficiency, applying it to SSMs requires careful consideration of their unique sequential modeling capabilities. In this work, we propose MaMe, a token-merging strategy tailored for SSM-based vision models. MaMe addresses two key challenges: quantifying token importance and preserving sequential properties. Our approach leverages the state transition parameter Δ as an informativeness measure and introduces strategic token arrangements to preserve sequential information flow. Extensive experiments demonstrate that MaMe achieves superior efficiency-performance trade-offs for both fine-tuned and off-the-shelf models. Particularly, our approach maintains robustness even under aggressive token reduction where existing methods undergo significant performance degradation. Beyond image classification, MaMe shows strong generalization capabilities across video and audio domains, establishing an effective approach for enhancing efficiency in diverse SSM applications.
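
To make the Δ-guided merging idea concrete, here is a small, hedged sketch: it treats the per-token magnitude of the state-transition parameter as an importance score, keeps the top-scoring tokens in their original order, and averages each discarded token into the most recent kept one so the sequence is compressed rather than truncated. The function name and the merge rule are illustrative assumptions, not MaMe's exact procedure.

```python
import torch

def merge_by_delta(tokens: torch.Tensor, delta: torch.Tensor, keep: int) -> torch.Tensor:
    """tokens: (N, D) sequence features; delta: (N,) per-token magnitude of the
    state-transition parameter, used as an informativeness score.
    The `keep` highest-scoring tokens are retained in their original order; every
    other token is averaged into the most recent retained token so the sequence
    is compressed rather than simply truncated."""
    n, _ = tokens.shape
    keep_idx = torch.topk(delta, k=keep).indices.sort().values
    kept = torch.zeros(n, dtype=torch.bool, device=tokens.device)
    kept[keep_idx] = True

    merged = tokens.clone()
    counts = torch.ones(n, 1, device=tokens.device)
    anchor = int(keep_idx[0])          # tokens before the first kept one merge into it
    for i in range(n):
        if kept[i]:
            anchor = i
        else:
            merged[anchor] += tokens[i]
            counts[anchor] += 1
    return (merged / counts)[keep_idx]
```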

[109] Unleashing Semantic and Geometric Priors for 3D Scene Completion

Shiyuan Chen, Wei Sui, Bohao Zhang, Zeyd Boukhers, John See, Cong Yang

Main category: cs.CV

TL;DR: FoundationSSC is a novel 3D semantic scene completion framework that uses dual decoupling at source and pathway levels to separate semantic and geometric processing, achieving state-of-the-art performance on benchmark datasets.

DetailsMotivation: Existing camera-based 3D semantic scene completion methods rely on coupled encoders that force trade-offs between conflicting semantic and geometric demands, limiting overall performance.

Method: Proposes dual decoupling: at source level using foundation encoder for semantic features and stereo cost volumes; at pathway level with specialized decoupled pathways. Uses hybrid view transformation and novel Axis-Aware Fusion module to anisotropically merge features.

Result: Achieves +0.23 mIoU and +2.03 IoU improvements on SemanticKITTI, and state-of-the-art 21.78 mIoU and 48.61 IoU on SSCBench-KITTI-360.

Conclusion: The dual-decoupling design effectively addresses the conflict between semantic and geometric processing, enabling simultaneous improvements in both aspects and superior overall performance in 3D semantic scene completion.

Abstract: Camera-based 3D semantic scene completion (SSC) provides dense geometric and semantic perception for autonomous driving and robotic navigation. However, existing methods rely on a coupled encoder to deliver both semantic and geometric priors, which forces the model to make a trade-off between conflicting demands and limits its overall performance. To tackle these challenges, we propose FoundationSSC, a novel framework that performs dual decoupling at both the source and pathway levels. At the source level, we introduce a foundation encoder that provides rich semantic feature priors for the semantic branch and high-fidelity stereo cost volumes for the geometric branch. At the pathway level, these priors are refined through specialised, decoupled pathways, yielding superior semantic context and depth distributions. Our dual-decoupling design produces disentangled and refined inputs, which are then utilised by a hybrid view transformation to generate complementary 3D features. Additionally, we introduce a novel Axis-Aware Fusion (AAF) module that addresses the often-overlooked challenge of fusing these features by anisotropically merging them into a unified representation. Extensive experiments demonstrate the advantages of FoundationSSC, achieving simultaneous improvements in both semantic and geometric metrics, surpassing prior bests by +0.23 mIoU and +2.03 IoU on SemanticKITTI. Additionally, we achieve state-of-the-art performance on SSCBench-KITTI-360, with 21.78 mIoU and 48.61 IoU. The code will be released upon acceptance.

[110] PersonaVlog: Personalized Multimodal Vlog Generation with Multi-Agent Collaboration and Iterative Self-Correction

Xiaolu Hou, Bing Ma, Jiaxiang Cheng, Xuhua Ren, Kai Yu, Wenyue Li, Tianxiang Zheng, Qinglin Lu

Main category: cs.CV

TL;DR: PersonaVlog is an automated multimodal Vlog generation framework that uses MLLM-based multi-agent collaboration to create personalized Vlogs with videos, music, and speech from theme and image inputs, featuring iterative self-correction and a new evaluation benchmark.

DetailsMotivation: Growing demand for short videos and personalized content, with existing methods relying on predefined scripts lacking dynamism and personal expression, creating need for automated Vlog generation with effective multimodal collaboration and high personalization.

Method: Multi-agent collaboration framework based on Multimodal Large Language Models (MLLMs) that generates high-quality prompts for multimodal content creation, incorporates feedback and rollback mechanism for iterative self-correction, and includes ThemeVlogEval benchmarking framework.

Result: Comprehensive experiments demonstrate significant advantages and potential over several baselines, showing effectiveness for generating automated Vlogs with improved efficiency and creativity.

Conclusion: The proposed PersonaVlog framework effectively addresses the need for automated, personalized Vlog generation with multimodal collaboration and self-correction capabilities, showing great potential for automated content creation.

Abstract: With the growing demand for short videos and personalized content, automated Video Log (Vlog) generation has become a key direction in multimodal content creation. Existing methods mostly rely on predefined scripts, lacking dynamism and personal expression. Therefore, there is an urgent need for an automated Vlog generation approach that enables effective multimodal collaboration and high personalization. To this end, we propose PersonaVlog, an automated multimodal stylized Vlog generation framework that can produce personalized Vlogs featuring videos, background music, and inner monologue speech based on a given theme and reference image. Specifically, we propose a multi-agent collaboration framework based on Multimodal Large Language Models (MLLMs). This framework efficiently generates high-quality prompts for multimodal content creation based on user input, thereby improving the efficiency and creativity of the process. In addition, we incorporate a feedback and rollback mechanism that leverages MLLMs to evaluate and provide feedback on generated results, thereby enabling iterative self-correction of multimodal content. We also propose ThemeVlogEval, a theme-based automated benchmarking framework that provides standardized metrics and datasets for fair evaluation. Comprehensive experiments demonstrate the significant advantages and potential of our framework over several baselines, highlighting its effectiveness and great potential for generating automated Vlogs.
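
The feedback-and-rollback loop can be summarized with a short control-flow sketch. Everything here is hypothetical glue code: `agents` is assumed to expose `write_prompts`, `render`, and `critique` callables (the last returning a score and textual feedback from an MLLM judge); these names are stand-ins rather than the paper's actual interfaces.

```python
def generate_with_rollback(theme, reference_image, agents, max_rounds=3, accept_score=0.8):
    """Iterative self-correction loop (hypothetical interfaces).

    `agents.write_prompts` drafts multimodal prompts, `agents.render` produces the
    Vlog (video, music, speech), and `agents.critique` is an MLLM judge returning
    a score in [0, 1] plus textual feedback."""
    prompts = agents.write_prompts(theme, reference_image)
    best_score, best_draft = -1.0, None
    for _ in range(max_rounds):
        draft = agents.render(prompts)
        score, feedback = agents.critique(draft, theme)
        if score > best_score:
            best_score, best_draft = score, draft
        if score >= accept_score:
            break                                     # accept the current draft
        # roll back: revise the prompts using the judge's feedback and retry
        prompts = agents.write_prompts(theme, reference_image, feedback=feedback)
    return best_draft
```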

[111] Two-Factor Authentication Smart Entryway Using Modified LBPH Algorithm

Zakiah Ayop, Wan Mohamad Hariz Bin Wan Mohamad Rosdi, Looi Wei Hua, Syarulnaziah Anawar, Nur Fadzilah Othman

Main category: cs.CV

TL;DR: A smart entryway system combining facial recognition and passcode verification for two-factor authentication, with mask detection and remote control via Telegram on Raspberry Pi.

DetailsMotivation: Address the lack of IoT development in face mask detection during COVID-19 pandemic and enhance smart entryway security with automated alerts and surveillance.

Method: Uses Local Binary Patterns Histograms (LBPH) for full face recognition and modified LBPH algorithm for occluded face detection on Raspberry Pi platform with Telegram integration for remote control.

Result: Achieved 70% accuracy, 80% precision, and 83.26% recall across tested users, successfully automating door control, user registration, and owner notifications.

Conclusion: The system effectively performs face recognition and mask detection with high user acceptance, demonstrating practical applicability for smart entryway security systems.

Abstract: Face mask detection has become increasingly important recently, particularly during the COVID-19 pandemic. Many face detection models have been developed in smart entryways using IoT. However, there is a lack of IoT development on face mask detection. This paper proposes a two-factor authentication system for smart entryway access control using facial recognition and passcode verification, together with an automation process that alerts the owner and activates the surveillance system when a stranger is detected and lets the owner control the system remotely via Telegram on a Raspberry Pi platform. The system employs the Local Binary Patterns Histograms (LBPH) algorithm for full-face recognition and a modified LBPH algorithm for occluded face detection. On average, the system achieved an Accuracy of approximately 70%, a Precision of approximately 80%, and a Recall of approximately 83.26% across all tested users. The results indicate that the system is capable of conducting face recognition and mask detection, automating remote-control operations to register users, locking or unlocking the door, and notifying the owner. In the user acceptance test, the sample participants expressed high acceptance of the system for future use.
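
For reference, the standard LBPH workflow the paper builds on is available in OpenCV's contrib module. The sketch below shows plain LBPH enrollment and identification with a distance threshold used to flag strangers; it does not include the paper's modification for occluded faces, and the threshold value is an assumption.

```python
import cv2
import numpy as np

# The LBPH recognizer ships with the opencv-contrib-python package (cv2.face module).
recognizer = cv2.face.LBPHFaceRecognizer_create()

def enroll(face_images, user_ids):
    """face_images: list of grayscale uint8 face crops; user_ids: matching int labels."""
    recognizer.train(face_images, np.array(user_ids))

def identify(gray_face, max_distance=70.0):
    """Return the matched user ID, or None when the LBPH distance is too large
    (treated as a stranger: fall back to the passcode factor and notify the owner)."""
    label, distance = recognizer.predict(gray_face)
    return label if distance <= max_distance else None
```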

[112] TalkVid: A Large-Scale Diversified Dataset for Audio-Driven Talking Head Synthesis

Shunian Chen, Hejin Huang, Yexin Liu, Zihan Ye, Pengcheng Chen, Chenghao Zhu, Michael Guan, Rongsheng Wang, Junying Chen, Guanbin Li, Ser-Nam Lim, Harry Yang, Benyou Wang

Main category: cs.CV

TL;DR: TalkVid introduces a large-scale diverse dataset for audio-driven talking head synthesis to address generalization gaps in current models across different ethnicities, languages, and age groups.

DetailsMotivation: Current state-of-the-art talking head synthesis models fail to generalize across the full spectrum of human diversity due to limitations in training data scale, quality, and diversity.

Method: Created TalkVid dataset with 1244 hours of video from 7729 unique speakers using a multi-stage automated pipeline that filters for motion stability, aesthetic quality, and facial detail. Also developed TalkVid-Bench evaluation set with 500 clips balanced across demographic and linguistic axes.

Result: Models trained on TalkVid outperform counterparts trained on previous datasets and show superior cross-dataset generalization. Analysis reveals performance disparities across subgroups that traditional aggregate metrics miss.

Conclusion: TalkVid addresses critical data limitations in talking head synthesis and provides necessary tools for evaluating model performance across diverse demographic groups, enabling more inclusive and generalizable AI systems.

Abstract: Audio-driven talking head synthesis has achieved remarkable photorealism, yet state-of-the-art (SOTA) models exhibit a critical failure: they lack generalization to the full spectrum of human diversity in ethnicity, language, and age groups. We argue that this generalization gap is a direct symptom of limitations in existing training data, which lack the necessary scale, quality, and diversity. To address this challenge, we introduce TalkVid, a new large-scale, high-quality, and diverse dataset containing 1244 hours of video from 7729 unique speakers. TalkVid is curated through a principled, multi-stage automated pipeline that rigorously filters for motion stability, aesthetic quality, and facial detail, and is validated against human judgments to ensure its reliability. Furthermore, we construct and release TalkVid-Bench, a stratified evaluation set of 500 clips meticulously balanced across key demographic and linguistic axes. Our experiments demonstrate that a model trained on TalkVid outperforms counterparts trained on previous datasets, exhibiting superior cross-dataset generalization. Crucially, our analysis on TalkVid-Bench reveals performance disparities across subgroups that are obscured by traditional aggregate metrics, underscoring its necessity for future research. Code and data can be found in https://github.com/FreedomIntelligence/TalkVid

[113] RCGNet: RGB-based Category-Level 6D Object Pose Estimation with Geometric Guidance

Sheng Yu, Di-Hua Zhai, Yuanqing Xia

Main category: cs.CV

TL;DR: A novel RGB-only category-level object pose estimation method using transformer networks to predict geometric features and RANSAC-PnP for pose computation, achieving superior accuracy without depth data.

DetailsMotivation: Current RGB-D methods struggle in scenes lacking depth information, creating a need for accurate pose estimation using only RGB images in real-world scenarios.

Method: Transformer-based neural network for geometric feature prediction and fusion, geometric feature-guided algorithm for faithful geometry capture, and RANSAC-PnP algorithm for pose computation handling variable object scales.

Result: Highly efficient and achieves superior accuracy compared to previous RGB-based methods on benchmark datasets.

Conclusion: Provides a new perspective for advancing category-level object pose estimation using only RGB images, demonstrating strong performance without depth data requirements.

Abstract: While most current RGB-D-based category-level object pose estimation methods achieve strong performance, they face significant challenges in scenes lacking depth information. In this paper, we propose a novel category-level object pose estimation approach that relies solely on RGB images. This method enables accurate pose estimation in real-world scenarios without the need for depth data. Specifically, we design a transformer-based neural network for category-level object pose estimation, where the transformer is employed to predict and fuse the geometric features of the target object. To ensure that these predicted geometric features faithfully capture the object’s geometry, we introduce a geometric feature-guided algorithm, which enhances the network’s ability to effectively represent the object’s geometric information. Finally, we utilize the RANSAC-PnP algorithm to compute the object’s pose, addressing the challenges associated with variable object scales in pose estimation. Experimental results on benchmark datasets demonstrate that our approach is not only highly efficient but also achieves superior accuracy compared to previous RGB-based methods. These promising results offer a new perspective for advancing category-level object pose estimation using RGB images.
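
The final pose-recovery step relies on a standard routine. Below is a minimal sketch using OpenCV's `solvePnPRansac`, assuming the network has already produced 2D-3D correspondences and the camera intrinsics are known; the RANSAC parameter values are illustrative.

```python
import cv2
import numpy as np

def pose_from_correspondences(object_pts, image_pts, K, dist_coeffs=None):
    """Recover a 6D pose from predicted 2D-3D correspondences with RANSAC-PnP.
    object_pts: (N, 3) model-frame points, image_pts: (N, 2) pixel locations,
    K: 3x3 camera intrinsics. Returns a rotation matrix and translation vector."""
    dist_coeffs = np.zeros(5) if dist_coeffs is None else dist_coeffs
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        object_pts.astype(np.float32), image_pts.astype(np.float32),
        K.astype(np.float32), dist_coeffs,
        iterationsCount=200, reprojectionError=3.0)
    if not ok:
        raise RuntimeError("RANSAC-PnP did not converge")
    R, _ = cv2.Rodrigues(rvec)                # rotation vector -> 3x3 matrix
    return R, tvec.ravel()
```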

[114] DiffIER: Optimizing Diffusion Models with Iterative Error Reduction

Ao Chen, Lihe Ding, Tianfan Xue

Main category: cs.CV

TL;DR: This paper identifies a “training-inference gap” in diffusion models that causes sensitivity to guidance weight selection and proposes DiffIER, an optimization-based method to reduce accumulated error during inference for improved conditional generation quality.

DetailsMotivation: Classifier-Free Guidance (CFG) in diffusion models shows sensitivity to guidance weight selection, and the authors identify a critical "training-inference gap" that undermines conditional generation performance and makes outputs highly sensitive to guidance weight choices.

Method: The authors propose DiffIER, an optimization-based method that performs iterative error minimization at each step during inference to reduce accumulated error. This plug-and-play framework optimizes errors at every inference step to enhance generation quality.

Result: Empirical results show that DiffIER outperforms baseline approaches in conditional generation tasks and achieves consistent success across text-to-image generation, image super-resolution, and text-to-speech generation.

Conclusion: The proposed method effectively mitigates the training-inference gap in diffusion models, demonstrating versatility and potential for broad applications in various conditional generation domains through its optimization-based approach to error reduction during inference.

Abstract: Diffusion models have demonstrated remarkable capabilities in generating high-quality samples and enhancing performance across diverse domains through Classifier-Free Guidance (CFG). However, the quality of generated samples is highly sensitive to the selection of the guidance weight. In this work, we identify a critical “training-inference gap” and we argue that it is the presence of this gap that undermines the performance of conditional generation and renders outputs highly sensitive to the guidance weight. We quantify this gap by measuring the accumulated error during the inference stage and establish a correlation between the selection of guidance weight and minimizing this gap. Furthermore, to mitigate this gap, we propose DiffIER, an optimization-based method for high-quality generation. We demonstrate that the accumulated error can be effectively reduced by an iterative error minimization at each step during inference. By introducing this novel plug-and-play optimization framework, we enable the optimization of errors at every single inference step and enhance generation quality. Empirical results demonstrate that our proposed method outperforms baseline approaches in conditional generation tasks. Furthermore, the method achieves consistent success in text-to-image generation, image super-resolution, and text-to-speech generation, underscoring its versatility and potential for broad applications in future research.

[115] OmniTry: Virtual Try-On Anything without Masks

Yutong Feng, Linlin Zhang, Hengyuan Cao, Yiming Chen, Xiaoduan Feng, Jian Cao, Yuxiong Wu, Bin Wang

Main category: cs.CV

TL;DR: OmniTry is a unified virtual try-on framework that extends beyond clothes to any wearable objects like jewelry and accessories, using a two-stage training approach with mask-free localization and appearance consistency transfer.

DetailsMotivation: Existing VTON works focus mainly on clothes, but practical applications require trying on various wearable objects. Data curation for paired images is challenging when extending to diverse object types.

Method: Two-stage pipeline: 1) Use unpaired images to train mask-free localization by repurposing inpainting models, 2) Fine-tune with few paired samples for object appearance consistency transfer.

Result: OmniTry outperforms existing methods on both object localization and ID-preservation across 12 classes of wearable objects, with quick convergence even with few paired samples.

Conclusion: The framework successfully extends VTON to various wearable objects with mask-free setting, showing practical applicability and superior performance compared to existing methods.

Abstract: Virtual Try-ON (VTON) is a practical and widely-applied task, for which most of existing works focus on clothes. This paper presents OmniTry, a unified framework that extends VTON beyond garment to encompass any wearable objects, e.g., jewelries and accessories, with mask-free setting for more practical application. When extending to various types of objects, data curation is challenging for obtaining paired images, i.e., the object image and the corresponding try-on result. To tackle this problem, we propose a two-staged pipeline: For the first stage, we leverage large-scale unpaired images, i.e., portraits with any wearable items, to train the model for mask-free localization. Specifically, we repurpose the inpainting model to automatically draw objects in suitable positions given an empty mask. For the second stage, the model is further fine-tuned with paired images to transfer the consistency of object appearance. We observed that the model after the first stage shows quick convergence even with few paired samples. OmniTry is evaluated on a comprehensive benchmark consisting of 12 common classes of wearable objects, with both in-shop and in-the-wild images. Experimental results suggest that OmniTry shows better performance on both object localization and ID-preservation compared with existing methods. The code, model weights, and evaluation benchmark of OmniTry will be made publicly available at https://omnitry.github.io/.

[116] DeH4R: A Decoupled and Hybrid Method for Road Network Graph Extraction

Dengxian Gong, Shunping Ji

Main category: cs.CV

TL;DR: DeH4R is a hybrid model for road network extraction that combines graph-generating efficiency with graph-growing dynamics, achieving state-of-the-art performance with faster inference speed.

DetailsMotivation: Existing methods for road network extraction have limitations: segmentation-based approaches struggle with topology fidelity after vectorization, graph-growing methods are computationally expensive, and graph-generating methods lack dynamic vertex insertion capabilities.

Method: The model decouples the task into four components: candidate vertex detection, adjacent vertex prediction, initial graph construction, and graph expansion. This hybrid approach enables dynamic vertex/edge insertions while maintaining fast inference.

Result: DeH4R outperforms prior SOTA method RNGDet++ by 4.62 APLS and 10.18 IoU on CityScale benchmark, while being approximately 10x faster. It demonstrates superior performance on both CityScale and SpaceNet benchmarks.

Conclusion: The proposed hybrid architecture successfully addresses key challenges in road network extraction by combining efficiency and dynamic capabilities, achieving both high topology fidelity and spatial consistency with significantly improved speed.

Abstract: The automated extraction of complete and precise road network graphs from remote sensing imagery remains a critical challenge in geospatial computer vision. Segmentation-based approaches, while effective in pixel-level recognition, struggle to maintain topology fidelity after vectorization postprocessing. Graph-growing methods build more topologically faithful graphs but suffer from computationally prohibitive iterative ROI cropping. Graph-generating methods first predict global static candidate road network vertices, and then infer possible edges between vertices. They achieve fast topology-aware inference, but limit the dynamic insertion of vertices. To address these challenges, we propose DeH4R, a novel hybrid model that combines graph-generating efficiency and graph-growing dynamics. This is achieved by decoupling the task into candidate vertex detection, adjacent vertex prediction, initial graph construction, and graph expansion. This architectural innovation enables dynamic vertex (edge) insertions while retaining fast inference speed and enhancing both topology fidelity and spatial consistency. Comprehensive evaluations on CityScale and SpaceNet benchmarks demonstrate state-of-the-art (SOTA) performance. DeH4R outperforms the prior SOTA graph-growing method RNGDet++ by 4.62 APLS and 10.18 IoU on CityScale, while being approximately 10× faster. The code will be made publicly available at https://github.com/7777777FAN/DeH4R.
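
The initial-graph-construction stage can be pictured as thresholding predicted pairwise adjacency scores over detected candidate vertices. A toy sketch using NetworkX, under the simplifying assumption of a dense score matrix (the paper's actual edge inference and expansion stages are more involved):

```python
import networkx as nx

def build_initial_graph(vertices, adjacency_scores, edge_threshold=0.5):
    """vertices: list of (x, y) candidate road points.
    adjacency_scores[i][j]: predicted probability that i and j are directly connected.
    Returns an undirected graph to be refined by the subsequent expansion stage."""
    g = nx.Graph()
    for i, (x, y) in enumerate(vertices):
        g.add_node(i, pos=(x, y))
    n = len(vertices)
    for i in range(n):
        for j in range(i + 1, n):
            if adjacency_scores[i][j] >= edge_threshold:
                g.add_edge(i, j, score=float(adjacency_scores[i][j]))
    return g
```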

[117] HumanPCR: Probing MLLM Capabilities in Diverse Human-Centric Scenes

Keliang Li, Hongze Shen, Hao Shi, Ruibing Hou, Hong Chang, Jie Huang, Chenghao Jia, Wen Wang, Yiling Wu, Dongmei Jiang, Shiguang Shan, Xilin Chen

Main category: cs.CV

TL;DR: HumanPCR is a comprehensive evaluation suite for multimodal models that assesses human-centric visual understanding across three hierarchical levels: Perception, Comprehension, and Reasoning, revealing significant challenges in current models.

DetailsMotivation: The rapid progress in multimodal models demands human-comparable performance across diverse environments, particularly in understanding human-related visual contexts, which existing benchmarks often overlook.

Method: The authors created HumanPCR with over 6,000 human-verified multiple choice questions across 9 dimensions for Perception and Comprehension levels, plus a manually curated video reasoning test that requires integrating multiple visual evidences and proactive context extraction.

Result: Evaluation of over 30 state-of-the-art models shows significant challenges in human-centric visual understanding, especially in detailed space perception, temporal understanding, and mind modeling. Models struggle with proactive visual evidence extraction and rely too heavily on query-guided retrieval.

Conclusion: Current multimodal models face substantial limitations in human-centric visual understanding, and even advanced techniques provide only limited improvements. HumanPCR provides a valuable benchmark for advancing development and evaluation of human-centric multimodal applications.

Abstract: The aspiration for artificial general intelligence, fueled by the rapid progress of multimodal models, demands human-comparable performance across diverse environments. We propose HumanPCR, an evaluation suite for probing MLLMs’ capacity for human-related visual contexts across three hierarchical levels: Perception, Comprehension, and Reasoning (denoted by Human-P, Human-C, and Human-R, respectively). Human-P and Human-C feature over 6,000 human-verified multiple choice questions, assessing massive tasks of 9 dimensions, including but not limited to essential skills frequently overlooked by existing benchmarks. Human-R offers a challenging manually curated video reasoning test that requires integrating multiple visual evidences, proactively extracting context beyond question cues, and applying human-like expertise. Each question includes human-annotated Chain-of-Thought (CoT) rationales with key visual evidence to support further research. Extensive evaluations on over 30 state-of-the-art models exhibit significant challenges in human-centric visual understanding, particularly in tasks involving detailed space perception, temporal understanding, and mind modeling. Moreover, analysis of Human-R reveals the struggle of models in extracting essential proactive visual evidence from diverse human scenes and their faulty reliance on query-guided retrieval. Even advanced techniques like scaling visual contexts and test-time thinking yield only limited benefits. We hope HumanPCR and our findings will advance the development, evaluation, and human-centric application of multimodal models.

[118] Diversity-enhanced Collaborative Mamba for Semi-supervised Medical Image Segmentation

Shumeng Li, Jian Zhang, Lei Qi, Luping Zhou, Yinghuan Shi, Yang Gao

Main category: cs.CV

TL;DR: DCMamba is a novel semi-supervised medical image segmentation framework that leverages Mamba’s state space modeling to handle long-range dependencies and enhances diversity through data, network, and feature perspectives.

DetailsMotivation: High-quality annotated medical image data is expensive and time-consuming to acquire. Semi-supervised methods can reduce this burden by using unlabeled data, and Mamba models show promise for handling long-range dependencies in segmentation tasks.

Method: Proposes Diversity-enhanced Collaborative Mamba (DCMamba) with three key components: 1) patch-level weak-strong mixing augmentation, 2) diverse-scan collaboration module leveraging different scanning directions, and 3) uncertainty-weighted contrastive learning for feature diversity.

Result: Significantly outperforms other semi-supervised methods, achieving 6.69% improvement over the latest SSM-based method on Synapse dataset with only 20% labeled data.

Conclusion: DCMamba effectively combines Mamba’s long-range dependency handling with diversity enhancement strategies across multiple perspectives, demonstrating state-of-the-art performance in semi-supervised medical image segmentation.

Abstract: Acquiring high-quality annotated data for medical image segmentation is tedious and costly. Semi-supervised segmentation techniques alleviate this burden by leveraging unlabeled data to generate pseudo labels. Recently, advanced state space models, represented by Mamba, have shown efficient handling of long-range dependencies. This drives us to explore their potential in semi-supervised medical image segmentation. In this paper, we propose a novel Diversity-enhanced Collaborative Mamba framework (namely DCMamba) for semi-supervised medical image segmentation, which explores and utilizes the diversity from data, network, and feature perspectives. Firstly, from the data perspective, we develop patch-level weak-strong mixing augmentation with Mamba’s scanning modeling characteristics. Moreover, from the network perspective, we introduce a diverse-scan collaboration module, which could benefit from the prediction discrepancies arising from different scanning directions. Furthermore, from the feature perspective, we adopt an uncertainty-weighted contrastive learning mechanism to enhance the diversity of feature representation. Experiments demonstrate that our DCMamba significantly outperforms other semi-supervised medical image segmentation methods, e.g., outperforming the latest SSM-based method by 6.69% on the Synapse dataset with 20% labeled data.
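
As a rough illustration of the data-perspective component, the sketch below implements a generic patch-level weak-strong mixing augmentation: patches of a weakly augmented view are randomly replaced by the corresponding patches of a strongly augmented view. The patch size and mixing probability are assumptions, and the routine ignores Mamba's scan ordering, which the paper's version accounts for.

```python
import torch

def patch_weak_strong_mix(weak: torch.Tensor, strong: torch.Tensor,
                          patch: int = 16, p: float = 0.5) -> torch.Tensor:
    """Replace random non-overlapping patches of the weakly augmented view with
    the corresponding patches of the strongly augmented view.
    weak, strong: (C, H, W) tensors of identical size, H and W divisible by `patch`."""
    _, h, w = weak.shape
    mixed = weak.clone()
    take_strong = torch.rand(h // patch, w // patch) < p
    for i in range(h // patch):
        for j in range(w // patch):
            if take_strong[i, j]:
                mixed[:, i * patch:(i + 1) * patch, j * patch:(j + 1) * patch] = \
                    strong[:, i * patch:(i + 1) * patch, j * patch:(j + 1) * patch]
    return mixed
```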

[119] Hierarchical Vision-Language Retrieval of Educational Metaverse Content in Agriculture

Ali Abdari, Alex Falcon, Giuseppe Serra

Main category: cs.CV

TL;DR: New dataset of 457 agricultural virtual museums with text descriptions and hierarchical vision-language model for Metaverse content retrieval

DetailsMotivation: Organizing educational content in Metaverse for easier learning, but current datasets are too small and search remains challenging

Method: Created AgriMuseums dataset and proposed hierarchical vision-language model for natural language query-based retrieval

Result: Achieved up to about 62% R@1 and 78% MRR, and improved existing benchmarks by up to 6% R@1 and 11% MRR

Conclusion: Effective approach for Metaverse educational content organization and retrieval with validated design choices

Abstract: Every day, a large amount of educational content is uploaded online across different areas, including agriculture and gardening. When these videos or materials are grouped meaningfully, they can make learning easier and more effective. One promising way to organize and enrich such content is through the Metaverse, which allows users to explore educational experiences in an interactive and immersive environment. However, searching for relevant Metaverse scenarios and finding those matching users’ interests remains a challenging task. A first step in this direction was taken recently, but existing datasets are small and not sufficient for training advanced models. In this work, we make two main contributions: first, we introduce a new dataset containing 457 agricultural-themed virtual museums (AgriMuseums), each enriched with textual descriptions; and second, we propose a hierarchical vision-language model to represent and retrieve relevant AgriMuseums using natural language queries. In our experimental setting, the proposed method achieves up to about 62% R@1 and 78% MRR, confirming its effectiveness, and it also leads to improvements on existing benchmarks by up to 6% R@1 and 11% MRR. Moreover, an extensive evaluation validates our design choices. Code and dataset are available at https://github.com/aliabdari/Agricultural_Metaverse_Retrieval.
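
The reported retrieval metrics are standard and easy to compute from a query-by-candidate similarity matrix. A small sketch, assuming the correct AgriMuseum for query q sits at index q (diagonal ground truth):

```python
import numpy as np

def retrieval_metrics(similarity: np.ndarray) -> dict:
    """similarity[q, d]: score assigned to candidate d for query q.
    The relevant candidate for query q is assumed to be candidate q."""
    order = np.argsort(-similarity, axis=1)                  # best-scoring candidate first
    ranks = np.array([int(np.where(order[q] == q)[0][0]) + 1
                      for q in range(similarity.shape[0])])
    return {"R@1": float(np.mean(ranks == 1)),
            "MRR": float(np.mean(1.0 / ranks))}
```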

[120] Enhancing Targeted Adversarial Attacks on Large Vision-Language Models through Intermediate Projector Guidance

Yiming Cao, Yanjie Li, Kaisheng Liang, Yuni Lai, Bin Xiao

Main category: cs.CV

TL;DR: IPGA is a targeted adversarial attack method that uses intermediate projector guidance (Q-Former) to achieve fine-grained image manipulation while preserving background content, outperforming existing methods in both global captioning and fine-grained VQA tasks.

DetailsMotivation: Current adversarial attack methods collapse rich visual semantics into single global vectors, limiting attack granularity and failing to disrupt the full vision-language alignment pipeline by overlooking critical projector modules in VLMs.

Method: Proposes Intermediate Projector Guided Attack (IPGA) that attacks using Q-Former’s intermediate stage to transform global embeddings into fine-grained visual tokens, plus Residual Query Alignment (RQA) to preserve unrelated visual content.

Result: Extensive experiments show IPGA consistently outperforms existing methods in both standard global image captioning and fine-grained visual question-answering tasks in black-box environments, with successful transfer to commercial VLMs like Google Gemini and OpenAI GPT.

Conclusion: IPGA enables more precise control over adversarial perturbations by operating on semantically meaningful visual tokens rather than single global representations, improving both attack effectiveness and transferability across diverse VLMs.

Abstract: Targeted adversarial attacks are essential for proactively identifying security flaws in Vision-Language Models before real-world deployment. However, current methods perturb images to maximize global similarity with the target text or reference image at the encoder level, collapsing rich visual semantics into a single global vector. This limits attack granularity, hindering fine-grained manipulations such as modifying a car while preserving its background. Furthermore, these methods largely overlook the projector module, a critical semantic bridge between the visual encoder and the language model in VLMs, thereby failing to disrupt the full vision-language alignment pipeline within VLMs and limiting attack effectiveness. To address these issues, we propose the Intermediate Projector Guided Attack (IPGA), the first method to attack using the intermediate stage of the projector module, specifically the widely adopted Q-Former, which transforms global image embeddings into fine-grained visual features. This enables more precise control over adversarial perturbations by operating on semantically meaningful visual tokens rather than a single global representation. Specifically, IPGA leverages the Q-Former pretrained solely on the first vision-language alignment stage, without LLM fine-tuning, which improves both attack effectiveness and transferability across diverse VLMs. Furthermore, we propose Residual Query Alignment (RQA) to preserve unrelated visual content, thereby yielding more controlled and precise adversarial manipulations. Extensive experiments show that our attack method consistently outperforms existing methods in both standard global image captioning tasks and fine-grained visual question-answering tasks in a black-box environment. Additionally, IPGA successfully transfers to multiple commercial VLMs, including Google Gemini and OpenAI GPT.
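
The core optimization can be viewed as a PGD-style attack in an intermediate feature space rather than on a global embedding. A hedged sketch, where `extract_tokens` stands in for the frozen image-encoder-plus-Q-Former pathway and the budget and step sizes are illustrative:

```python
import torch
import torch.nn.functional as F

def projected_token_attack(image, target_image, extract_tokens,
                           eps=8 / 255, step=1 / 255, iters=100):
    """PGD-style sketch: match the adversarial image's intermediate visual tokens
    (e.g. Q-Former outputs) to those of a target image, under an L-inf budget.
    `extract_tokens` is an assumed stand-in for the frozen projector pathway."""
    with torch.no_grad():
        target_tokens = extract_tokens(target_image)
    adv = image.clone().detach()
    for _ in range(iters):
        adv.requires_grad_(True)
        loss = F.mse_loss(extract_tokens(adv), target_tokens)
        grad, = torch.autograd.grad(loss, adv)
        with torch.no_grad():
            adv = adv - step * grad.sign()                    # descend on token distance
            adv = image + (adv - image).clamp(-eps, eps)      # project back into the L-inf ball
            adv = adv.clamp(0.0, 1.0)
        adv = adv.detach()
    return adv
```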

[121] Mitigating Cross-Image Information Leakage in LVLMs for Multi-Image Tasks

Yeji Park, Minyoung Lee, Sanghyuk Chun, Junsuk Choe

Main category: cs.CV

TL;DR: FOCUS is a training-free decoding strategy that mitigates cross-image information leakage in Large Vision-Language Models by sequentially masking images with noise and aggregating logits to improve multi-image reasoning performance.

DetailsMotivation: LVLMs show strong performance on single-image tasks but suffer from significant performance degradation when handling multi-image inputs due to visual cues from different images becoming entangled (cross-image information leakage).

Method: FOCUS sequentially masks all but one image with random noise during inference, guides the model to focus on single clean images, aggregates logits across all target images, and contrastively refines them using a noise-only reference input to suppress leakage.

Result: FOCUS consistently improves performance across four multi-image benchmarks and diverse LVLM families, demonstrating enhanced multi-image reasoning capabilities.

Conclusion: FOCUS provides a general and practical solution for improving multi-image reasoning in LVLMs without requiring additional training or architectural modifications.

Abstract: Large Vision-Language Models (LVLMs) demonstrate strong performance on single-image tasks. However, we observe that their performance degrades significantly when handling multi-image inputs. This occurs because visual cues from different images become entangled in the model’s output. We refer to this phenomenon as cross-image information leakage. To address this issue, we propose FOCUS, a training-free and architecture-agnostic decoding strategy that mitigates cross-image information leakage during inference. FOCUS sequentially masks all but one image with random noise, guiding the model to focus on the single clean image. We repeat this process across all target images to obtain logits under partially masked contexts. These logits are aggregated and then contrastively refined using a noise-only reference input, which suppresses the leakage and yields more accurate outputs. FOCUS consistently improves performance across four multi-image benchmarks and diverse LVLM families. This demonstrates that FOCUS offers a general and practical solution for enhancing multi-image reasoning without additional training or architectural modifications.
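
The decoding strategy is concrete enough to outline in a few lines. In the sketch below, `model(images=..., text=...)` is a hypothetical wrapper returning next-token logits for an LVLM; the masking, aggregation, and noise-reference contrast follow the description above, while the exact contrast weighting is an assumption.

```python
import torch

def focus_logits(model, images, text_inputs, gamma=1.0):
    """One decoding step in the spirit of FOCUS (illustrative interface only).

    For each of the N input images, all other images are replaced with random
    noise and the model is queried once; the per-context logits are averaged and
    then contrasted against a noise-only reference to suppress cross-image leakage."""
    per_image = []
    for i in range(len(images)):
        masked = [img if j == i else torch.rand_like(img) for j, img in enumerate(images)]
        per_image.append(model(images=masked, text=text_inputs))
    aggregated = torch.stack(per_image, dim=0).mean(dim=0)

    noise_only = model(images=[torch.rand_like(img) for img in images], text=text_inputs)
    return aggregated + gamma * (aggregated - noise_only)     # contrastive refinement
```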

[122] MR6D: Benchmarking 6D Pose Estimation for Mobile Robots

Anas Gouda, Shrutarv Awasthi, Christian Blesing, Lokeshwaran Manohar, Frank Hoffmann, Alice Kirchheim

Main category: cs.CV

TL;DR: MR6D is a new dataset for 6D pose estimation specifically designed for mobile robotics in industrial environments, addressing limitations of existing household-focused datasets.

DetailsMotivation: Existing 6D pose estimation datasets focus on small household objects for robot arms, but mobile robots face different challenges like long-range perception, larger objects, heavy occlusion, and diverse camera perspectives that current datasets don't address.

Method: The authors created MR6D dataset with 92 real-world scenes featuring 16 unique objects across static and dynamic interactions, capturing mobile robotics challenges including distant viewpoints, varied configurations, larger object sizes, and complex occlusion patterns.

Result: Initial experiments show current 6D pose estimation pipelines underperform in mobile robotics settings, with 2D segmentation being an additional challenge. The dataset establishes a benchmark for mobile robotics pose estimation.

Conclusion: MR6D provides a foundation for developing and evaluating 6D pose estimation methods specifically tailored to mobile robotics demands in industrial environments, addressing the gap left by household-focused datasets.

Abstract: Existing 6D pose estimation datasets primarily focus on small household objects typically handled by robot arm manipulators, limiting their relevance to mobile robotics. Mobile platforms often operate without manipulators, interact with larger objects, and face challenges such as long-range perception, heavy self-occlusion, and diverse camera perspectives. While recent models generalize well to unseen objects, evaluations remain confined to household-like settings that overlook these factors. We introduce MR6D, a dataset designed for 6D pose estimation for mobile robots in industrial environments. It includes 92 real-world scenes featuring 16 unique objects across static and dynamic interactions. MR6D captures the challenges specific to mobile platforms, including distant viewpoints, varied object configurations, larger object sizes, and complex occlusion/self-occlusion patterns. Initial experiments reveal that current 6D pipelines underperform in these settings, with 2D segmentation being another hurdle. MR6D establishes a foundation for developing and evaluating pose estimation methods tailored to the demands of mobile robotics. The dataset is available at https://huggingface.co/datasets/anas-gouda/mr6d.

[123] Shape-from-Template with Generalised Camera

Agniva Sengupta, Stefan Zachow

Main category: cs.CV

TL;DR: Novel method for non-rigid 3D shape registration to 2D keypoints using multiple cameras via generalized camera model, with three approaches: known 3D point direction, unknown 3D point with known orientation, and silhouette-based registration.

DetailsMotivation: Extend Shape-from-Template (SfT) from single images to multi-camera setups for improved accuracy in applications like medical imaging and hand-held camera registration, leveraging mutual constraints between multiple views.

Method: Three approaches using generalized camera model: 1) keypoints on direction vectors from known 3D points, 2) keypoints from unknown 3D points with known orientation, 3) silhouette-based approach. Correspondence methods use convex programming, silhouette method uses iterative refinement.

Result: Demonstrated accurate non-rigid 3D shape registration on both synthetic and real data, showing improved reconstruction accuracy by utilizing multi-view constraints.

Conclusion: First comprehensive solution set for SfT with generalized cameras, enabling robust non-rigid registration across various camera configurations and deformation scenarios with practical applications.

Abstract: This article presents a new method for non-rigidly registering a 3D shape to 2D keypoints observed by a constellation of multiple cameras. Non-rigid registration of a 3D shape to observed 2D keypoints, i.e., Shape-from-Template (SfT), has been widely studied using single images, but SfT with information from multiple cameras jointly opens new directions for extending the scope of known use-cases such as 3D shape registration in medical imaging and registration from hand-held cameras, to name a few. We represent such a multi-camera setup with the generalised camera model; therefore any collection of perspective or orthographic cameras observing any deforming object can be registered. We propose multiple approaches for such SfT: the first approach where the corresponded keypoints lie on a direction vector from a known 3D point in space, the second approach where the corresponded keypoints lie on a direction vector from an unknown 3D point in space but with known orientation w.r.t. some local reference frame, and a third approach where, apart from correspondences, the silhouette of the imaged object is also known. Together, these form the first set of solutions to the SfT problem with generalised cameras. The key idea behind SfT with generalised cameras is the improved reconstruction accuracy from estimating deformed shape while utilising the additional information from the mutual constraints between multiple views of a deformed object. The correspondence-based approaches are solved with convex programming while the silhouette-based approach is an iterative refinement of the results from the convex solutions. We demonstrate the accuracy of our proposed methods on synthetic and real data.

[124] VisionLaw: Inferring Interpretable Intrinsic Dynamics from Visual Observations via Bilevel Optimization

Jiajing Lin, Shu Jiang, Qingyuan Zeng, Zhenzhong Wang, Min Jiang

Main category: cs.CV

TL;DR: VisionLaw is a bilevel optimization framework that uses LLMs as physics experts to generate interpretable constitutive laws from visual observations, addressing limitations of existing methods through decoupled evolution and vision-guided evaluation.

DetailsMotivation: Existing methods for inferring intrinsic dynamics from visual observations either rely on manually defined priors that don't generalize well, or use neural networks that lack interpretability and generalization capabilities.

Method: A bilevel optimization framework: upper level uses LLMs as physics experts to generate and revise constitutive laws with decoupling mechanism; lower level uses vision-guided simulation to evaluate consistency and guide evolution.

Result: Experiments on synthetic and real-world datasets show VisionLaw effectively infers interpretable intrinsic dynamics, significantly outperforms state-of-the-art methods, and exhibits strong generalization for interactive simulation in novel scenarios.

Conclusion: VisionLaw successfully addresses the challenges of interpretability and generalization in intrinsic dynamics inference by combining LLMs’ physics knowledge with visual simulation guidance, enabling physically plausible interactive simulation with 3D assets.

Abstract: The intrinsic dynamics of an object governs its physical behavior in the real world, playing a critical role in enabling physically plausible interactive simulation with 3D assets. Existing methods have attempted to infer the intrinsic dynamics of objects from visual observations, but generally face two major challenges: one line of work relies on manually defined constitutive priors, making it difficult to generalize to complex scenarios; the other models intrinsic dynamics using neural networks, resulting in limited interpretability and poor generalization. To address these challenges, we propose VisionLaw, a bilevel optimization framework that infers interpretable expressions of intrinsic dynamics from visual observations. At the upper level, we introduce an LLMs-driven decoupled constitutive evolution strategy, where LLMs are prompted as a knowledgeable physics expert to generate and revise constitutive laws, with a built-in decoupling mechanism that substantially reduces the search complexity of LLMs. At the lower level, we introduce a vision-guided constitutive evaluation mechanism, which utilizes visual simulation to evaluate the consistency between the generated constitutive law and the underlying intrinsic dynamics, thereby guiding the upper-level evolution. Experiments on both synthetic and real-world datasets demonstrate that VisionLaw can effectively infer interpretable intrinsic dynamics from visual observations. It significantly outperforms existing state-of-the-art methods and exhibits strong generalization for interactive simulation in novel scenarios.

[125] A Fully Transformer Based Multimodal Framework for Explainable Cancer Image Segmentation Using Radiology Reports

Enobong Adahada, Isabel Sassoon, Kate Hone, Yongmin Li

Main category: cs.CV

TL;DR: Med-CTX is a transformer-based multimodal framework that integrates clinical radiology reports with ultrasound images for explainable breast cancer segmentation, achieving state-of-the-art performance with 99% Dice score and providing uncertainty maps and diagnostic explanations.

DetailsMotivation: To improve both performance and interpretability in breast cancer ultrasound segmentation by integrating clinical radiology reports, enabling clinically grounded explanations and increasing confidence in computer-assisted diagnosis.

Method: Uses dual-branch visual encoder (ViT + Swin transformers) with uncertainty-aware fusion, encodes clinical language with BI-RADS semantics using BioClinicalBERT, and employs cross-modal attention to combine visual and textual features for segmentation and explanation generation.

Result: Achieves 99% Dice score and 95% IoU on BUS-BRA dataset, outperforming U-Net, ViT, and Swin baselines. Shows strong multimodal alignment (85% CLIP score) and improved confidence calibration (3.2% ECE). Ablation studies demonstrate clinical text’s critical role (-5.4% Dice decline without text).

Conclusion: Med-CTX sets a new standard for trustworthy multimodal medical architecture by simultaneously generating segmentation masks, uncertainty maps, and diagnostic rationales, with clinical text proving essential for both accuracy and explanation quality.

Abstract: We introduce Med-CTX, a fully transformer based multimodal framework for explainable breast cancer ultrasound segmentation. We integrate clinical radiology reports to boost both performance and interpretability. Med-CTX achieves exact lesion delineation by using a dual-branch visual encoder that combines ViT and Swin transformers, as well as uncertainty aware fusion. Clinical language structured with BI-RADS semantics is encoded by BioClinicalBERT and combined with visual features utilising cross-modal attention, allowing the model to provide clinically grounded, model generated explanations. Our methodology generates segmentation masks, uncertainty maps, and diagnostic rationales all at once, increasing confidence and transparency in computer assisted diagnosis. On the BUS-BRA dataset, Med-CTX achieves a Dice score of 99% and an IoU of 95%, beating existing baselines U-Net, ViT, and Swin. Clinical text plays a key role in segmentation accuracy and explanation quality, as evidenced by ablation studies that show a -5.4% decline in Dice score and -31% in CIDEr. Med-CTX achieves good multimodal alignment (CLIP score: 85%) and improved confidence calibration (ECE: 3.2%), setting a new bar for trustworthy, multimodal medical architecture.
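
The cross-modal fusion step can be illustrated with a standard attention module in which visual tokens attend to report tokens. A minimal sketch, assuming both streams have already been projected to a shared width; the paper's uncertainty-aware dual-branch fusion is not reproduced here.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Visual tokens attend to clinical-report tokens via standard cross-attention."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (B, Nv, dim) from the ViT/Swin branches, already projected;
        # text_tokens:   (B, Nt, dim) from the clinical-text encoder (e.g. BioClinicalBERT).
        fused, _ = self.attn(query=visual_tokens, key=text_tokens, value=text_tokens)
        return self.norm(visual_tokens + fused)               # residual connection + normalization
```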

[126] Timestep-Compressed Attack on Spiking Neural Networks through Timestep-Level Backpropagation

Donghwa Kang, Doohyun Kim, Sang-Ki Ko, Jinkyu Lee, Hyeongboo Baek, Brent ByungHoon Kang

Main category: cs.CV

TL;DR: TCA framework reduces attack latency in SNNs by 56-57% through timestep-level backpropagation and adversarial membrane potential reuse, maintaining comparable attack success rates.

DetailsMotivation: Current gradient-based adversarial attacks on SNNs suffer from high latency due to multi-timestep processing, making them impractical for real-time applications as they fail to exploit SNN-specific properties.

Method: Proposes Timestep-Compressed Attack (TCA) with two components: 1) Timestep-Level Backpropagation (TLBP) for per-timestep evaluation and early stopping, and 2) Adversarial Membrane Potential Reuse (A-MPR) to pre-calculate and reuse warm-up phase membrane potentials.

Result: TCA reduces required attack latency by up to 56.6% (white-box) and 57.1% (black-box) compared to SOTA methods while maintaining comparable attack success rates on VGG-11 and ResNet-17 with CIFAR-10/100 and CIFAR10-DVS datasets.

Conclusion: TCA successfully addresses the latency inefficiency in SNN adversarial attacks by leveraging SNN-specific temporal properties, making adversarial attacks more practical for real-time applications.

Abstract: State-of-the-art (SOTA) gradient-based adversarial attacks on spiking neural networks (SNNs), which largely rely on extending FGSM and PGD frameworks, face a critical limitation: substantial attack latency from multi-timestep processing, rendering them infeasible for practical real-time applications. This inefficiency stems from their design as direct extensions of ANN paradigms, which fail to exploit key SNN properties. In this paper, we propose the timestep-compressed attack (TCA), a novel framework that significantly reduces attack latency. TCA introduces two components founded on key insights into SNN behavior. First, timestep-level backpropagation (TLBP) is based on our finding that global temporal information in backpropagation to generate perturbations is not critical for an attack’s success, enabling per-timestep evaluation for early stopping. Second, adversarial membrane potential reuse (A-MPR) is motivated by the observation that initial timesteps are inefficiently spent accumulating membrane potential, a warm-up phase that can be pre-calculated and reused. Our experiments on VGG-11 and ResNet-17 with the CIFAR-10/100 and CIFAR10-DVS datasets show that TCA significantly reduces the required attack latency by up to 56.6% and 57.1% compared to SOTA methods in white-box and black-box settings, respectively, while maintaining a comparable attack success rate.

[127] Unsupervised Urban Tree Biodiversity Mapping from Street-Level Imagery Using Spatially-Aware Visual Clustering

Diaa Addeen Abuhani, Marco Seccaroni, Martina Mazzarello, Imran Zualkernan, Fabio Duarte, Carlo Ratti

Main category: cs.CV

TL;DR: Unsupervised AI framework using street imagery and spatial patterns to estimate urban tree biodiversity without labels, achieving high accuracy across multiple cities.

DetailsMotivation: Urban tree biodiversity is crucial for climate resilience but current methods (field inventories and supervised AI) are costly, time-consuming, and don't generalize well across regions.

Method: Unsupervised clustering framework that integrates visual embeddings from street-level imagery with spatial planting patterns to estimate biodiversity without requiring labeled data.

Result: Applied to eight North American cities, the method recovered genus-level diversity patterns with high fidelity, achieving low Wasserstein distances to ground truth for Shannon and Simpson indices while preserving spatial autocorrelation.

Conclusion: This scalable, fine-grained approach enables biodiversity mapping in cities lacking detailed inventories and supports continuous, low-cost monitoring for equitable greenery access and adaptive urban ecosystem management.

Abstract: Urban tree biodiversity is critical for climate resilience, ecological stability, and livability in cities, yet most municipalities lack detailed knowledge of their canopies. Field-based inventories provide reliable estimates of Shannon and Simpson diversity but are costly and time-consuming, while supervised AI methods require labeled data that often fail to generalize across regions. We introduce an unsupervised clustering framework that integrates visual embeddings from street-level imagery with spatial planting patterns to estimate biodiversity without labels. Applied to eight North American cities, the method recovers genus-level diversity patterns with high fidelity, achieving low Wasserstein distances to ground truth for Shannon and Simpson indices and preserving spatial autocorrelation. This scalable, fine-grained approach enables biodiversity mapping in cities lacking detailed inventories and offers a pathway for continuous, low-cost monitoring to support equitable access to greenery and adaptive management of urban ecosystems.
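
The biodiversity indices used for evaluation are simple functions of per-genus (or per-cluster) counts. A small sketch of the Shannon index and the Gini-Simpson form of the Simpson index:

```python
import numpy as np

def diversity_indices(genus_counts):
    """Shannon and (Gini-)Simpson diversity from a vector of per-genus counts."""
    counts = np.asarray(genus_counts, dtype=float)
    p = counts / counts.sum()
    p = p[p > 0]                                  # empty genera contribute nothing
    shannon = float(-np.sum(p * np.log(p)))
    simpson = float(1.0 - np.sum(p ** 2))
    return shannon, simpson

# Example: four genera observed along one street segment
# diversity_indices([120, 45, 30, 5]) -> (approx. 1.02, approx. 0.57)
```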

[128] Self-Aware Adaptive Alignment: Enabling Accurate Perception for Intelligent Transportation Systems

Tong Xiang, Hongxia Zhao, Fenghua Zhu, Yuanyuan Chen, Yisheng Lv

Main category: cs.CV

TL;DR: SA3 is a novel cross-domain object detection method that uses attention-based alignment and instance-to-image level adaptation to bridge domain gaps between source and target domains in intelligent transportation systems.

DetailsMotivation: Cross-domain detection in intelligent transportation faces significant challenges due to domain shifts between source and target environments, requiring effective adaptation mechanisms to maintain detection performance.

Method: Proposes Self-Aware Adaptive Alignment (SA3) with attention-based alignment module trained on both domains, channel importance re-weighting, region proposal network integration, and instance-to-image level alignment specific to target domain.

Result: Extensive experiments on cross-domain object detection benchmarks show SA3 achieves superior performance compared to previous state-of-the-art methods.

Conclusion: SA3 effectively addresses cross-domain detection challenges through its adaptive alignment mechanism, demonstrating significant improvements in intelligent transportation detection tasks.

Abstract: Achieving top-notch performance in Intelligent Transportation detection is a critical research area. However, many challenges still need to be addressed when it comes to detecting in a cross-domain scenario. In this paper, we propose a Self-Aware Adaptive Alignment (SA3), by leveraging an efficient alignment mechanism and recognition strategy. Our proposed method employs a specified attention-based alignment module trained on source and target domain datasets to guide the image-level features alignment process, enabling the local-global adaptive alignment between the source domain and target domain. Features from both domains, whose channel importance is re-weighted, are fed into the region proposal network, which facilitates the acquisition of salient region features. Also, we introduce an instance-to-image level alignment module specific to the target domain to adaptively mitigate the domain gap. To evaluate the proposed method, extensive experiments have been conducted on popular cross-domain object detection benchmarks. Experimental results show that SA3 achieves superior results to the previous state-of-the-art methods.

[129] SAGA: Learning Signal-Aligned Distributions for Improved Text-to-Image Generation

Paul Grimal, Michaël Soumm, Hervé Le Borgne, Olivier Ferret, Akihiro Sugimoto

Main category: cs.CV

TL;DR: A training-free method that learns high-success-rate distributions for text-to-image generation to ensure precise alignment with prompts, preventing missing elements and concept blending while supporting additional spatial conditioning.

DetailsMotivation: Current text-to-image models produce visually impressive results but often fail to precisely align with text prompts, leading to missing critical elements or unintended blending of distinct concepts.

Method: Proposes a novel approach that learns a high-success-rate distribution conditioned on target prompts, explicitly modeling signal components during denoising to provide fine-grained control. The training-free framework integrates with diffusion and flow matching architectures and supports additional conditioning like bounding boxes.

Result: Extensive experiments demonstrate that the approach outperforms current state-of-the-art methods in text-to-image alignment and faithfulness to prompts.

Conclusion: The proposed method effectively addresses text-to-image alignment issues, offering improved precision, reduced artifacts, and enhanced spatial control while maintaining compatibility with existing architectures.

Abstract: State-of-the-art text-to-image models produce visually impressive results but often struggle with precise alignment to text prompts, leading to missing critical elements or unintended blending of distinct concepts. We propose a novel approach that learns a high-success-rate distribution conditioned on a target prompt, ensuring that generated images faithfully reflect the corresponding prompts. Our method explicitly models the signal component during the denoising process, offering fine-grained control that mitigates over-optimization and out-of-distribution artifacts. Moreover, our framework is training-free and seamlessly integrates with both existing diffusion and flow matching architectures. It also supports additional conditioning modalities – such as bounding boxes – for enhanced spatial alignment. Extensive experiments demonstrate that our approach outperforms current state-of-the-art methods. The code is available at https://github.com/grimalPaul/gsn-factory.
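
For readers unfamiliar with the "signal component", the snippet below shows the standard diffusion identity that recovers a clean-image estimate from the current latent and the predicted noise; it is generic diffusion bookkeeping used for illustration, not SAGA's actual objective.

```python
import torch

def predicted_x0(x_t, eps_pred, alpha_bar_t):
    """Generic DDPM identity for the clean-image ('signal') estimate at step t:
    x0_hat = (x_t - sqrt(1 - alpha_bar_t) * eps_theta) / sqrt(alpha_bar_t).
    alpha_bar_t is a tensor broadcastable to x_t. Shown only to illustrate what
    a signal component is; SAGA's objective built on top of it is not reproduced.
    """
    return (x_t - torch.sqrt(1 - alpha_bar_t) * eps_pred) / torch.sqrt(alpha_bar_t)
```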

[130] RICO: Two Realistic Benchmarks and an In-Depth Analysis for Incremental Learning in Object Detection

Matthias Neuwirth-Trapp, Maarten Bieshaar, Danda Pani Paudel, Luc Van Gool

Main category: cs.CV

TL;DR: Two realistic incremental object detection benchmarks (RICO) introduced to address limitations of synthetic IL evaluations, showing current IL methods underperform compared to simple replay and individual training approaches.

DetailsMotivation: Existing incremental learning evaluations rely on synthetic benchmarks that obscure real-world performance, creating a need for realistic benchmarks that capture domain shifts, new classes, and diverse real-world conditions.

Method: Created two benchmarks: Domain RICO (fixed classes with domain shifts) and Expanding-Classes RICO (new domains and classes per step), built from 14 diverse datasets covering real/synthetic domains, varying conditions, camera sensors, and labeling policies.

Result: All IL methods underperformed in both adaptability and retention; replaying small amounts of previous data outperformed all IL methods, but individual training on the data remained superior.

Conclusion: Current IL methods show significant gaps, attributed to weak teachers in distillation, single models’ inability to manage diverse tasks, and insufficient plasticity, highlighting the need for improved IL approaches in real-world scenarios.

Abstract: Incremental Learning (IL) trains models sequentially on new data without full retraining, offering privacy, efficiency, and scalability. IL must balance adaptability to new data with retention of old knowledge. However, evaluations often rely on synthetic, simplified benchmarks, obscuring real-world IL performance. To address this, we introduce two Realistic Incremental Object Detection Benchmarks (RICO): Domain RICO (D-RICO) features domain shifts with a fixed class set, and Expanding-Classes RICO (EC-RICO) integrates new domains and classes per IL step. Built from 14 diverse datasets covering real and synthetic domains, varying conditions (e.g., weather, time of day), camera sensors, perspectives, and labeling policies, both benchmarks capture challenges absent in existing evaluations. Our experiments show that all IL methods underperform in adaptability and retention, while replaying a small amount of previous data already outperforms all methods. However, individual training on the data remains superior. We heuristically attribute this gap to weak teachers in distillation, single models’ inability to manage diverse tasks, and insufficient plasticity. Our code will be made publicly available.

[131] In-hoc Concept Representations to Regularise Deep Learning in Medical Imaging

Valentina Corbetta, Floris Six Dijkstra, Regina Beets-Tan, Hoel Kervadec, Kristoffer Wickstrøm, Wilson Silva

Main category: cs.CV

TL;DR: LCRReg is a novel regularization method that uses Latent Concept Representations to guide medical imaging models toward clinically meaningful features instead of spurious correlations, improving robustness without requiring concept labels in the main training data.

DetailsMotivation: Deep learning models in medical imaging often rely on spurious correlations rather than clinically meaningful features, leading to poor generalization under distribution shifts.

Method: Uses a small auxiliary dataset to synthesize concept examples, extracts Latent Concept Representations (LCRs) for relevant features, and incorporates a regularization term that guides CNNs to activate within concept-associated latent subspaces.

Result: Significantly improves robustness to spurious correlations in controlled experiments and enhances performance on diabetic retinopathy classification under both synthetic perturbations and out-of-distribution scenarios.

Conclusion: LCRReg provides a lightweight, architecture-agnostic strategy for improving model robustness without dense concept supervision, outperforming baselines like multitask learning and concept-based models.

Abstract: Deep learning models in medical imaging often achieve strong in-distribution performance but struggle to generalise under distribution shifts, frequently relying on spurious correlations instead of clinically meaningful features. We introduce LCRReg, a novel regularisation approach that leverages Latent Concept Representations (LCRs) (e.g., Concept Activation Vectors (CAVs)) to guide models toward semantically grounded representations. LCRReg requires no concept labels in the main training set and instead uses a small auxiliary dataset to synthesise high-quality, disentangled concept examples. We extract LCRs for predefined relevant features, and incorporate a regularisation term that guides a Convolutional Neural Network (CNN) to activate within latent subspaces associated with those concepts. We evaluate LCRReg across synthetic and real-world medical tasks. On a controlled toy dataset, it significantly improves robustness to injected spurious correlations and remains effective even in multi-concept and multiclass settings. On the diabetic retinopathy binary classification task, LCRReg enhances performance under both synthetic spurious perturbations and out-of-distribution (OOD) generalisation. Compared to baselines, including multitask learning, linear probing, and post-hoc concept-based models, LCRReg offers a lightweight, architecture-agnostic strategy for improving model robustness without requiring dense concept supervision. Code is available at the following link: https://github.com/Trustworthy-AI-UU-NKI/lcr_regularization
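
A hedged sketch of what a concept-alignment regularizer of this flavour can look like: penalise penultimate activations that point away from precomputed CAV directions. The exact form of LCRReg's term is not reproduced; `cavs` and the weighting are assumptions.

```python
import torch
import torch.nn.functional as F

def lcr_regularizer(features, cavs):
    """Hedged sketch of a concept-alignment penalty (not LCRReg's exact term).

    features: (B, D) penultimate CNN activations.
    cavs:     (K, D) precomputed Concept Activation Vectors (assumed given).
    Penalises activations that point away from each concept direction.
    """
    feats = F.normalize(features, dim=1)
    cavs = F.normalize(cavs, dim=1)
    cos = feats @ cavs.t()          # (B, K) cosine alignment per concept
    return F.relu(-cos).mean()      # zero when aligned, positive when opposed

# Sketch usage: total_loss = task_loss + reg_weight * lcr_regularizer(feats, cavs)
```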

[132] Forecasting Smog Events Using ConvLSTM: A Spatio-Temporal Approach for Aerosol Index Prediction in South Asia

Taimur Khan

Main category: cs.CV

TL;DR: This study develops a ConvLSTM neural network to forecast South Asian smog events using Sentinel-5P aerosol data, achieving good predictive performance for 5-day aerosol index forecasts.

DetailsMotivation: South Asian smog events cause severe air pollution with significant socio-economic impacts, but real-time forecasting systems for particulate matter concentrations are lacking at regional scales.

Method: Used Sentinel-5P air constituent data (2019-2023) and a Convolutional Long-Short Term Memory (ConvLSTM) neural network to capture spatial and temporal correlations, with UV Aerosol Index at 340-380 nm as predictor.

Result: Achieved Aerosol Index forecasting at five-day intervals with Mean Squared Error of ~0.0018, loss of ~0.3995, and Structural Similarity Index of ~0.74.

Conclusion: The ConvLSTM model effectively forecasts aerosol events but can be improved by integrating additional data and refining its architecture.

Abstract: The South Asian Smog refers to the recurring annual air pollution events marked by high contaminant levels, reduced visibility, and significant socio-economic impacts, primarily affecting the Indo-Gangetic Plains (IGP) from November to February. Over the past decade, increased air pollution sources such as crop residue burning, motor vehicles, and changing weather patterns have intensified these smog events. However, real-time forecasting systems for increased particulate matter concentrations are still not established at regional scale. The Aerosol Index, closely tied to smog formation and a key component in calculating the Air Quality Index (AQI), reflects particulate matter concentrations. This study forecasts aerosol events using Sentinel-5P air constituent data (2019-2023) and a Convolutional Long-Short Term Memory (ConvLSTM) neural network, which captures spatial and temporal correlations more effectively than previous models. Using the Ultraviolet (UV) Aerosol Index at 340-380 nm as the predictor, results show the Aerosol Index can be forecasted at five-day intervals with a Mean Squared Error of ~0.0018, loss of ~0.3995, and Structural Similarity Index of ~0.74. While effective, the model can be improved by integrating additional data and refining its architecture.
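
A minimal Keras sketch of a ConvLSTM next-frame forecaster over aerosol-index maps, assuming input windows of T past frames; layer sizes and the training setup are placeholders rather than the paper's configuration.

```python
import tensorflow as tf

# Minimal sketch: predict the next UV Aerosol Index map from a window of T
# past maps of shape (T, H, W, 1). Sizes are illustrative only.
T, H, W = 10, 64, 64
model = tf.keras.Sequential([
    tf.keras.Input(shape=(T, H, W, 1)),
    tf.keras.layers.ConvLSTM2D(32, 3, padding="same", return_sequences=True),
    tf.keras.layers.ConvLSTM2D(32, 3, padding="same", return_sequences=False),
    tf.keras.layers.Conv2D(1, 3, padding="same"),  # next-frame aerosol map
])
model.compile(optimizer="adam", loss="mse")
model.summary()
```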

[133] SCRNet: Spatial-Channel Regulation Network for Medical Ultrasound Image Segmentation

Weixin Xu, Ziliang Wang

Main category: cs.CV

TL;DR: Proposes SCRNet, a novel UNet-based framework with Feature Aggregation Module and Spatial-Channel Regulation Module that combines convolution and cross-attention to address both long-range dependencies and local context in medical ultrasound segmentation.

DetailsMotivation: Traditional CNN-based methods ignore long-range dependencies while Transformer-based methods overlook local contextual information in medical ultrasound image segmentation, creating a need for a hybrid approach.

Method: Developed Feature Aggregation Module (FAM) with Convolution and Cross-Attention Parallel Module (CCAPM) to process input features, integrated within Spatial-Channel Regulation Module (SCRM) and incorporated into UNet encoder to create SCRNet framework.

Result: Extensive experiments demonstrate SCRNet consistently achieves state-of-the-art performance compared to existing methods in medical ultrasound image segmentation.

Conclusion: The proposed SCRNet framework successfully addresses limitations of both CNN and Transformer approaches by effectively capturing both long-range dependencies and local contextual information through its novel module design.

Abstract: Medical ultrasound image segmentation presents a formidable challenge in the realm of computer vision. Traditional approaches rely on Convolutional Neural Networks (CNNs) and Transformer-based methods to address the intricacies of medical image segmentation. Nevertheless, inherent limitations persist, as CNN-based methods tend to disregard long-range dependencies, while Transformer-based methods may overlook local contextual information. To address these deficiencies, we propose a novel Feature Aggregation Module (FAM) designed to process two input features from the preceding layer. These features are seamlessly directed into two branches of the Convolution and Cross-Attention Parallel Module (CCAPM) to endow them with different roles in each of the two branches to help establish a strong connection between the two input features. This strategy enables our module to focus concurrently on both long-range dependencies and local contextual information by judiciously merging convolution operations with cross-attention mechanisms. Moreover, by integrating FAM within our proposed Spatial-Channel Regulation Module (SCRM), the ability to discern salient regions and informative features warranting increased attention is enhanced. Furthermore, by incorporating the SCRM into the encoder block of the UNet architecture, we introduce a novel framework dubbed Spatial-Channel Regulation Network (SCRNet). The results of our extensive experiments demonstrate the superiority of SCRNet, which consistently achieves state-of-the-art (SOTA) performance compared to existing methods.
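
The sketch below illustrates the general shape of a convolution + cross-attention parallel block fusing two input feature maps, with hypothetical channel sizes; it is not the CCAPM/FAM implementation.

```python
import torch
import torch.nn as nn

class ConvCrossAttnParallel(nn.Module):
    """Hedged sketch of a conv + cross-attention parallel block.

    Two feature maps from the previous layer are fused: one branch keeps local
    context with a 3x3 conv, the other relates the two maps with cross-attention.
    Channel sizes are placeholders, not SCRNet's.
    """
    def __init__(self, channels=64, heads=4):
        super().__init__()
        self.local = nn.Conv2d(channels, channels, 3, padding=1)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, feat_a, feat_b):
        b, c, h, w = feat_a.shape
        local = self.local(feat_a)                   # local-context branch
        q = feat_a.flatten(2).transpose(1, 2)        # (B, HW, C) query tokens
        kv = feat_b.flatten(2).transpose(1, 2)       # keys/values from the other map
        attn, _ = self.attn(q, kv, kv)               # long-range branch
        attn = attn.transpose(1, 2).reshape(b, c, h, w)
        return self.fuse(torch.cat([local, attn], dim=1))
```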

[134] PhysGM: Large Physical Gaussian Model for Feed-Forward 4D Synthesis

Chunji Lv, Zequn Chen, Donglin Di, Weinan Zhang, Hao Li, Wei Chen, Changsheng Li

Main category: cs.CV

TL;DR: PhysGM is a feed-forward framework that jointly predicts 3D Gaussian representations and physical properties from single images, enabling fast 4D rendering and physical simulation without relying on pre-reconstructed assets or unstable optimization methods.

DetailsMotivation: Current physics-grounded 3D motion synthesis methods rely on pre-reconstructed 3DGS representations and face limitations with inflexible physical attributes or unstable video model guidance, requiring better integration of physics prediction and rendering.

Method: Joint optimization of Gaussian reconstruction and probabilistic physics prediction, refined with physically plausible reference videos using Direct Preference Optimization (DPO) to avoid complex SDS optimization. Trained on PhysAssets dataset with 24,000+ 3D assets.

Result: Generates high-fidelity 4D simulations from single images in one minute, achieving significant speedup over prior works while maintaining realistic rendering quality.

Conclusion: PhysGM provides an efficient feed-forward solution for joint 3D representation and physics prediction, overcoming limitations of previous methods and enabling fast, high-quality physical simulations from single images.

Abstract: While physics-grounded 3D motion synthesis has seen significant progress, current methods face critical limitations. They typically rely on pre-reconstructed 3D Gaussian Splatting (3DGS) representations, while physics integration depends on either inflexible, manually defined physical attributes or unstable, optimization-heavy guidance from video models. To overcome these challenges, we introduce PhysGM, a feed-forward framework that jointly predicts a 3D Gaussian representation and its physical properties from a single image, enabling immediate physical simulation and high-fidelity 4D rendering. We first establish a base model by jointly optimizing for Gaussian reconstruction and probabilistic physics prediction. The model is then refined with physically plausible reference videos to enhance both rendering fidelity and physics prediction accuracy. We adopt Direct Preference Optimization (DPO) to align its simulations with reference videos, circumventing Score Distillation Sampling (SDS) optimization, which requires back-propagating gradients through the complex differentiable simulation and rasterization. To facilitate training, we introduce a new dataset, PhysAssets, of over 24,000 3D assets, annotated with physical properties and corresponding guiding videos. Experimental results demonstrate that our method effectively generates high-fidelity 4D simulations from a single image in one minute. This represents a significant speedup over prior works while delivering realistic rendering results. Our project page is at: https://hihixiaolv.github.io/PhysGM.github.io/

[135] DIME-Net: A Dual-Illumination Adaptive Enhancement Network Based on Retinex and Mixture-of-Experts

Ziang Wang, Xiaoqin Wang, Dingyi Wang, Qiang Li, Shushan Qiao

Main category: cs.CV

TL;DR: DIME-Net is a dual-illumination enhancement framework that handles both low-light and backlit images through adaptive expert selection and damage restoration modules, achieving robust performance across diverse lighting conditions without retraining.

DetailsMotivation: Existing methods focus on single illumination degradation types and lack unified handling of diverse lighting conditions like low-light and backlit scenarios that commonly degrade image quality in real-world environments.

Method: Proposes a Mixture-of-Experts illumination estimator with sparse gating to adaptively select S-curve expert networks based on input characteristics, integrated with Retinex theory. Includes a damage restoration module with Illumination-Aware Cross Attention and Sequential-State Global Attention to correct artifacts and color distortions. Uses a hybrid MixBL dataset for training.

Result: Achieves competitive performance on both synthetic and real-world low-light and backlit datasets without retraining, demonstrating strong generalization across diverse illumination conditions.

Conclusion: DIME-Net shows effective unified handling of multiple illumination degradations with robust adaptability, making it suitable for practical multimedia applications under complex lighting scenarios.

Abstract: Image degradation caused by complex lighting conditions such as low-light and backlit scenarios is commonly encountered in real-world environments, significantly affecting image quality and downstream vision tasks. Most existing methods focus on a single type of illumination degradation and lack the ability to handle diverse lighting conditions in a unified manner. To address this issue, we propose a dual-illumination enhancement framework called DIME-Net. The core of our method is a Mixture-of-Experts illumination estimator module, where a sparse gating mechanism adaptively selects suitable S-curve expert networks based on the illumination characteristics of the input image. By integrating Retinex theory, this module effectively performs enhancement tailored to both low-light and backlit images. To further correct illumination-induced artifacts and color distortions, we design a damage restoration module equipped with Illumination-Aware Cross Attention and Sequential-State Global Attention mechanisms. In addition, we construct a hybrid illumination dataset, MixBL, by integrating existing datasets, allowing our model to achieve robust illumination adaptability through a single training process. Experimental results show that DIME-Net achieves competitive performance on both synthetic and real-world low-light and backlit datasets without any retraining. These results demonstrate its generalization ability and potential for practical multimedia applications under diverse and complex illumination conditions.
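
As a rough illustration of sparse gating over S-curve experts, the sketch below lets a top-1 gate pick a tiny expert that predicts parameters of a crude tone curve; DIME-Net's actual experts, gate, and Retinex decomposition are not reproduced.

```python
import torch
import torch.nn as nn

class SCurveMoE(nn.Module):
    """Hedged sketch: a sparse (top-1) gate picks one S-curve expert per image.

    Each 'expert' is just a tiny MLP predicting (gamma, strength) of a crude
    tone curve; this is an illustration, not DIME-Net's estimator.
    """
    def __init__(self, num_experts=4):
        super().__init__()
        self.gate = nn.Linear(3, num_experts)                 # gate on mean RGB stats
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(3, 16), nn.ReLU(), nn.Linear(16, 2))
            for _ in range(num_experts)
        )

    def forward(self, img):                                   # img: (B, 3, H, W) in [0, 1]
        stats = img.mean(dim=(2, 3))                          # (B, 3) illumination cue
        expert_idx = self.gate(stats).argmax(dim=1)           # sparse expert selection
        out = []
        for b in range(img.size(0)):
            gamma, strength = self.experts[expert_idx[b]](stats[b]).unbind(-1)
            x = img[b].clamp(1e-4, 1.0) ** torch.exp(gamma)   # gamma-style lift
            s = x + torch.sigmoid(strength) * (x - 0.5)       # crude S-curve bend
            out.append(s.clamp(0, 1))
        return torch.stack(out)
```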

[136] ViT-FIQA: Assessing Face Image Quality using Vision Transformers

Andrea Atzori, Fadi Boutros, Naser Damer

Main category: cs.CV

TL;DR: ViT-FIQA is a novel face image quality assessment method that extends Vision Transformer backbones with a learnable quality token to predict face recognition utility scores, achieving state-of-the-art performance across various benchmarks.

DetailsMotivation: Current FIQA methods primarily rely on CNNs, leaving the potential of Vision Transformer architectures underexplored for face image quality assessment tasks.

Method: Extends standard ViT backbones with a learnable quality token concatenated with image patch tokens, processed via global self-attention. Uses two output heads: one for face representation learning and another for quality score regression.

Result: Extensive experiments show ViT-FIQA consistently achieves top-tier performance on challenging benchmarks across both CNN- and ViT-based face recognition models.

Conclusion: Transformer-based architectures are highly effective for modeling face image utility, and ViTs serve as a scalable foundation for future FIQA research.

Abstract: Face Image Quality Assessment (FIQA) aims to predict the utility of a face image for face recognition (FR) systems. State-of-the-art FIQA methods mainly rely on convolutional neural networks (CNNs), leaving the potential of Vision Transformer (ViT) architectures underexplored. This work proposes ViT-FIQA, a novel approach that extends standard ViT backbones, originally optimized for FR, through a learnable quality token designed to predict a scalar utility score for any given face image. The learnable quality token is concatenated with the standard image patch tokens, and the whole sequence is processed via global self-attention by the ViT encoders to aggregate contextual information across all patches. At the output of the backbone, ViT-FIQA branches into two heads: (1) the patch tokens are passed through a fully connected layer to learn discriminative face representations via a margin-penalty softmax loss, and (2) the quality token is fed into a regression head to learn to predict the face sample’s utility. Extensive experiments on challenging benchmarks and several FR models, including both CNN- and ViT-based architectures, demonstrate that ViT-FIQA consistently achieves top-tier performance. These results underscore the effectiveness of transformer-based architectures in modeling face image utility and highlight the potential of ViTs as a scalable foundation for future FIQA research: https://cutt.ly/irHlzXUC
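
A hedged sketch of the quality-token mechanism: a learnable token is prepended to patch tokens, and separate heads read identity embeddings and the utility score after the encoder. Dimensions and the pooling choice are assumptions, not the ViT-FIQA code.

```python
import torch
import torch.nn as nn

class QualityTokenViT(nn.Module):
    """Hedged sketch of the quality-token idea (not the ViT-FIQA implementation)."""
    def __init__(self, dim=384, depth=4, heads=6, emb_dim=512):
        super().__init__()
        self.quality_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.embed_head = nn.Linear(dim, emb_dim)     # identity embedding (margin-penalty softmax)
        self.quality_head = nn.Linear(dim, 1)         # scalar utility regression

    def forward(self, patch_tokens):                  # patch_tokens: (B, N, dim)
        q = self.quality_token.expand(patch_tokens.size(0), -1, -1)
        tokens = self.encoder(torch.cat([q, patch_tokens], dim=1))
        quality = self.quality_head(tokens[:, 0])     # read the quality token
        identity = self.embed_head(tokens[:, 1:].mean(dim=1))
        return identity, quality
```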

[137] ROVR-Open-Dataset: A Large-Scale Depth Dataset for Autonomous Driving

Xianda Guo, Ruijun Zhang, Yiqun Duan, Ruilin Wang, Keyuan Zhou, Wenzhao Zheng, Wenke Huang, Gangwei Xu, Mike Horton, Yuan Si, Hao Zhao, Long Chen

Main category: cs.CV

TL;DR: New large-scale diverse depth estimation dataset for autonomous driving with 20K frames, addressing limitations of existing datasets through cost-efficient acquisition and sparse ground truth.

DetailsMotivation: Existing depth datasets like KITTI, nuScenes, and DDAD have limitations in diversity and scalability, with benchmark performance approaching saturation. Need for new large-scale datasets to support foundation models and multi-modal learning.

Method: Created a large-scale diverse dataset with 20K video frames using lightweight acquisition pipeline for broad scene coverage at low cost. Uses sparse but statistically sufficient ground truth for robust training.

Result: Dataset presents greater diversity in driving scenarios and lower depth density compared to existing datasets. Benchmark experiments show substantial performance gaps in challenging conditions.

Conclusion: Establishes a new platform for advancing depth estimation research with a dataset that creates new challenges for generalization and supports the development of foundation models.

Abstract: Depth estimation is a fundamental task for 3D scene understanding in autonomous driving, robotics, and augmented reality. Existing depth datasets, such as KITTI, nuScenes, and DDAD, have advanced the field but suffer from limitations in diversity and scalability. As benchmark performance on these datasets approaches saturation, there is an increasing need for a new generation of large-scale, diverse, and cost-efficient datasets to support the era of foundation models and multi-modal learning. To address these challenges, we introduce a large-scale, diverse, frame-wise continuous dataset for depth estimation in dynamic outdoor driving environments, comprising 20K video frames to evaluate existing methods. Our lightweight acquisition pipeline ensures broad scene coverage at low cost, while sparse yet statistically sufficient ground truth enables robust training. Compared to existing datasets, ours presents greater diversity in driving scenarios and lower depth density, creating new challenges for generalization. Benchmark experiments with standard monocular depth estimation models validate the dataset’s utility and highlight substantial performance gaps in challenging conditions, establishing a new platform for advancing depth estimation research.

[138] OmViD: Omni-supervised active learning for video action detection

Aayush Rana, Akash Kumar, Vibhav Vineet, Yogesh S Rawat

Main category: cs.CV

TL;DR: This paper analyzes different annotation types for video action detection and proposes an active learning strategy to determine appropriate annotation levels per video, plus a 3D-superpixel method to generate pseudo-labels, reducing annotation costs with minimal performance impact.

DetailsMotivation: Video action detection requires dense spatio-temporal annotations that are challenging and expensive to obtain, while videos vary in difficulty and may not need the same annotation level.

Method: Proposes active learning strategy to estimate necessary annotation type per video, and introduces spatio-temporal 3D-superpixel approach to generate pseudo-labels from various annotation types (tags, points, scribbles, boxes, masks).

Result: Validated on UCF101-24 and JHMDB-21 datasets, significantly reducing annotation costs while maintaining minimal performance loss.

Conclusion: The approach effectively addresses annotation cost challenges in video action detection by adaptively selecting appropriate annotation types and generating pseudo-labels for efficient training.

Abstract: Video action detection requires dense spatio-temporal annotations, which are both challenging and expensive to obtain. However, real-world videos often vary in difficulty and may not require the same level of annotation. This paper analyzes the appropriate annotation types for each sample and their impact on spatio-temporal video action detection. It focuses on two key aspects: 1) how to obtain varying levels of annotation for videos, and 2) how to learn action detection from different annotation types. The study explores video-level tags, points, scribbles, bounding boxes, and pixel-level masks. First, a simple active learning strategy is proposed to estimate the necessary annotation type for each video. Then, a novel spatio-temporal 3D-superpixel approach is introduced to generate pseudo-labels from these annotations, enabling effective training. The approach is validated on UCF101-24 and JHMDB-21 datasets, significantly cutting annotation costs with minimal performance loss.

[139] Physics-Based 3D Simulation for Synthetic Data Generation and Failure Analysis in Packaging Stability Assessment

Samuel Seligardi, Pietro Musoni, Eleonora Iotti, Gianluca Contesso, Alessandro Dal Palù

Main category: cs.CV

TL;DR: A simulation system for pallet safety testing using 3D graphics and deep learning to predict crash outcomes, reducing physical testing needs.

DetailsMotivation: Rising logistics demands require automated safety systems, and plastic wrapping environmental concerns drive need for eco-friendly alternatives that maintain safety standards.

Method: Developed a fully controllable 3D virtual simulation environment that replicates pallet behavior with various configurations, materials, and dynamic conditions. Trained a deep neural network to analyze rendered videos as a crash-test predictor.

Result: Created an accurate physical simulation system that reduces physical testing requirements, cuts costs and environmental impact, while improving measurement accuracy for pallet dynamics analysis.

Conclusion: The simulation system combined with deep learning video analysis provides an effective tool for safety evaluation of pallet configurations, offering both economic and environmental benefits over traditional physical testing methods.

Abstract: The design and analysis of pallet setups are essential for ensuring the safety of package transportation. With rising demands in the logistics sector, the development of automated systems utilizing advanced technologies has become increasingly crucial. Moreover, the widespread use of plastic wrapping has motivated researchers to investigate eco-friendly alternatives that still adhere to safety standards. We present a fully controllable and accurate physical simulation system capable of replicating the behavior of moving pallets. It features a 3D graphics-based virtual environment that supports a wide range of configurations, including variable package layouts, different wrapping materials, and diverse dynamic conditions. This innovative approach reduces the need for physical testing, cutting costs and environmental impact while improving measurement accuracy for analyzing pallet dynamics. Additionally, we train a deep neural network to evaluate the rendered videos generated by our simulator, as a crash-test predictor for pallet configurations, further enhancing the system’s utility in safety analysis.

[140] Self-Supervised Sparse Sensor Fusion for Long Range Perception

Edoardo Palladin, Samuel Brucker, Filippo Ghilotti, Praveen Narayanan, Mario Bijelic, Felix Heide

Main category: cs.CV

TL;DR: Efficient 3D perception system extends autonomous vehicle perception range to 250m for highway driving, achieving 26.6% mAP improvement in object detection and 30.5% reduction in LiDAR forecasting error.

DetailsMotivation: Autonomous vehicles need longer perception ranges (250m+) for safe highway driving at high speeds, especially for large trucks with high inertia. Existing BEV approaches have quadratic cost increases with distance.

Method: Built on sparse representation with efficient 3D encoding of multi-modal and temporal features, plus novel self-supervised pre-training using unlabeled camera-LiDAR data.

Result: Achieved 250m perception range with 26.6% mAP improvement in object detection and 30.5% decrease in Chamfer Distance for LiDAR forecasting compared to existing methods.

Conclusion: The approach successfully overcomes range limitations of existing perception systems, enabling safe long-distance highway autonomy for both passenger vehicles and large trucks.

Abstract: Outside of urban hubs, autonomous cars and trucks have to master driving on intercity highways. Safe, long-distance highway travel at speeds exceeding 100 km/h demands perception distances of at least 250 m, which is about five times the 50-100 m typically addressed in city driving, to allow sufficient planning and braking margins. Increasing the perception range also allows extending autonomy from light two-ton passenger vehicles to large-scale forty-ton trucks, which need a longer planning horizon due to their high inertia. However, most existing perception approaches focus on shorter ranges and rely on Bird’s Eye View (BEV) representations, which incur quadratic increases in memory and compute costs as distance grows. To overcome this limitation, we build on top of a sparse representation and introduce an efficient 3D encoding of multi-modal and temporal features, along with a novel self-supervised pre-training scheme that enables large-scale learning from unlabeled camera-LiDAR data. Our approach extends perception distances to 250 meters and achieves a 26.6% improvement in mAP in object detection and a decrease of 30.5% in Chamfer Distance in LiDAR forecasting compared to existing methods. Project Page: https://light.princeton.edu/lrs4fusion/
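
For reference, the Chamfer Distance used to score LiDAR forecasting can be computed as below; this brute-force version is only illustrative and ignores any subsampling or normalization the paper's evaluation may use.

```python
import torch

def chamfer_distance(pc_a, pc_b):
    """Symmetric Chamfer Distance between point clouds of shape (N, 3) and (M, 3).

    Brute-force O(N*M) version for illustration only.
    """
    d = torch.cdist(pc_a, pc_b)            # pairwise distances (N, M)
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

pred = torch.rand(2048, 3)
gt = torch.rand(2048, 3)
print(chamfer_distance(pred, gt))
```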

[141] ResPlan: A Large-Scale Vector-Graph Dataset of 17,000 Residential Floor Plans

Mohamed Abouagour, Eleftherios Garyfallidis

Main category: cs.CV

TL;DR: ResPlan is a large-scale dataset of 17,000 detailed residential floor plans with precise architectural annotations, addressing limitations of existing datasets with enhanced visual fidelity and structural diversity.

DetailsMotivation: To overcome key limitations of existing floor plan datasets like RPLAN and MSD by providing more realistic, structurally diverse residential layouts with better visual fidelity for spatial AI research.

Method: Created a dataset of 17,000 detailed floor plans with precise annotations of architectural elements and functional spaces. Developed an open-source pipeline for geometry cleaning, alignment, and annotation refinement. Provided plans in both geometric and graph-based formats.

Result: A comprehensive dataset that supports diverse applications including robotics, reinforcement learning, generative AI, VR/AR, simulations, and game development. Includes structured room connectivity representations for graph-based spatial reasoning.

Conclusion: ResPlan represents a significant advance in scale, realism, and usability, providing a robust foundation for developing and benchmarking next-generation spatial intelligence systems with comparative analyses and open benchmark tasks.

Abstract: We introduce ResPlan, a large-scale dataset of 17,000 detailed, structurally rich, and realistic residential floor plans, created to advance spatial AI research. Each plan includes precise annotations of architectural elements (walls, doors, windows, balconies) and functional spaces (such as kitchens, bedrooms, and bathrooms). ResPlan addresses key limitations of existing datasets such as RPLAN (Wu et al., 2019) and MSD (van Engelenburg et al., 2024) by offering enhanced visual fidelity and greater structural diversity, reflecting realistic and non-idealized residential layouts. Designed as a versatile, general-purpose resource, ResPlan supports a wide range of applications including robotics, reinforcement learning, generative AI, virtual and augmented reality, simulations, and game development. Plans are provided in both geometric and graph-based formats, enabling direct integration into simulation engines and fast 3D conversion. A key contribution is an open-source pipeline for geometry cleaning, alignment, and annotation refinement. Additionally, ResPlan includes structured representations of room connectivity, supporting graph-based spatial reasoning tasks. Finally, we present comparative analyses with existing benchmarks and outline several open benchmark tasks enabled by ResPlan. Ultimately, ResPlan offers a significant advance in scale, realism, and usability, providing a robust foundation for developing and benchmarking next-generation spatial intelligence systems.
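
A toy example of a graph-based room-connectivity representation in the spirit described above, using networkx; the node and edge attributes are illustrative, not ResPlan's actual schema.

```python
import networkx as nx

# Illustrative room-connectivity graph: rooms as nodes, openings as edges.
G = nx.Graph()
G.add_node("kitchen", area_m2=12.0)
G.add_node("living_room", area_m2=24.5)
G.add_node("bedroom", area_m2=14.0)
G.add_edge("kitchen", "living_room", connection="door")
G.add_edge("living_room", "bedroom", connection="door")

# Graph-based spatial reasoning, e.g. a shortest room-to-room route:
print(nx.shortest_path(G, "kitchen", "bedroom"))
```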

[142] Online 3D Gaussian Splatting Modeling with Novel View Selection

Byeonggwon Lee, Junkyu Park, Khang Truong Giang, Soohwan Song

Main category: cs.CV

TL;DR: A novel method for online 3D Gaussian Splatting that improves scene completeness through adaptive view selection of both keyframes and optimal non-keyframes, outperforming state-of-the-art methods in complex outdoor scenes.

DetailsMotivation: Existing methods rely solely on keyframes which are insufficient for complete scene reconstruction, and online processing constraints limit the use of many frames or extensive training iterations for generalizable models.

Method: Proposes adaptive view selection that analyzes reconstruction quality online to choose optimal non-keyframes for additional training, integrating both keyframes and selected non-keyframes to refine incomplete regions from diverse viewpoints. Incorporates an online multi-view stereo approach for consistent 3D information throughout the modeling process.

Result: The method demonstrates superior performance compared to state-of-the-art methods, delivering exceptional results in complex outdoor scenes with significantly enhanced completeness.

Conclusion: The proposed approach effectively addresses the limitations of keyframe-only methods by adaptively selecting optimal views, resulting in high-quality 3DGS models with improved scene completeness while maintaining online processing constraints.

Abstract: This study addresses the challenge of generating online 3D Gaussian Splatting (3DGS) models from RGB-only frames. Previous studies have employed dense SLAM techniques to estimate 3D scenes from keyframes for 3DGS model construction. However, these methods are limited by their reliance solely on keyframes, which are insufficient to capture an entire scene, resulting in incomplete reconstructions. Moreover, building a generalizable model requires incorporating frames from diverse viewpoints to achieve broader scene coverage. However, online processing restricts the use of many frames or extensive training iterations. Therefore, we propose a novel method for high-quality 3DGS modeling that improves model completeness through adaptive view selection. By analyzing reconstruction quality online, our approach selects optimal non-keyframes for additional training. By integrating both keyframes and selected non-keyframes, the method refines incomplete regions from diverse viewpoints, significantly enhancing completeness. We also present a framework that incorporates an online multi-view stereo approach, ensuring consistency in 3D information throughout the 3DGS modeling process. Experimental results demonstrate that our method outperforms state-of-the-art methods, delivering exceptional performance in complex outdoor scenes.

[143] Backdooring Self-Supervised Contrastive Learning by Noisy Alignment

Tuo Chen, Jie Gui, Minjing Dong, Ju Jia, Lanting Fang, Jian Liu

Main category: cs.CV

TL;DR: Noisy Alignment (NA) is a data poisoning backdoor attack method for self-supervised contrastive learning that explicitly suppresses noise components in poisoned images through strategic manipulation of random cropping, achieving state-of-the-art performance while maintaining clean-data accuracy.

DetailsMotivation: Existing data poisoning backdoor attacks (DPCLs) for contrastive learning suffer from limited efficacy due to fragile implicit co-occurrence dependencies and inadequate suppression of discriminative features in backdoored images.

Method: The method identifies and extracts the critical objective of noisy alignment from training-controllable CL attacks, implements it by strategically manipulating contrastive learning’s random cropping mechanism, and formulates this as an image layout optimization problem with theoretically derived optimal parameters.

Result: Noisy Alignment achieves state-of-the-art performance compared to existing DPCLs, maintains clean-data accuracy, and demonstrates robustness against common backdoor defenses.

Conclusion: The proposed Noisy Alignment method provides a simple yet effective approach for data poisoning backdoor attacks in contrastive learning by explicitly addressing noise component suppression through optimized random cropping manipulation.

Abstract: Self-supervised contrastive learning (CL) effectively learns transferable representations from unlabeled data containing images or image-text pairs but suffers vulnerability to data poisoning backdoor attacks (DPCLs). An adversary can inject poisoned images into pretraining datasets, causing compromised CL encoders to exhibit targeted misbehavior in downstream tasks. Existing DPCLs, however, achieve limited efficacy due to their dependence on fragile implicit co-occurrence between backdoor and target object and inadequate suppression of discriminative features in backdoored images. We propose Noisy Alignment (NA), a DPCL method that explicitly suppresses noise components in poisoned images. Inspired by powerful training-controllable CL attacks, we identify and extract the critical objective of noisy alignment, adapting it effectively into data-poisoning scenarios. Our method implements noisy alignment by strategically manipulating contrastive learning’s random cropping mechanism, formulating this process as an image layout optimization problem with theoretically derived optimal parameters. The resulting method is simple yet effective, achieving state-of-the-art performance compared to existing DPCLs, while maintaining clean-data accuracy. Furthermore, Noisy Alignment demonstrates robustness against common backdoor defenses. Codes can be found at https://github.com/jsrdcht/Noisy-Alignment.

[144] InfiniteTalk: Audio-driven Video Generation for Sparse-Frame Video Dubbing

Shaoshu Yang, Zhe Kong, Feng Gao, Meng Cheng, Xiangyu Liu, Yong Zhang, Zhuoliang Kang, Wenhan Luo, Xunliang Cai, Ran He, Xiaoming Wei

Main category: cs.CV

TL;DR: Sparse-frame video dubbing paradigm that preserves reference keyframes for identity/gesture preservation while enabling audio-synchronized full-body motion editing, overcoming limitations of conventional mouth-only dubbing.

DetailsMotivation: Conventional video dubbing techniques are limited to mouth region editing, causing discordant facial expressions and body gestures that compromise viewer immersion. There's a need for holistic, audio-synchronized full-body motion editing.

Method: InfiniteTalk - a streaming audio-driven generator for infinite-length dubbing. Uses temporal context frames for seamless transitions and optimized sampling strategy with fine-grained reference frame positioning to achieve adaptive conditioning.

Result: State-of-the-art performance on HDTF, CelebV-HQ, and EMTD datasets. Superior visual realism, emotional coherence, and full-body motion synchronization demonstrated through quantitative metrics.

Conclusion: The proposed sparse-frame video dubbing approach with InfiniteTalk architecture successfully overcomes limitations of traditional dubbing methods, enabling holistic audio-driven human animation with preserved identity and synchronized full-body motion.

Abstract: Recent breakthroughs in video AIGC have ushered in a transformative era for audio-driven human animation. However, conventional video dubbing techniques remain constrained to mouth region editing, resulting in discordant facial expressions and body gestures that compromise viewer immersion. To overcome this limitation, we introduce sparse-frame video dubbing, a novel paradigm that strategically preserves reference keyframes to maintain identity, iconic gestures, and camera trajectories while enabling holistic, audio-synchronized full-body motion editing. Through critical analysis, we identify why naive image-to-video models fail in this task, particularly their inability to achieve adaptive conditioning. Addressing this, we propose InfiniteTalk, a streaming audio-driven generator designed for infinite-length long sequence dubbing. This architecture leverages temporal context frames for seamless inter-chunk transitions and incorporates a simple yet effective sampling strategy that optimizes control strength via fine-grained reference frame positioning. Comprehensive evaluations on HDTF, CelebV-HQ, and EMTD datasets demonstrate state-of-the-art performance. Quantitative metrics confirm superior visual realism, emotional coherence, and full-body motion synchronization.

[145] GeoSAM2: Unleashing the Power of SAM2 for 3D Part Segmentation

Ken Deng, Yunhan Yang, Jingxiang Sun, Xihui Liu, Yebin Liu, Ding Liang, Yan-Pei Cao

Main category: cs.CV

TL;DR: DetailGen3D is a generative method that enhances geometric details in 3D shapes generated from sparse views, using data-dependent flows in latent space and token matching for efficient refinement.

DetailsMotivation: Existing 3D generation methods from sparse/single views often produce shapes lacking geometric detail due to computational constraints, creating a need for efficient detail enhancement.

Method: Uses data-dependent flows in latent space for coarse-to-fine transformation, token matching strategy for spatial correspondence, and carefully designed training data matching synthesized coarse shape characteristics.

Result: Achieves high-fidelity geometric detail synthesis while maintaining training efficiency, effectively enhancing shapes from various 3D generation and reconstruction approaches.

Conclusion: DetailGen3D provides an efficient solution for adding geometric details to 3D shapes generated from sparse inputs, overcoming computational limitations of traditional methods.

Abstract: Modern 3D generation methods can rapidly create shapes from sparse or single views, but their outputs often lack geometric detail due to computational constraints. We present DetailGen3D, a generative approach specifically designed to enhance these generated 3D shapes. Our key insight is to model the coarse-to-fine transformation directly through data-dependent flows in latent space, avoiding the computational overhead of large-scale 3D generative models. We introduce a token matching strategy that ensures accurate spatial correspondence during refinement, enabling local detail synthesis while preserving global structure. By carefully designing our training data to match the characteristics of synthesized coarse shapes, our method can effectively enhance shapes produced by various 3D generation and reconstruction approaches, from single-view to sparse multi-view inputs. Extensive experiments demonstrate that DetailGen3D achieves high-fidelity geometric detail synthesis while maintaining efficiency in training.

[146] Distilled-3DGS:Distilled 3D Gaussian Splatting

Lintao Xiang, Xinkai Chen, Jianhuang Lai, Guangcong Wang

Main category: cs.CV

TL;DR: Knowledge distillation framework for 3D Gaussian Splatting that reduces memory/storage requirements while maintaining high-fidelity rendering quality.

DetailsMotivation: 3DGS requires large numbers of 3D Gaussians for high-fidelity rendering, leading to substantial memory consumption and storage requirements that need to be addressed.

Method: Proposes knowledge distillation with multiple teacher models (vanilla 3DGS, noise-augmented variants, dropout-regularized versions) and a structural similarity loss to maintain geometric consistency between student and teacher models.

Result: Achieves promising rendering results in both quality and storage efficiency compared to state-of-the-art methods across diverse datasets.

Conclusion: Distilled-3DGS is a simple yet effective framework that successfully addresses the memory/storage limitations of 3DGS while maintaining rendering fidelity.

Abstract: 3D Gaussian Splatting (3DGS) has exhibited remarkable efficacy in novel view synthesis (NVS). However, it suffers from a significant drawback: achieving high-fidelity rendering typically necessitates a large number of 3D Gaussians, resulting in substantial memory consumption and storage requirements. To address this challenge, we propose the first knowledge distillation framework for 3DGS, featuring various teacher models, including vanilla 3DGS, noise-augmented variants, and dropout-regularized versions. The outputs of these teachers are aggregated to guide the optimization of a lightweight student model. To distill the hidden geometric structure, we propose a structural similarity loss to boost the consistency of spatial geometric distributions between the student and teacher models. Through comprehensive quantitative and qualitative evaluations across diverse datasets, the proposed Distilled-3DGS, a simple yet effective framework without bells and whistles, achieves promising results in both rendering quality and storage efficiency compared to state-of-the-art methods. Project page: https://distilled3dgs.github.io. Code: https://github.com/lt-xiang/Distilled-3DGS.
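
A hedged sketch of the distillation idea: aggregate renders from several teachers and penalise both photometric and structural disagreement of the student render. The single-window SSIM and the loss weighting below are simplifications for illustration, not the paper's loss.

```python
import torch

def global_ssim(a, b, eps=1e-6):
    """Single-window SSIM over whole images -- a crude stand-in for a
    structural similarity term, used here only for illustration."""
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + eps) * (2 * cov + eps)) / \
           ((mu_a ** 2 + mu_b ** 2 + eps) * (var_a + var_b + eps))

def distillation_loss(student_render, teacher_renders, w_ssim=0.2):
    """teacher_renders: list of renders from vanilla / noise-augmented /
    dropout-regularised teachers (assumed precomputed)."""
    target = torch.stack(teacher_renders).mean(dim=0)     # aggregate the teachers
    photometric = (student_render - target).abs().mean()  # L1 to the ensemble
    structural = 1.0 - global_ssim(student_render, target)
    return photometric + w_ssim * structural
```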

[147] Beyond Simple Edits: Composed Video Retrieval with Dense Modifications

Omkar Thawakar, Dmitry Demidov, Ritesh Thawkar, Rao Muhammad Anwer, Mubarak Shah, Fahad Shahbaz Khan, Salman Khan

Main category: cs.CV

TL;DR: Novel dataset Dense-WebVid-CoVR with 1.6M samples and dense modification text, plus Cross-Attention fusion model achieving state-of-the-art 71.3% Recall@1 for composed video retrieval.

DetailsMotivation: Standard retrieval frameworks struggle with fine-grained compositional queries and temporal variations in video retrieval tasks, limiting their ability to handle detailed modifications.

Method: Developed a new model integrating visual and textual information through Cross-Attention fusion using grounded text encoder for precise alignment between dense query modifications and target videos.

Result: Achieved state-of-the-art results with 71.3% Recall@1 in visual+text setting, outperforming previous methods by 3.4% across all metrics.

Conclusion: The proposed dataset and model effectively address fine-grained composed video retrieval challenges, demonstrating superior performance in leveraging detailed video descriptions and dense modification texts.

Abstract: Composed video retrieval is a challenging task that strives to retrieve a target video based on a query video and a textual description detailing specific modifications. Standard retrieval frameworks typically struggle to handle the complexity of fine-grained compositional queries and variations in temporal understanding, limiting their retrieval ability in the fine-grained setting. To address this issue, we introduce a novel dataset that captures both fine-grained and composed actions across diverse video segments, enabling more detailed compositional changes in retrieved video content. The proposed dataset, named Dense-WebVid-CoVR, consists of 1.6 million samples with dense modification text that is around seven times more than its existing counterpart. We further develop a new model that integrates visual and textual information through Cross-Attention (CA) fusion using a grounded text encoder, enabling precise alignment between dense query modifications and target videos. The proposed model achieves state-of-the-art results surpassing existing methods on all metrics. Notably, it achieves 71.3% Recall@1 in the visual+text setting and outperforms the state-of-the-art by 3.4%, highlighting its efficacy in terms of leveraging detailed video descriptions and dense modification texts. Our proposed dataset, code, and model are available at: https://github.com/OmkarThawakar/BSE-CoVR

[148] LongSplat: Robust Unposed 3D Gaussian Splatting for Casual Long Videos

Chin-Yang Lin, Cheng Sun, Fu-En Yang, Min-Hung Chen, Yen-Yu Lin, Yu-Lun Liu

Main category: cs.CV

TL;DR: LongSplat is a novel 3D Gaussian Splatting framework that addresses challenges in novel view synthesis from long videos with irregular camera motion, unknown poses, and expansive scenes, achieving state-of-the-art results through incremental joint optimization, robust pose estimation, and efficient octree anchor formation.

DetailsMotivation: Current methods for novel view synthesis from long videos suffer from pose drift, inaccurate geometry initialization, and severe memory limitations, especially with irregular camera motion and unknown camera poses in expansive scenes.

Method: LongSplat introduces three key components: (1) Incremental Joint Optimization that concurrently optimizes camera poses and 3D Gaussians, (2) a robust Pose Estimation Module leveraging learned 3D priors, and (3) an efficient Octree Anchor Formation mechanism that converts dense point clouds into anchors based on spatial density.

Result: Extensive experiments on challenging benchmarks demonstrate that LongSplat achieves state-of-the-art results, substantially improving rendering quality, pose accuracy, and computational efficiency compared to prior approaches.

Conclusion: LongSplat provides a robust solution for novel view synthesis from long, casually captured videos, effectively addressing critical challenges of pose drift, geometry initialization, and memory limitations while maintaining high performance and efficiency.

Abstract: LongSplat addresses critical challenges in novel view synthesis (NVS) from casually captured long videos characterized by irregular camera motion, unknown camera poses, and expansive scenes. Current methods often suffer from pose drift, inaccurate geometry initialization, and severe memory limitations. To address these issues, we introduce LongSplat, a robust unposed 3D Gaussian Splatting framework featuring: (1) Incremental Joint Optimization that concurrently optimizes camera poses and 3D Gaussians to avoid local minima and ensure global consistency; (2) a robust Pose Estimation Module leveraging learned 3D priors; and (3) an efficient Octree Anchor Formation mechanism that converts dense point clouds into anchors based on spatial density. Extensive experiments on challenging benchmarks demonstrate that LongSplat achieves state-of-the-art results, substantially improving rendering quality, pose accuracy, and computational efficiency compared to prior approaches. Project page: https://linjohnss.github.io/longsplat/

[149] Advancing Toward Robust and Scalable Fingerprint Orientation Estimation: From Gradients to Deep Learning

Amit Kumar Trivedi, Jasvinder Pal Singh

Main category: cs.CV

TL;DR: The paper analyzes the evolution of fingerprint orientation estimation methods, highlighting limitations of current approaches and proposing hybrid gradient-based machine learning methods for improved performance.

DetailsMotivation: To address persistent challenges in fingerprint recognition including degraded image quality, damaged ridge structures, and background noise that impact system performance and reliability.

Method: Proposes developing hybrid methods that combine the simplicity and efficiency of gradient-based techniques with the adaptability and robustness of machine learning approaches.

Result: Identifies clear evolution from traditional to machine learning methods, but current algorithms still face significant performance limitations across varied conditions.

Conclusion: Future research should focus on efficient algorithms with lower computational complexity while maintaining robust performance, which could enhance scalability and broader applicability of biometric systems in security technologies.

Abstract: The study identifies a clear evolution from traditional methods to more advanced machine learning approaches. Current algorithms face persistent challenges, including degraded image quality, damaged ridge structures, and background noise, which impact performance. To overcome these limitations, future research must focus on developing efficient algorithms with lower computational complexity while maintaining robust performance across varied conditions. Hybrid methods that combine the simplicity and efficiency of gradient-based techniques with the adaptability and robustness of machine learning are particularly promising for advancing fingerprint recognition systems. Fingerprint orientation estimation plays a crucial role in improving the reliability and accuracy of biometric systems. This study highlights the limitations of current approaches and underscores the importance of designing next-generation algorithms that can operate efficiently across diverse application domains. By addressing these challenges, future developments could enhance the scalability, reliability, and applicability of biometric systems, paving the way for broader use in security and identification technologies.
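
For context, the classic gradient-based orientation estimator that the survey contrasts with learning-based methods can be written in a few lines; this is the textbook block-wise formula, not code from any specific paper.

```python
import numpy as np

def block_orientation(img, block=16):
    """Classic gradient-based ridge-orientation estimate per block:
    theta = 0.5 * atan2(2 * Gxy, Gxx - Gyy). Textbook method, shown for context."""
    gy, gx = np.gradient(img.astype(np.float64))
    h, w = img.shape
    theta = np.zeros((h // block, w // block))
    for i in range(0, h - block + 1, block):
        for j in range(0, w - block + 1, block):
            bx, by = gx[i:i+block, j:j+block], gy[i:i+block, j:j+block]
            gxx, gyy, gxy = (bx * bx).sum(), (by * by).sum(), (bx * by).sum()
            theta[i // block, j // block] = 0.5 * np.arctan2(2 * gxy, gxx - gyy)
    return theta  # ridge orientation is theta + pi/2 in many conventions
```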

[150] LEGO: Learning and Graph-Optimized Modular Tracker for Online Multi-Object Tracking with Point Clouds

Zhenrong Zhang, Jianan Liu, Yuxuan Xia, Tao Huang, Qing-Long Han, Hongbin Liu

Main category: cs.CV

TL;DR: LEGO tracker combines graph optimization and self-attention for improved multi-object tracking, achieving top performance on KITTI benchmark using LiDAR alone.

DetailsMotivation: Improve data association performance in online multi-object tracking for autonomous systems, addressing limitations in existing tracking-by-detection approaches.

Method: Integrates graph optimization and self-attention mechanisms to formulate association score maps, combined with Kalman filter for state updates and temporal coherence.

Result: Achieved exceptional performance, ranking 1st at submission time and remaining 2nd among all online trackers in KITTI MOT benchmark for cars, outperforming both LiDAR-based and fusion methods.

Conclusion: The proposed LEGO modular tracker demonstrates superior data association capabilities and state-of-the-art performance using LiDAR data alone, making it highly effective for autonomous systems.

Abstract: Online multi-object tracking (MOT) plays a pivotal role in autonomous systems. The state-of-the-art approaches usually employ a tracking-by-detection method, in which data association plays a critical role. This paper proposes a learning and graph-optimized (LEGO) modular tracker to improve data association performance over the existing literature. The proposed LEGO tracker integrates graph optimization and self-attention mechanisms, which efficiently formulate the association score map, facilitating the accurate and efficient matching of objects across time frames. To further enhance the state update process, the Kalman filter is added to ensure consistent tracking by incorporating temporal coherence in the object states. Our proposed method, using LiDAR alone, has shown exceptional performance compared to other online tracking approaches, including both LiDAR-based and LiDAR-camera fusion-based methods. LEGO ranked 1st on the KITTI object tracking evaluation board at the time the results were submitted and remained 2nd among all online trackers in the KITTI MOT benchmark for cars at the time of writing.
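The Kalman-filter state update mentioned in the method is standard; a minimal constant-velocity filter for one tracked object's 3D centre, with assumed noise covariances (the authors' state vector and tuning may differ), is sketched below.

```python
import numpy as np

class ConstantVelocityKF:
    """Minimal constant-velocity Kalman filter for a 3D object centre."""

    def __init__(self, dt: float = 0.1):
        self.x = np.zeros(6)                         # state: [px, py, pz, vx, vy, vz]
        self.P = np.eye(6)                           # state covariance
        self.F = np.eye(6)
        self.F[:3, 3:] = dt * np.eye(3)              # position += velocity * dt
        self.H = np.hstack([np.eye(3), np.zeros((3, 3))])
        self.Q = 0.01 * np.eye(6)                    # process noise (assumed)
        self.R = 0.1 * np.eye(3)                     # measurement noise (assumed)

    def predict(self) -> np.ndarray:
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:3]

    def update(self, z: np.ndarray) -> None:
        y = z - self.H @ self.x                      # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)     # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(6) - K @ self.H) @ self.P
```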

[151] Diffusion Noise Feature: Accurate and Fast Generated Image Detection

Yichi Zhang, Xiaogang Xu

Main category: cs.CV

TL;DR: Proposes Diffusion Noise Feature (DNF) - a novel representation derived from diffusion model inverse process that amplifies high-frequency artifacts to detect generated images with high accuracy and generalization.

DetailsMotivation: Current generated image detection methods suffer from low accuracy and poor generalization, creating risks for misinformation spread despite the creative potential of realistic AI-generated images.

Method: Extracts Diffusion Noise Feature (DNF) from the inverse process of diffusion models to amplify subtle generation artifacts, then trains a simple classifier (ResNet-50) on these features for detection.

Result: Achieves state-of-the-art performance with remarkable accuracy, robustness, and generalization across 4 training datasets and 5 test sets, including detection of images from unseen generators and novel content.

Conclusion: DNF provides a robust basis for differentiating real from generated images by capturing distinct signatures in the diffusion noise domain, establishing a new benchmark for generated image detection.

Abstract: Generative models now produce images with such stunning realism that they can easily deceive the human eye. While this progress unlocks vast creative potential, it also presents significant risks, such as the spread of misinformation. Consequently, detecting generated images has become a critical research challenge. However, current detection methods are often plagued by low accuracy and poor generalization. In this paper, to address these limitations and enhance the detection of generated images, we propose a novel representation, Diffusion Noise Feature (DNF). Derived from the inverse process of diffusion models, DNF effectively amplifies the subtle, high-frequency artifacts that act as fingerprints of artificial generation. Our key insight is that real and generated images exhibit distinct DNF signatures, providing a robust basis for differentiation. By training a simple classifier such as ResNet-50 on DNF, our approach achieves remarkable accuracy, robustness, and generalization in detecting generated images, including those from unseen generators or with novel content. Extensive experiments across four training datasets and five test sets confirm that DNF establishes a new state-of-the-art in generated image detection. The code is available at https://github.com/YichiCS/Diffusion-Noise-Feature.
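The inverse diffusion process that yields the DNF is described only at a high level. The sketch below runs a few deterministic DDIM inversion steps with a hypothetical pretrained noise predictor `eps_model` and treats the resulting latent as the feature fed to a standard classifier; the paper's exact feature definition may differ.

```python
import torch

@torch.no_grad()
def ddim_invert(x0: torch.Tensor, eps_model, alphas_cumprod: torch.Tensor, num_steps: int = 10):
    """Deterministic DDIM inversion; returns the final latent as a crude
    stand-in for the Diffusion Noise Feature."""
    x = x0
    ts = torch.linspace(0, len(alphas_cumprod) - 1, num_steps).long()
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        a_cur, a_next = alphas_cumprod[t_cur], alphas_cumprod[t_next]
        eps = eps_model(x, t_cur)                                # predicted noise
        x0_pred = (x - (1 - a_cur).sqrt() * eps) / a_cur.sqrt()  # predicted clean image
        x = a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps  # re-noise deterministically
    return x  # feed this tensor to a standard classifier, e.g. a ResNet-50
```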

[152] A global optimization SAR image segmentation model can be easily transformed to a general ROF denoising model

Guangming Liu

Main category: cs.CV

TL;DR: Novel locally statistical active contour model for SAR image segmentation using AA denoising and variational level set method, transformed into global optimization via convex relaxation with two efficient solving approaches.

DetailsMotivation: To address SAR image segmentation with intensity inhomogeneity by developing efficient global optimization models that avoid complex PDEs and difference equations.

Method: Proposed LACM based on AA denoising and variational level set, transformed to global optimization using convex relaxation. Developed two fast models: one using proximal function to create ROF model solvable by fast denoising algorithm, and another using different splitting approach.

Result: Experiments on synthetic and Envisat SAR images demonstrated superiority over state-of-the-art models with efficient segmentation performance.

Conclusion: The proposed models provide effective and efficient SAR image segmentation with global optimization capabilities, outperforming existing methods while avoiding complex computational requirements.

Abstract: In this paper, we propose a novel locally statistical active contour model (LACM) based on the Aubert-Aujol (AA) denoising model and the variational level set method, which can be used for SAR image segmentation with intensity inhomogeneity. We then transform the proposed model into a global optimization model using a convex relaxation technique. Firstly, we apply the Split Bregman technique to transform the global optimization model into two alternating optimization processes involving the shrink operator and the Laplace operator, which we call the SB_LACM model. Moreover, we propose two fast models to solve the global optimization model, both more efficient than the SB_LACM model. In the first, we add a proximal function to transform the global optimization model into a general ROF model [29], which can be solved by a fast denoising algorithm proposed by R.-Q. Jia and H. Zhao; [29] was submitted on 29-Aug-2013, our early edition was submitted to TGRS on 12-Jun-2012, and Venkatakrishnan et al. [30] proposed their ‘pnp algorithm’ on 29-May-2013, so Venkatakrishnan et al. and we proposed the ‘pnp algorithm’ almost simultaneously. We thus obtain a fast segmentation algorithm with a global optimization solver that does not involve partial differential equations or difference equations and requires only simple difference computations. In the second, we use a different splitting approach to transform the global optimization model into a differentiable term plus a general ROF model term, which can be solved with the same technique as the first model. Experiments on challenging synthetic images and Envisat SAR images demonstrate the superiority of the proposed models with respect to the state-of-the-art models.
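For reference, the "general ROF model" that the segmentation problem is reduced to is the classical total-variation denoising functional

\[
\min_{u}\;\int_{\Omega} |\nabla u|\,dx \;+\; \frac{\lambda}{2}\int_{\Omega} (u - f)^2\,dx,
\]

where $f$ is the observed image and $\lambda$ weights the data-fidelity term.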

[153] SAR image segmentation algorithms based on I-divergence-TV model

Guangming Liu

Main category: cs.CV

TL;DR: Novel variational active contour model using I-divergence-TV for SAR image segmentation with multiplicative gamma noise, combining edge-based and region-based approaches with fast fixed point algorithm.

DetailsMotivation: To address the challenge of segmenting Synthetic Aperture Radar (SAR) images contaminated by multiplicative gamma noise, which requires robust segmentation that can handle weak/blurred edges and automatically detect boundaries.

Method: Proposes a hybrid edge-based and region-based variational active contour model based on I-divergence-TV. Incorporates global convex segmentation method and split Bregman technique, and develops a fast fixed point algorithm for efficient solution.

Result: Experimental results on synthetic and real SAR images demonstrate that the proposed fast fixed point algorithm is both robust and efficient compared to state-of-the-art approaches.

Conclusion: The proposed model effectively segments SAR images with multiplicative gamma noise, handles weak edges well, automatically detects boundaries, and the developed algorithm provides efficient and robust performance.

Abstract: In this paper, we propose a novel variational active contour model based on the I-divergence-TV model to segment synthetic aperture radar (SAR) images with multiplicative gamma noise, which hybridizes an edge-based model with a region-based model. The proposed model can efficiently stop the contours at weak or blurred edges and can automatically detect the exterior and interior boundaries of images. We incorporate the global convex segmentation method and the split Bregman technique into the proposed model and propose a fast fixed point algorithm to solve the global convex segmentation problem [25]. [25] was submitted on 29-Aug-2013, our early edition was submitted to TGRS on 12-Jun-2012, and Venkatakrishnan et al. [26] proposed their ‘pnp algorithm’ on 29-May-2013, so Venkatakrishnan et al. and we proposed the ‘pnp algorithm’ almost simultaneously. Experimental results on synthetic images and real SAR images show that the proposed fast fixed point algorithm is robust and efficient compared with the state-of-the-art approach.
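For reference, an I-divergence-TV energy of the kind named in the title combines total-variation regularization with the I-divergence (generalized Kullback-Leibler) data term commonly used for multiplicative or Poisson-type noise,

\[
\min_{u>0}\;\int_{\Omega} |\nabla u|\,dx \;+\; \lambda\int_{\Omega} \bigl(u - f\log u\bigr)\,dx,
\]

up to terms independent of $u$; the paper's full energy additionally includes the edge-based and region-based contour terms.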

[154] Active contours driven by local and global intensity fitting energy with application to SAR image segmentation and its fast solvers

Guangming Liu

Main category: cs.CV

TL;DR: Novel variational active contour model combining GAC and ACWE for SAR image segmentation with multiplicative gamma noise, featuring fast fixed-point algorithms.

DetailsMotivation: To develop an efficient segmentation method for SAR images corrupted by multiplicative gamma noise, addressing weak/blurred edges and improving computational speed.

Method: Hybrid geodesic active contour (GAC) with active contours without edges (ACWE) based on Aubert-Aujol denoising model, transformed to ROF model with proximity term. Two fast fixed-point algorithms inspired by Jia-Zhao denoising.

Result: Efficiently stops contours at weak/blurred edges and automatically detects interior/exterior boundaries in SAR images with gamma noise. The algorithms are robust to the initialization contour and about 15% faster than the Goldstein-Osher method.

Conclusion: The proposed model and fast algorithms provide effective and efficient segmentation of SAR images with multiplicative gamma noise, demonstrating improved edge detection and computational performance.

Abstract: In this paper, we propose a novel variational active contour model based on the Aubert-Aujol (AA) denoising model, which hybridizes the geodesic active contour (GAC) model with the active contours without edges (ACWE) model and can be used to segment images corrupted by multiplicative gamma noise. We transform the proposed model into the classic ROF model by adding a proximity term. [26] was submitted on 29-Aug-2013, our early edition was submitted to TGRS on 12-Jun-2012, and Venkatakrishnan et al. [27] proposed their ‘pnp algorithm’ on 29-May-2013, so Venkatakrishnan et al. and we proposed the ‘pnp algorithm’ almost simultaneously. Inspired by a fast denoising algorithm recently proposed by Jia-Zhao, we propose two fast fixed point algorithms to solve the SAR image segmentation problem. Experimental results on real SAR images show that the proposed segmentation model can efficiently stop the contours at weak or blurred edges and can automatically detect the exterior and interior boundaries of images with multiplicative gamma noise. The proposed fast fixed point algorithms are robust to the initialization contour and reduce the time needed by the Goldstein-Osher algorithm by about 15%.

[155] Fusing Echocardiography Images and Medical Records for Continuous Patient Stratification

Nathan Painchaud, Jérémie Stym-Popper, Pierre-Yves Courand, Nicolas Thome, Pierre-Marc Jodoin, Nicolas Duchateau, Olivier Bernard

Main category: cs.CV

TL;DR: Transformer-based multimodal fusion method achieves 96.8% AUROC for hypertension stratification using echocardiogram and clinical data, revealing detailed cardiac function patterns along pathological continuum.

DetailsMotivation: Hypertension presents as a difficult-to-characterize continuum that requires integration of both fine-grained echocardiographic descriptors and global clinical variables for accurate patient assessment and stratification.

Method: Projects each variable into modality-specific representation spaces, then uses Transformer encoder to merge multimodal data into comprehensive patient representation through ordinal classification for pathological continuum learning.

Result: Outstanding 96.8% AUROC performance with limited data (<200 samples), reproducible stratification (5.7% MAE), and emergence of patterns aligning with established hypertension physiology plus novel insights.

Conclusion: The XTab architecture enables effective hypertension characterization from limited multimodal data, providing unprecedented details about hypertension’s impact on cardiac function and paving way for more comprehensive pathology understanding.

Abstract: Deep learning enables automatic and robust extraction of cardiac function descriptors from echocardiographic sequences, such as ejection fraction or strain. These descriptors provide fine-grained information that physicians consider, in conjunction with more global variables from the clinical record, to assess patients’ condition. Drawing on novel Transformer models applied to tabular data, we propose a method that considers all descriptors extracted from medical records and echocardiograms to learn the representation of a cardiovascular pathology with a difficult-to-characterize continuum, namely hypertension. Our method first projects each variable into its own representation space using modality-specific approaches. These standardized representations of multimodal data are then fed to a Transformer encoder, which learns to merge them into a comprehensive representation of the patient through the task of predicting a clinical rating. This stratification task is formulated as an ordinal classification to enforce a pathological continuum in the representation space. We observe the major trends along this continuum on a cohort of 239 hypertensive patients, providing unprecedented details in the description of hypertension’s impact on various cardiac function descriptors. Our analysis shows that i) the XTab foundation model’s architecture allows to reach outstanding performance (96.8% AUROC) even with limited data (less than 200 training samples), ii) stratification across the population is reproducible between trainings (within 5.7% mean absolute error), and iii) patterns emerge in descriptors, some of which align with established physiological knowledge about hypertension, while others could pave the way for a more comprehensive understanding of this pathology. Code is available at https://github.com/creatis-myriad/didactic.
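The ordinal-classification formulation of the stratification task can be realized in several ways. A common one, sketched below under the assumption of $K$ discrete severity levels, decomposes it into $K-1$ cumulative binary problems; the paper's exact head and loss may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OrdinalHead(nn.Module):
    """Ordinal classification as K-1 cumulative binary problems."""

    def __init__(self, dim: int, num_levels: int):
        super().__init__()
        self.fc = nn.Linear(dim, num_levels - 1)

    def forward(self, patient_repr: torch.Tensor) -> torch.Tensor:  # (B, dim)
        return self.fc(patient_repr)                                # (B, K-1) logits of P(y > k)

def ordinal_loss(logits: torch.Tensor, labels: torch.Tensor, num_levels: int) -> torch.Tensor:
    ks = torch.arange(num_levels - 1, device=labels.device)
    targets = (labels.unsqueeze(1) > ks).float()                    # 1 if the true level exceeds threshold k
    return F.binary_cross_entropy_with_logits(logits, targets)

# Usage sketch: head = OrdinalHead(dim=256, num_levels=4)
# loss = ordinal_loss(head(z), labels=torch.tensor([0, 2, 3]), num_levels=4)
```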

[156] ContrastAlign: Toward Robust BEV Feature Alignment via Contrastive Learning for Multi-Modal 3D Object Detection

Ziying Song, Hongyu Pan, Feiyang Jia, Yongchang Zhang, Lin Liu, Lei Yang, Shaoqing Xu, Peiliang Wu, Caiyan Jia, Zheng Zhang, Yadan Luo

Main category: cs.CV

TL;DR: ContrastAlign uses contrastive learning and graph matching to align LiDAR and camera features in BEV 3D object detection, improving robustness to sensor misalignment and achieving state-of-the-art performance.

DetailsMotivation: Existing LiDAR-camera BEV fusion methods suffer from feature misalignment due to imprecise sensor calibration, which causes depth estimation errors and reduces detection accuracy.

Method: Three-component approach: L-Instance extracts LiDAR instance features, C-Instance predicts camera instance features via RoI pooling, and InstanceFusion uses contrastive learning to align heterogeneous modalities followed by graph matching for feature similarity.

Result: Achieves 71.5% mAP on nuScenes val set (1.4% better than GraphBEV), excels under spatial & temporal misalignment noise (1.4-11.1% improvement over BEVFusion), and outperforms GraphBEV by 1.0% on Argoverse2.

Conclusion: ContrastAlign effectively addresses feature misalignment in multi-sensor fusion, demonstrating superior performance especially at longer distances where misalignment is more severe.

Abstract: In the field of 3D object detection, fusing heterogeneous features from LiDAR and camera sensors into a unified Bird’s Eye View (BEV) representation is a widely adopted paradigm. However, existing methods often suffer from imprecise sensor calibration, leading to feature misalignment in LiDAR-camera BEV fusion. Moreover, such inaccuracies cause errors in depth estimation for the camera branch, aggravating misalignment between LiDAR and camera BEV features. In this work, we propose a novel ContrastAlign approach that utilizes contrastive learning to enhance the alignment of heterogeneous modalities, thereby improving the robustness of the fusion process. Specifically, our approach comprises three key components: (1) the L-Instance module, which extracts LiDAR instance features within the LiDAR BEV features; (2) the C-Instance module, which predicts camera instance features through Region of Interest (RoI) pooling on the camera BEV features; (3) the InstanceFusion module, which employs contrastive learning to generate consistent instance features across heterogeneous modalities. Subsequently, we use graph matching to calculate the similarity between neighboring camera instance features and the LiDAR instance features to complete the alignment of instance features. Our method achieves SOTA performance, with an mAP of 71.5%, surpassing GraphBEV by 1.4% on the nuScenes val set. Importantly, our method surpasses BEVFusion under conditions with spatial & temporal misalignment noise, improving mAP by 1.4% and 11.1% on the nuScenes dataset. Notably, on the Argoverse2 dataset, ContrastAlign outperforms GraphBEV by 1.0% in mAP, indicating that the farther the distance, the more severe the feature misalignment and the more effective our method becomes.
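The contrastive objective inside InstanceFusion is not spelled out in the abstract; a generic symmetric InfoNCE loss over matched LiDAR/camera instance features, as sketched below, is one plausible reading (the temperature and the row-wise pairing are assumptions).

```python
import torch
import torch.nn.functional as F

def instance_infonce(lidar_feats: torch.Tensor, cam_feats: torch.Tensor, temperature: float = 0.07):
    """Symmetric InfoNCE over instance features; row i of each tensor is
    assumed to describe the same physical instance."""
    z_l = F.normalize(lidar_feats, dim=-1)                  # (N, D)
    z_c = F.normalize(cam_feats, dim=-1)                    # (N, D)
    logits = z_l @ z_c.t() / temperature                    # (N, N) similarity matrix
    targets = torch.arange(z_l.size(0), device=z_l.device)  # positives sit on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```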

[157] HouseCrafter: Lifting Floorplans to 3D Scenes with 2D Diffusion Model

Hieu T. Nguyen, Yiwen Chen, Vikram Voleti, Varun Jampani, Huaizu Jiang

Main category: cs.CV

TL;DR: HouseCrafter is a novel method that converts 2D floorplans into complete 3D indoor scenes using a 2D diffusion model to generate consistent multi-view RGB-D images, which are then reconstructed into 3D scenes.

DetailsMotivation: To automate the creation of large-scale 3D indoor scenes from simple floorplans, addressing the challenge of generating consistent multi-view imagery for 3D reconstruction.

Method: Adapts a 2D diffusion model trained on web-scale images to generate RGB-D images autoregressively along sampled locations based on the floorplan, using previously generated images as conditions and ensuring consistency through global floorplan attention design.

Result: Demonstrates high-quality house-scale 3D scene generation on the 3D-Front dataset, with ablation studies validating design choices.

Conclusion: HouseCrafter effectively lifts 2D floorplans into complete 3D indoor scenes using diffusion-based multi-view image generation and reconstruction, showing promising results for automated 3D scene creation.

Abstract: We introduce HouseCrafter, a novel approach that can lift a floorplan into a complete large 3D indoor scene (e.g., a house). Our key insight is to adapt a 2D diffusion model, which is trained on web-scale images, to generate consistent multi-view color (RGB) and depth (D) images across different locations of the scene. Specifically, the RGB-D images are generated autoregressively in a batch-wise manner along sampled locations based on the floorplan, where previously generated images are used as conditions for the diffusion model to produce images at nearby locations. The global floorplan and the attention design in the diffusion model ensure the consistency of the generated images, from which a 3D scene can be reconstructed. Through extensive evaluation on the 3D-Front dataset, we demonstrate that HouseCrafter can generate high-quality house-scale 3D scenes. Ablation studies also validate the effectiveness of different design choices. We will release our code and model weights. Project page: https://neu-vi.github.io/houseCrafter/

[158] Unsupervised Anomaly Detection Using Diffusion Trend Analysis for Display Inspection

Eunwoo Kim, Un Yang, Cheol Lae Roh, Stefano Ermon

Main category: cs.CV

TL;DR: Proposes a novel anomaly detection method using reconstruction trend analysis across degradation levels to overcome limitations of diffusion-based approaches in display inspection.

DetailsMotivation: Current reconstruction-based anomaly detection using denoising diffusion models has limitations in determining optimal noise parameters and suffers from significant normal region fluctuations during reconstruction, leading to false detections.

Method: Analyzes reconstruction trends across different degrees of degradation to effectively detect anomalies, addressing both parameter selection and false detection issues.

Result: The method effectively solves the practical application problems in display inspection by providing more reliable anomaly detection through trend analysis.

Conclusion: The proposed reconstruction trend analysis approach overcomes key limitations of diffusion-based anomaly detection methods, making it suitable for practical display inspection applications.

Abstract: Reconstruction-based anomaly detection via denoising diffusion models has limitations in determining noise parameters that degrade anomalies while preserving normal characteristics. In addition, normal regions can fluctuate considerably during reconstruction, resulting in false detections. In this paper, we propose a method that detects anomalies by analyzing the reconstruction trend as a function of the degree of degradation, effectively solving both problems that impede practical application in display inspection.
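One way to operationalize reconstruction-trend analysis is to reconstruct the input at several degradation levels and score each pixel by the slope of its error curve. The sketch below assumes a hypothetical `reconstruct(image, level)` diffusion-based denoiser and is not the authors' exact procedure.

```python
import numpy as np

def anomaly_score_from_trend(image: np.ndarray, reconstruct, noise_levels=(0.1, 0.3, 0.5, 0.7)):
    """Per-pixel least-squares slope of reconstruction error vs. degradation level."""
    errors = np.stack([np.abs(reconstruct(image, t) - image) for t in noise_levels])  # (L, H, W)
    x = np.asarray(noise_levels, dtype=np.float64)
    x = x - x.mean()                                                    # centre the levels
    # slope = sum_k (x_k - x_bar)(e_k - e_bar) / sum_k (x_k - x_bar)^2, computed per pixel
    slope = np.tensordot(x, errors - errors.mean(axis=0), axes=(0, 0)) / (x ** 2).sum()
    return slope  # anomalous pixels are expected to show atypical slopes
```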

[159] Vision Backbone Efficient Selection for Image Classification in Low-Data Regimes

Joris Guerin, Shray Bansal, Amirreza Shaban, Paulo Mann, Harshvardhan Gazula

Main category: cs.CV

TL;DR: Backbone selection for transfer learning is dataset-dependent, especially in low-data scenarios. The paper introduces VIBES - an efficient method to find optimal backbones from large pools using simple search strategies.

DetailsMotivation: Current backbone selection approaches rely on universal benchmarks, but backbone effectiveness varies significantly across datasets, particularly when training data is limited. Exhaustive evaluation of large backbone pools is computationally impractical.

Method: Formalized Vision Backbone Efficient Selection (VIBES) problem and proposed several heuristics for searching high-performing backbones under computational constraints. Tested on four diverse datasets with over 1300 pretrained models.

Result: Simple search strategies can find well-suited backbones that outperform generic benchmark recommendations within just 10 minutes of search time on a single GPU.

Conclusion: Dataset-specific backbone selection is viable and practical for low-data regimes, with efficient search methods enabling optimal backbone discovery without exhaustive computation.

Abstract: Transfer learning has become an essential tool in modern computer vision, allowing practitioners to leverage backbones, pretrained on large datasets, to train successful models from limited annotated data. Choosing the right backbone is crucial, especially for small datasets, since final performance depends heavily on the quality of the initial feature representations. While prior work has conducted benchmarks across various datasets to identify universal top-performing backbones, we demonstrate that backbone effectiveness is highly dataset-dependent, especially in low-data scenarios where no single backbone consistently excels. To overcome this limitation, we introduce dataset-specific backbone selection as a new research direction and investigate its practical viability in low-data regimes. Since exhaustive evaluation is computationally impractical for large backbone pools, we formalize Vision Backbone Efficient Selection (VIBES) as the problem of searching for high-performing backbones under computational constraints. We define the solution space, propose several heuristics, and demonstrate VIBES feasibility for low-data image classification by performing experiments on four diverse datasets. Our results show that even simple search strategies can find well-suited backbones within a pool of over $1300$ pretrained models, outperforming generic benchmark recommendations within just ten minutes of search time on a single GPU (NVIDIA RTX A5000).
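Under a wall-clock budget, even random sampling from the backbone pool with a cheap probe-based evaluation is a valid VIBES strategy. The sketch below assumes a hypothetical `evaluate(name)` callback (e.g. fitting a linear probe on frozen features and returning validation accuracy) and is far simpler than the heuristics studied in the paper.

```python
import random
import time

def vibes_random_search(backbone_pool, evaluate, budget_seconds: float = 600.0):
    """Return the best backbone found before the time budget runs out."""
    best_name, best_acc = None, -1.0
    deadline = time.monotonic() + budget_seconds
    for name in random.sample(list(backbone_pool), len(backbone_pool)):
        if time.monotonic() > deadline:
            break                                    # budget exhausted
        acc = evaluate(name)                         # cheap proxy evaluation
        if acc > best_acc:
            best_name, best_acc = name, acc
    return best_name, best_acc
```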

[160] WHALES: A Multi-Agent Scheduling Dataset for Enhanced Cooperation in Autonomous Driving

Yinsong Wang, Siwei Chen, Ziyi Song, Sheng Zhou

Main category: cs.CV

TL;DR: WHALES is the first large-scale V2X dataset designed for communication-aware cooperative perception, featuring 2.01M annotated 3D objects and detailed communication metadata to benchmark scheduling strategies.

DetailsMotivation: Address the lack of datasets capturing real-world V2X complexity under dynamic communication constraints for cooperative perception research.

Method: Introduce WHALES dataset with 8.4 cooperative agents per scene and communication metadata, plus propose Coverage-Aware Historical Scheduler (CAHS) for agent selection based on historical viewpoint coverage.

Result: WHALES enables rigorous evaluation of scheduling strategies and CAHS improves perception performance over existing state-of-the-art methods.

Conclusion: WHALES bridges the gap between simulated and real-world V2X challenges, providing a robust framework for perception-scheduling co-design and scalability research.

Abstract: Cooperative perception research is hindered by the limited availability of datasets that capture the complexity of real-world Vehicle-to-Everything (V2X) interactions, particularly under dynamic communication constraints. To address this gap, we introduce WHALES (Wireless enhanced Autonomous vehicles with Large number of Engaged agents), the first large-scale V2X dataset explicitly designed to benchmark communication-aware agent scheduling and scalable cooperative perception. WHALES introduces a new benchmark that enables state-of-the-art (SOTA) research in communication-aware cooperative perception, featuring an average of 8.4 cooperative agents per scene and 2.01 million annotated 3D objects across diverse traffic scenarios. It incorporates detailed communication metadata to emulate real-world communication bottlenecks, enabling rigorous evaluation of scheduling strategies. To further advance the field, we propose the Coverage-Aware Historical Scheduler (CAHS), a novel scheduling baseline that selects agents based on historical viewpoint coverage, improving perception performance over existing SOTA methods. WHALES bridges the gap between simulated and real-world V2X challenges, providing a robust framework for exploring perception-scheduling co-design, cross-data generalization, and scalability limits. The WHALES dataset and code are available at https://github.com/chensiweiTHU/WHALES.
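The coverage-aware idea behind CAHS can be illustrated with a greedy max-coverage selection over historical viewpoint footprints. The sketch below assumes each agent's history is summarized as a set of covered grid cells, which is an illustrative assumption rather than the released implementation.

```python
def schedule_agents(candidates, history_coverage, budget: int):
    """Greedily pick up to `budget` agents that add the most new coverage.

    history_coverage: dict mapping agent id -> set of historically covered cells.
    """
    selected, covered = [], set()
    remaining = set(candidates)
    while remaining and len(selected) < budget:
        best = max(remaining, key=lambda a: len(history_coverage[a] - covered))
        if not history_coverage[best] - covered:
            break                                    # no remaining agent adds new coverage
        selected.append(best)
        covered |= history_coverage[best]
        remaining.remove(best)
    return selected
```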

[161] ResFlow: Fine-tuning Residual Optical Flow for Event-based High Temporal Resolution Motion Estimation

Qianang Zhou, Zhiyu Zhu, Junhui Hou, Yongjian Deng, Youfu Li, Junlin Xiong

Main category: cs.CV

TL;DR: A residual-based paradigm for high-temporal-resolution optical flow estimation with event cameras, addressing sparsity and lack of ground truth through two-stage estimation and novel noise-based learning strategies.

DetailsMotivation: Event cameras promise high-temporal-resolution motion estimation but face challenges with event data sparsity and absence of HTR ground-truth data, while existing flow accumulation methods suffer from accumulation errors.

Method: Two-stage approach: global linear motion estimation followed by HTR residual flow refinement. Uses shared refiner for LTR supervision and HTR inference, plus regional noise simulation to adapt from LTR to HTR inference.

Result: Achieves state-of-the-art accuracy in both LTR and HTR metrics, demonstrating effectiveness and superiority over existing approaches.

Conclusion: The residual paradigm effectively mitigates event sparsity impacts and works with any LTR algorithm, while the noise-based strategy enables both supervised and self-supervised training for HTR optical flow.

Abstract: Event cameras hold significant promise for high-temporal-resolution (HTR) motion estimation. However, estimating event-based HTR optical flow faces two key challenges: the absence of HTR ground-truth data and the intrinsic sparsity of event data. Most existing approaches rely on the flow accumulation paradigms to indirectly supervise intermediate flows, often resulting in accumulation errors and optimization difficulties. To address these challenges, we propose a residual-based paradigm for estimating HTR optical flow with event data. Our approach separates HTR flow estimation into two stages: global linear motion estimation and HTR residual flow refinement. The residual paradigm effectively mitigates the impacts of event sparsity on optimization and is compatible with any LTR algorithm. Next, to address the challenge posed by the absence of HTR ground truth, we incorporate novel learning strategies. Specifically, we initially employ a shared refiner to estimate the residual flows, enabling both LTR supervision and HTR inference. Subsequently, we introduce regional noise to simulate the residual patterns of intermediate flows, facilitating the adaptation from LTR supervision to HTR inference. Additionally, we show that the noise-based strategy supports in-domain self-supervised training. Comprehensive experimental results demonstrate that our approach achieves state-of-the-art accuracy in both LTR and HTR metrics, highlighting its effectiveness and superiority.

[162] Benchmarking Federated Learning for Semantic Datasets: Federated Scene Graph Generation

SeungBum Ha, Taehwan Lee, Jiyoun Lim, Sung Whan Yoon

Main category: cs.CV

TL;DR: Proposes a new federated learning benchmark framework for complex multi-semantic vision tasks with controllable semantic heterogeneity across clients, addressing limitations of existing FL benchmarks that only handle simple classification tasks.

DetailsMotivation: Existing FL benchmarks focus on simple classification tasks with one-hot labels, but lack support for complex semantic scenarios where samples contain diverse semantic information like object relations. Managing semantic heterogeneity across clients in FL settings is challenging and not addressed by current benchmarks.

Method: Two-step benchmark process: (1) data clustering with semantics, and (2) data distribution via controllable semantic heterogeneity across clients. Constructed a federated PSG (Panoptic Scene Graph) benchmark as proof of concept to evaluate existing PSG methods in FL settings.

Result: Successfully created the first FL benchmark framework for multi-semantic vision tasks with controlled semantic heterogeneity. Demonstrated effectiveness by applying robust federated learning algorithms that showed increased performance on data heterogeneity.

Conclusion: This work provides the first benchmark framework enabling federated learning and evaluation for complex multi-semantic vision tasks under controlled semantic heterogeneity, filling a critical gap in FL research for handling complicated semantic scenarios.

Abstract: Federated learning (FL) enables decentralized training while preserving data privacy, yet existing FL benchmarks address relatively simple classification tasks, where each sample is annotated with a one-hot label. However, little attention has been paid to demonstrating an FL benchmark that handles complicated semantics, where each sample encompasses diverse semantic information, such as relations between objects. Because the existing benchmarks are designed to distribute data in a narrow view of a single semantic, managing the complicated semantic heterogeneity across clients when formalizing FL benchmarks is non-trivial. In this paper, we propose a benchmark process to establish an FL benchmark with controllable semantic heterogeneity across clients; its two key steps are (i) data clustering with semantics and (ii) data distribution with controllable semantic heterogeneity across clients. As a proof of concept, we construct a federated PSG benchmark, demonstrating the efficacy of existing PSG methods in an FL setting with controllable semantic heterogeneity of scene graphs. We also demonstrate the effectiveness of our benchmark by applying federated learning algorithms that are robust to data heterogeneity, showing increased performance. To our knowledge, this is the first benchmark framework that enables federated learning and its evaluation for multi-semantic vision tasks under controlled semantic heterogeneity. Our code is available at https://github.com/Seung-B/FL-PSG.
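Step (ii), distributing data with controllable semantic heterogeneity, is often implemented with a Dirichlet prior over clients for each semantic cluster. The sketch below is such a generic partitioner, where the `alpha` knob controls heterogeneity (small `alpha` gives each client few clusters); it is not necessarily the benchmark's exact procedure.

```python
import numpy as np

def partition_by_semantics(cluster_ids, num_clients: int, alpha: float = 0.5, seed: int = 0):
    """Assign sample indices to clients with Dirichlet-controlled semantic skew."""
    rng = np.random.default_rng(seed)
    cluster_ids = np.asarray(cluster_ids)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(cluster_ids):
        idx = np.flatnonzero(cluster_ids == c)
        rng.shuffle(idx)
        props = rng.dirichlet(alpha * np.ones(num_clients))       # share of cluster c per client
        splits = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for client, part in enumerate(np.split(idx, splits)):
            client_indices[client].extend(part.tolist())
    return client_indices
```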

[163] Image Augmentation Agent for Weakly Supervised Semantic Segmentation

Wangyu Wu, Xianglin Qiu, Siqi Song, Zhenhong Chen, Xiaowei Huang, Fei Ma, Jimin Xiao

Main category: cs.CV

TL;DR: Image Augmentation Agent (IAA) enhances weakly-supervised semantic segmentation by automatically generating diverse training images using LLMs and diffusion models, achieving state-of-the-art results.

DetailsMotivation: Most WSSS methods focus on network structures and loss functions while overlooking dataset limitations. More diverse training images can provide richer information and help models understand comprehensive semantic patterns.

Method: Developed IAA with LLM-based prompt generation with self-refinement mechanism and diffusion models with online filtering to automatically generate high-quality, balanced additional training images.

Result: Significantly surpasses state-of-the-art WSSS approaches on PASCAL VOC 2012 and MS COCO 2014 datasets.

Conclusion: Enhancing WSSS from a data generation perspective through automated image augmentation is effective and provides substantial performance improvements.

Abstract: Weakly-supervised semantic segmentation (WSSS) has achieved remarkable progress using only image-level labels. However, most existing WSSS methods focus on designing new network structures and loss functions to generate more accurate dense labels, overlooking the limitations imposed by fixed datasets, which can constrain performance improvements. We argue that more diverse training images provide WSSS with richer information and help the model understand more comprehensive semantic patterns. Therefore, in this paper, we introduce a novel approach called the Image Augmentation Agent (IAA), which shows that it is possible to enhance WSSS from the data generation perspective. IAA mainly designs an augmentation agent that leverages large language models (LLMs) and diffusion models to automatically generate additional images for WSSS. In practice, to address the instability in prompt generation by LLMs, we develop a prompt self-refinement mechanism. It allows LLMs to re-evaluate the rationality of generated prompts to produce more coherent prompts. Additionally, we insert an online filter into the diffusion generation process to dynamically ensure the quality and balance of generated images. Experimental results show that our method significantly surpasses state-of-the-art WSSS approaches on the PASCAL VOC 2012 and MS COCO 2014 datasets.

[164] Spatially-guided Temporal Aggregation for Robust Event-RGB Optical Flow Estimation

Qianang Zhou, Junhui Hou, Meiyi Yang, Yongjian Deng, Youfu Li, Junlin Xiong

Main category: cs.CV

TL;DR: Novel cross-modal fusion framework that uses frame data to guide event data aggregation for optical flow estimation, achieving state-of-the-art performance with improved accuracy and efficiency.

DetailsMotivation: Current optical flow methods don't fully leverage the complementary strengths of frame data (stable appearance) and event data (high-temporal-resolution motion cues), relying on simple stacking rather than effective fusion.

Method: Proposes event-enhanced frame representation, transformer-based fusion module, and mix-fusion encoder to guide event aggregation with frame information and extract comprehensive spatiotemporal features.

Result: Achieves 10% accuracy improvement over event-only models, 4% accuracy gain over state-of-the-art fusion methods, and 45% reduction in inference time on DSEC-Flow dataset.

Conclusion: The framework successfully leverages complementary frame and event modalities through guided fusion, demonstrating superior optical flow estimation performance and efficiency.

Abstract: Current optical flow methods exploit the stable appearance of frame (or RGB) data to establish robust correspondences across time. Event cameras, on the other hand, provide high-temporal-resolution motion cues and excel in challenging scenarios. These complementary characteristics underscore the potential of integrating frame and event data for optical flow estimation. However, most cross-modal approaches fail to fully utilize the complementary advantages, relying instead on simply stacking information. This study introduces a novel approach that uses a spatially dense modality to guide the aggregation of the temporally dense event modality, achieving effective cross-modal fusion. Specifically, we propose an event-enhanced frame representation that preserves the rich texture of frames and the basic structure of events. We use the enhanced representation as the guiding modality and employ events to capture temporally dense motion information. The robust motion features derived from the guiding modality direct the aggregation of motion information from events. To further enhance fusion, we propose a transformer-based module that complements sparse event motion features with spatially rich frame information and enhances global information propagation. Additionally, a mix-fusion encoder is designed to extract comprehensive spatiotemporal contextual features from both modalities. Extensive experiments on the MVSEC and DSEC-Flow datasets demonstrate the effectiveness of our framework. Leveraging the complementary strengths of frames and events, our method achieves leading performance on the DSEC-Flow dataset. Compared to the event-only model, frame guidance improves accuracy by 10%. Furthermore, it outperforms the state-of-the-art fusion-based method with a 4% accuracy gain and a 45% reduction in inference time.

[165] MMHMER:Multi-viewer and Multi-task for Handwritten Mathematical Expression Recognition

Kehua Chen, Haoyang Shen, Lifan Zhong, Mingyi Chen

Main category: cs.CV

TL;DR: Proposes MMHMER, a multi-view multi-task framework combining CNN and Transformer architectures for handwritten math expression recognition, achieving state-of-the-art performance on CROHME datasets.

DetailsMotivation: Existing HMER approaches use either CNN/RNN-GRU or Transformer architectures, each with distinct strengths and weaknesses. There's a need to effectively integrate these complementary capabilities to improve recognition performance.

Method: Efficient CNN-Transformer multi-viewer, multi-task approach that leverages CNN’s feature extraction and Transformer’s sequence modeling capabilities through a novel fusion framework.

Result: Achieved 63.96%, 62.51%, and 65.46% ExpRate on CROHME14, CROHME16, and CROHME19 datasets, outperforming Posformer by 1.28%, 1.48%, and 0.58% absolute gains respectively.

Conclusion: The multi-view multi-task framework successfully integrates CNN and Transformer strengths, demonstrating superior performance in handling handwritten mathematical expression complexity with appropriate computational complexity.

Abstract: Handwritten Mathematical Expression Recognition (HMER) methods have made remarkable progress, with most existing HMER approaches based on either a hybrid CNN/RNN-based with GRU architecture or Transformer architectures. Each of these has its strengths and weaknesses. Leveraging different model structures as viewers and effectively integrating their diverse capabilities presents an intriguing avenue for exploration. This involves addressing two key challenges: 1) How to fuse these two methods effectively, and 2) How to achieve higher performance under an appropriate level of complexity. This paper proposes an efficient CNN-Transformer multi-viewer, multi-task approach to enhance the model’s recognition performance. Our MMHMER model achieves 63.96%, 62.51%, and 65.46% ExpRate on CROHME14, CROHME16, and CROHME19, outperforming Posformer with an absolute gain of 1.28%, 1.48%, and 0.58%. The main contribution of our approach is that we propose a new multi-view, multi-task framework that can effectively integrate the strengths of CNN and Transformer. By leveraging the feature extraction capabilities of CNN and the sequence modeling capabilities of Transformer, our model can better handle the complexity of handwritten mathematical expressions.

[166] Vector-Quantized Vision Foundation Models for Object-Centric Learning

Rongzhen Zhao, Vivienne Wang, Juho Kannala, Joni Pajarinen

Main category: cs.CV

TL;DR: VQ-VFM-OCL (VVO) is a unified architecture that uses shared quantization of Vision Foundation Model representations to improve Object-Centric Learning, achieving better performance in object discovery, recognition, and downstream tasks.

DetailsMotivation: Existing methods using Vision Foundation Models (VFMs) for Object-Centric Learning fail to fully exploit VFM potential and struggle with complex object textures in self-supervised reconstruction.

Method: Proposes VQ-VFM-OCL architecture with shared quantization of VFM representations in both aggregation and decoding phases of OCL, creating a unified framework.

Result: Consistently outperforms baselines across different VFMs, aggregators, and decoders in object discovery, recognition, and downstream visual prediction and reasoning tasks.

Conclusion: Shared quantization of VFM representations strengthens OCL supervision and facilitates better aggregation, providing a unified approach that leverages VFMs more effectively for object-centric learning.

Abstract: Object-Centric Learning (OCL) aggregates image or video feature maps into object-level feature vectors, termed slots. Its self-supervision of reconstructing the input from slots struggles with complex object textures, so Vision Foundation Model (VFM) representations are used as the aggregation input and reconstruction target. Existing methods leverage VFM representations in diverse ways yet fail to fully exploit their potential. In response, we propose a unified architecture, Vector-Quantized VFMs for OCL (VQ-VFM-OCL, or VVO). The key to our unification is simply the shared quantization of VFM representations in OCL aggregation and decoding. Experiments show that across different VFMs, aggregators, and decoders, our VVO consistently outperforms baselines in object discovery and recognition, as well as downstream visual prediction and reasoning. We also mathematically analyze why VFM representations facilitate OCL aggregation and why their shared quantization as reconstruction targets strengthens OCL supervision. Our source code and model checkpoints are available at https://github.com/Genera1Z/VQ-VFM-OCL.
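The shared quantization at the heart of VVO is, at its core, nearest-neighbour vector quantization with a straight-through gradient. A minimal sketch (omitting codebook learning and commitment losses) follows.

```python
import torch

def quantize(features: torch.Tensor, codebook: torch.Tensor):
    """Nearest-neighbour vector quantization with a straight-through estimator.

    features: (N, D) continuous representations; codebook: (K, D) code vectors.
    """
    distances = torch.cdist(features, codebook)           # (N, K) pairwise distances
    idx = distances.argmin(dim=1)                         # nearest code per feature
    quantized = codebook[idx]
    # Forward pass uses the discrete codes; gradients flow to the continuous features.
    return features + (quantized - features).detach(), idx
```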

[167] Towards Vision Zero: The TUM Traffic Accid3nD Dataset

Walter Zimmer, Ross Greer, Daniel Lehmberg, Marc Pavel, Holger Caesar, Xingcheng Zhou, Ahmed Ghita, Mohan Trivedi, Rui Song, Hu Cao, Akshay Gopalkrishnan, Alois C. Knoll

Main category: cs.CV

TL;DR: TUMTraf-Accid3nD dataset provides 3D annotations of real-world highway accidents from roadside cameras and LiDARs, with over 2.6M labeled objects and 111K frames, enabling accident detection research.

DetailsMotivation: Accidents are unavoidable in transportation networks, but no public dataset exists with 3D annotations of real-world accidents from roadside sensors, limiting research on accident understanding and prevention.

Method: Collection of real-world highway accidents recorded from four roadside cameras and LiDARs at 25Hz, with comprehensive 2D/3D bounding boxes, instance masks, and track IDs in OpenLABEL format. Proposed accident detection combines rule-based and learning-based approaches.

Result: Dataset contains 111,945 labeled frames with 2,634,233 labeled objects across six classes, covering various weather and lighting conditions. Experiments show robustness of the proposed accident detection method.

Conclusion: The TUMTraf-Accid3nD dataset fills a critical gap in accident research by providing comprehensive 3D annotations of real-world accidents, enabling development of more effective accident detection and prevention systems.

Abstract: Even though a significant amount of work has been done to increase the safety of transportation networks, accidents still occur regularly. They must be understood as unavoidable and sporadic outcomes of traffic networks. No public dataset contains 3D annotations of real-world accidents recorded from roadside camera and LiDAR sensors. We present the TUM Traffic Accid3nD (TUMTraf-Accid3nD) dataset, a collection of real-world highway accidents in different weather and lighting conditions. It contains vehicle crashes at high-speed driving with 2,634,233 labeled 2D bounding boxes, instance masks, and 3D bounding boxes with track IDs. In total, the dataset contains 111,945 labeled image and point cloud frames recorded from four roadside cameras and LiDARs at 25 Hz. The dataset contains six object classes and is provided in the OpenLABEL format. We propose an accident detection model that combines a rule-based approach with a learning-based one. Experiments and ablation studies on our dataset show the robustness of our proposed method. The dataset, model, and code are available on our website: https://accident-dataset.github.io.

[168] Blending 3D Geometry and Machine Learning for Multi-View Stereopsis

Vibhas Vats, Md. Alimoor Reza, David Crandall, Soon-heung Jung

Main category: cs.CV

TL;DR: GC MVSNet++ integrates multi-view, multi-scale geometric consistency checks during learning to accelerate training and improve 3D reconstruction accuracy, achieving state-of-the-art results on major benchmarks.

DetailsMotivation: Traditional MVS methods rely on photometric/geometric consistency, while learning-based methods only use geometric consistency as post-processing without impacting the learning process itself.

Method: Introduces active geometric consistency enforcement during learning with multi-view, multi-scale supervision, plus a densely connected cost regularization network with simple and feature-dense block designs.

Result: Achieves state-of-the-art on DTU and BlendedMVS datasets, second place on Tanks and Temples benchmark, and halves training iterations compared to other MVS methods.

Conclusion: GC MVSNet++ is the first method to enforce multi-view, multi-scale supervised geometric consistency during learning, significantly improving training efficiency and reconstruction quality.

Abstract: Traditional multi-view stereo (MVS) methods primarily depend on photometric and geometric consistency constraints. In contrast, modern learning-based algorithms often rely on the plane sweep algorithm to infer 3D geometry, applying explicit geometric consistency (GC) checks only as a post-processing step, with no impact on the learning process itself. In this work, we introduce GC MVSNet++, a novel approach that actively enforces geometric consistency of reference-view depth maps across multiple source views (multi-view) and at various scales (multi-scale) during the learning phase. This integrated GC check significantly accelerates the learning process by directly penalizing geometrically inconsistent pixels, effectively halving the number of training iterations compared to other MVS methods. Furthermore, we introduce a densely connected cost regularization network with two distinct block designs, simple and feature-dense, optimized to harness dense feature connections for enhanced regularization. Extensive experiments demonstrate that our approach achieves a new state of the art on the DTU and BlendedMVS datasets and secures second place on the Tanks and Temples benchmark. To our knowledge, GC MVSNet++ is the first method to enforce multi-view, multi-scale supervised geometric consistency during learning. Our code is available.

[169] AutoComPose: Automatic Generation of Pose Transition Descriptions for Composed Pose Retrieval Using Multimodal LLMs

Yi-Ting Shen, Sungmin Eum, Doheon Lee, Rohit Shete, Chiao-Yi Wang, Heesung Kwon, Shuvra S. Bhattacharyya

Main category: cs.CV

TL;DR: AutoComPose uses MLLMs to automatically generate structured pose transition descriptions for composed pose retrieval, reducing annotation costs while improving performance.

DetailsMotivation: Progress in composed pose retrieval is limited by scarce and inconsistent annotated pose transitions from costly human annotations or heuristic methods.

Method: Leverages multimodal LLMs to generate structured transitions with fine-grained body part movements, mirrored/swapped variations, and cyclic consistency constraints.

Result: Training retrieval models with AutoComPose outperforms human-annotated and heuristic methods, with benchmarks AIST-CPR and PoseFixCPR showing superior performance.

Conclusion: AutoComPose pioneers automatic annotation of pose transitions, establishing a scalable foundation for future CPR research with reduced costs and improved quality.

Abstract: Composed pose retrieval (CPR) enables users to search for human poses by specifying a reference pose and a transition description, but progress in this field is hindered by the scarcity and inconsistency of annotated pose transitions. Existing CPR datasets rely on costly human annotations or heuristic-based rule generation, both of which limit scalability and diversity. In this work, we introduce AutoComPose, the first framework that leverages multimodal large language models (MLLMs) to automatically generate rich and structured pose transition descriptions. Our method enhances annotation quality by structuring transitions into fine-grained body part movements and introducing mirrored/swapped variations, while a cyclic consistency constraint ensures logical coherence between forward and reverse transitions. To advance CPR research, we construct and release two dedicated benchmarks, AIST-CPR and PoseFixCPR, supplementing prior datasets with enhanced attributes. Extensive experiments demonstrate that training retrieval models with AutoComPose yields superior performance over human-annotated and heuristic-based methods, significantly reducing annotation costs while improving retrieval quality. Our work pioneers the automatic annotation of pose transitions, establishing a scalable foundation for future CPR research.

[170] Geo4D: Leveraging Video Generators for Geometric 4D Scene Reconstruction

Zeren Jiang, Chuanxia Zheng, Iro Laina, Diane Larlus, Andrea Vedaldi

Main category: cs.CV

TL;DR: Geo4D repurposes video diffusion models for monocular 3D reconstruction of dynamic scenes, using synthetic training data that generalizes to real data, outperforming state-of-the-art methods.

DetailsMotivation: To leverage large-scale pre-trained video models' dynamic priors for accurate 4D reconstruction of dynamic scenes from monocular video, overcoming limitations of existing video depth estimation methods.

Method: Predicts multiple geometric modalities (point, disparity, and ray maps), uses multi-modal alignment algorithm to fuse them, and employs sliding window approach for long video reconstruction.

Result: Significantly surpasses state-of-the-art video depth estimation methods across multiple benchmarks, demonstrating robust and accurate 4D reconstruction.

Conclusion: Geo4D effectively repurposes video diffusion models for dynamic scene reconstruction, showing strong generalization from synthetic to real data with superior performance over existing methods.

Abstract: We introduce Geo4D, a method to repurpose video diffusion models for monocular 3D reconstruction of dynamic scenes. By leveraging the strong dynamic priors captured by large-scale pre-trained video models, Geo4D can be trained using only synthetic data while generalizing well to real data in a zero-shot manner. Geo4D predicts several complementary geometric modalities, namely point, disparity, and ray maps. We propose a new multi-modal alignment algorithm to align and fuse these modalities, as well as a sliding window approach at inference time, thus enabling robust and accurate 4D reconstruction of long videos. Extensive experiments across multiple benchmarks show that Geo4D significantly surpasses state-of-the-art video depth estimation methods.

[171] DNF-Avatar: Distilling Neural Fields for Real-time Animatable Avatar Relighting

Zeren Jiang, Shaofei Wang, Siyu Tang

Main category: cs.CV

TL;DR: Distilling neural field knowledge to Gaussian splatting for real-time relightable human avatars from monocular videos

DetailsMotivation: Existing methods using neural fields with physically based rendering suffer from slow rendering speeds due to expensive Monte Carlo ray tracing, limiting real-time applications

Method: Knowledge distillation from implicit neural fields (teacher) to explicit 2D Gaussian splatting (student) representation, using split-sum approximation for PBR appearance and novel part-wise ambient occlusion probes for shadow computation

Result: Achieves comparable or better relighting results than teacher model while being 370 times faster at inference time, reaching 67 FPS rendering speed

Conclusion: The proposed approach enables high-quality real-time relighting of human avatars with realistic shadow effects, making it suitable for VR, sports, and gaming applications

Abstract: Creating relightable and animatable human avatars from monocular videos is a rising research topic with a range of applications, e.g. virtual reality, sports, and video games. Previous works utilize neural fields together with physically based rendering (PBR) to estimate geometry and disentangle appearance properties of human avatars. However, one drawback of these methods is the slow rendering speed due to the expensive Monte Carlo ray tracing. To tackle this problem, we propose to distill the knowledge from implicit neural fields (teacher) to an explicit 2D Gaussian splatting (student) representation to take advantage of the fast rasterization property of Gaussian splatting. To avoid ray tracing, we employ the split-sum approximation for PBR appearance. We also propose novel part-wise ambient occlusion probes for shadow computation. Shadow prediction is achieved by querying these probes only once per pixel, which paves the way for real-time relighting of avatars. These techniques combined give high-quality relighting results with realistic shadow effects. Our experiments demonstrate that the proposed student model achieves comparable or even better relighting results than our teacher model while being 370 times faster at inference time, achieving a 67 FPS rendering speed.

[172] EmoSEM: Segment and Explain Emotion Stimuli in Visual Art

Jing Zhang, Dan Guo, Zhangbin Li, Meng Wang

Main category: cs.CV

TL;DR: EmoSEM model for pixel-level emotion segmentation and explanation in art images, addressing subjectivity of emotion and abstract art expression through emotional prompts, emotion projectors, and prefix adapters.

DetailsMotivation: Current segmentation models struggle with emotion-oriented tasks due to emotion subjectivity, and captioning models fail to balance pixel-level semantics with emotion reasoning in abstract art expressions.

Method: Proposes EmoSEM with emotional prompt with learnable mask token, emotion projector for emotion-visual association, lightweight prefix adapter for emotion-mask fusion, and joint visual-mask-emotion tokens for language model explanation generation.

Result: End-to-end modeling from pixel features to emotion interpretation, delivering first interpretable fine-grained framework for visual emotion analysis with validated effectiveness through extensive experiments.

Conclusion: Successfully addresses dual challenges of emotion subjectivity and abstract art expression, enabling precise emotion-triggering region segmentation and coherent emotional explanations in art images.

Abstract: This paper focuses on a key challenge in visual emotion understanding: given an art image, the model pinpoints pixel regions that trigger a specific human emotion, and generates linguistic explanations for it. Despite advances in general segmentation, pixel-level emotion understanding still faces a dual challenge: first, the subjectivity of emotion makes it difficult for general segmentation models such as SAM to adapt to emotion-oriented segmentation tasks; and second, the abstract nature of art expression makes it hard for captioning models to balance pixel-level semantics and emotion reasoning. To solve the above problems, this paper proposes the Emotion stimuli Segmentation and Explanation Model (EmoSEM) to endow the segmentation framework with emotion comprehension capability. First, to enable the model to perform segmentation well under the guidance of emotional intent, we introduce an emotional prompt with a learnable mask token as the conditional input for segmentation decoding. Then, we design an emotion projector to establish the association between emotion and visual features. Next, more importantly, to address emotion-visual stimuli alignment, we develop a lightweight prefix adapter, a module that fuses the learned emotional mask with the corresponding emotion into a unified representation compatible with the language model. Finally, we input the joint visual, mask, and emotional tokens into the language model and output the emotional explanations. This ensures that the generated interpretations remain semantically and emotionally coherent with the visual stimuli. Our method realizes end-to-end modeling from low-level pixel features to high-level emotion interpretation, delivering the first interpretable fine-grained framework for visual emotion analysis. Extensive experiments validate the effectiveness of our model. Code will be made publicly available.

[173] Beyond the Horizon: Decoupling Multi-View UAV Action Recognition via Partial Order Transfer

Wenxuan Liu, Zhuo Zhou, Xuemei Jia, Siyuan Yang, Wenxin Huang, Xian Zhong, Chia-Wen Lin

Main category: cs.CV

TL;DR: POG-MVNet addresses UAV action recognition challenges by modeling hierarchical view structures across altitudes, using partial order guidance to transfer knowledge from easier to harder views, achieving state-of-the-art performance.

DetailsMotivation: UAV action recognition faces unique challenges due to significant view variations along vertical spatial axis, with recognition accuracy decreasing as altitude increases, creating a partial order among views.

Method: Proposes POG-MVNet with three modules: View Partition (groups views by altitude using head-to-body ratio), Order-aware Feature Decoupling (disentangles action-relevant and view-specific features), and Action Partial Order Guide (transfers knowledge from easier to more challenging views).

Result: Achieves 4.7% improvement on Drone-Action and 3.5% improvement on UAV dataset compared to state-of-the-art methods ASAT and FAR.

Conclusion: Explicitly modeling the hierarchical structure of UAV views through partial order guidance effectively addresses drastic view variations and improves recognition performance across different altitudes.

Abstract: Action recognition in unmanned aerial vehicles (UAVs) poses unique challenges due to significant view variations along the vertical spatial axis. Unlike traditional ground-based settings, UAVs capture actions at a wide range of altitudes, resulting in considerable appearance discrepancies. We introduce a multi-view formulation tailored to varying UAV altitudes and empirically observe a partial order among views, where recognition accuracy consistently decreases as altitude increases. This observation motivates a novel approach that explicitly models the hierarchical structure of UAV views to improve recognition performance across altitudes. To this end, we propose the Partial Order Guided Multi-View Network (POG-MVNet), designed to address drastic view variations by effectively leveraging view-dependent information across different altitude levels. The framework comprises three key components: a View Partition (VP) module, which uses the head-to-body ratio to group views by altitude; an Order-aware Feature Decoupling (OFD) module, which disentangles action-relevant and view-specific features under partial order guidance; and an Action Partial Order Guide (APOG), which uses the partial order to transfer informative knowledge from easier views to more challenging ones. We conduct experiments on Drone-Action, MOD20, and UAV, demonstrating that POG-MVNet significantly outperforms competing methods. For example, POG-MVNet achieves a 4.7% improvement on Drone-Action and a 3.5% improvement on UAV compared to state-of-the-art methods ASAT and FAR. Code will be released soon.
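
The View Partition idea can be illustrated with a toy grouping rule: samples are bucketed into altitude groups by the apparent head-to-body ratio of the detected person. The thresholds and the ratio-to-altitude mapping below are placeholders, not values from the paper.

```python
def partition_views(head_to_body_ratios, thresholds=(0.3, 0.6)):
    """Bucket samples into altitude groups from the apparent head-to-body
    ratio of detected persons. Both the thresholds and the ratio-to-altitude
    mapping are illustrative; the paper defines its own partition rule."""
    groups = []
    for r in head_to_body_ratios:
        if r < thresholds[0]:
            groups.append("group_0")
        elif r < thresholds[1]:
            groups.append("group_1")
        else:
            groups.append("group_2")
    return groups
```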

[174] Enhancing Visual Reliance in Text Generation: A Bayesian Perspective on Mitigating Hallucination in Large Vision-Language Models

Nanxing Hu, Xiaoyue Duan, Jinchao Zhang, Guoliang Kang

Main category: cs.CV

TL;DR: A Bayesian approach to mitigate hallucinations in Large Vision-Language Models by removing redundant visual tokens, rectifying prior information, and stopping generation when visual reliance collapses.

DetailsMotivation: LVLMs often generate texts that don't match visual inputs (hallucination), limiting real-world applicability. Previous works don't systematically enhance visual reliance in text generation.

Method: Three-pronged Bayesian approach: 1) Remove redundant visual tokens, 2) Rectify inappropriate prior information, 3) Stop generation when posterior collapses to prior distribution independent of visual tokens.

Result: Extensive experiments on POPE, CHAIR, and MME benchmarks show consistent mitigation of hallucination issues and favorable performance against previous state-of-the-art methods.

Conclusion: The proposed Bayesian framework effectively addresses visual reliance degradation in LVLMs and provides a systematic solution to reduce hallucinations in text generation.

Abstract: Large Vision-Language Models (LVLMs) often generate text that is contextually coherent but does not match the visual input. Such a hallucination issue hinders LVLMs’ applicability in the real world. The key to solving hallucination in LVLM is to make the text generation rely more on the visual content. Most previous works choose to enhance/adjust the features/output of a specific modality (i.e., visual or textual) to alleviate hallucinations in LVLM, which do not explicitly or systematically enhance the visual reliance. In this paper, we comprehensively investigate the factors which may degenerate the visual reliance in text generation of LVLM from a Bayesian perspective. Based on our observations, we propose to mitigate hallucination in LVLM from three aspects. Firstly, we observe that not all visual tokens are informative in generating meaningful texts. We propose to evaluate and remove redundant visual tokens to avoid their disturbance. Secondly, LVLM may encode inappropriate prior information, making it lean toward generating unexpected words. We propose a simple yet effective way to rectify the prior from a Bayesian perspective. Thirdly, we observe that starting from certain steps, the posterior of next-token prediction conditioned on visual tokens may collapse to a prior distribution which does not depend on any informative visual tokens at all. Thus, we propose to stop further text generation to avoid hallucination. Extensive experiments on three benchmarks including POPE, CHAIR, and MME demonstrate that our method can consistently mitigate the hallucination issue of LVLM and performs favorably against previous state-of-the-art methods.
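
The third aspect (stopping generation once the posterior collapses to the prior) can be sketched as a simple divergence test between next-token distributions computed with and without the visual tokens; the threshold and the KL-based criterion are assumptions for illustration rather than the paper's exact rule.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def visual_reliance_collapsed(logits_with_vision, logits_without_vision, tau=0.05):
    """Heuristic stopping check: if the next-token distribution conditioned on
    the visual tokens is nearly identical to the one obtained without them,
    the current step no longer relies on the image and generation can stop."""
    log_posterior = F.log_softmax(logits_with_vision, dim=-1)   # (B, vocab)
    prior = F.softmax(logits_without_vision, dim=-1)            # (B, vocab)
    kl = F.kl_div(log_posterior, prior, reduction="batchmean")  # KL(prior || posterior)
    return kl.item() < tau
```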

[175] ReservoirTTA: Prolonged Test-time Adaptation for Evolving and Recurring Domains

Guillaume Vray, Devavrat Tomar, Xufeng Gao, Jean-Philippe Thiran, Evan Shelhamer, Behzad Bozorgtabar

Main category: cs.CV

TL;DR: ReservoirTTA is a plug-in framework for prolonged test-time adaptation that uses a reservoir of domain-specialized models to handle continuously shifting test domains, preventing catastrophic forgetting and maintaining stable performance.

DetailsMotivation: Single-model test-time adaptation suffers from catastrophic forgetting, inter-domain interference, and error accumulation in scenarios where test domains continuously shift over time, including recurring or gradually evolving domains.

Method: Maintains a reservoir of domain-specialized models that detect new domains via online clustering over style features and route samples to appropriate specialized models for domain-specific adaptation.

Result: Significantly improves adaptation accuracy and maintains stable performance across prolonged, recurring domain shifts on ImageNet-C, CIFAR-10/100-C, and Cityscapes→ACDC semantic segmentation, outperforming state-of-the-art methods.

Conclusion: ReservoirTTA provides a robust solution for prolonged test-time adaptation with theoretical guarantees on parameter variance and model collapse prevention, effectively handling continuous domain shifts while mitigating catastrophic forgetting.

Abstract: This paper introduces ReservoirTTA, a novel plug-in framework designed for prolonged test-time adaptation (TTA) in scenarios where the test domain continuously shifts over time, including cases where domains recur or evolve gradually. At its core, ReservoirTTA maintains a reservoir of domain-specialized models – an adaptive test-time model ensemble – that both detects new domains via online clustering over style features of incoming samples and routes each sample to the appropriate specialized model, and thereby enables domain-specific adaptation. This multi-model strategy overcomes key limitations of single model adaptation, such as catastrophic forgetting, inter-domain interference, and error accumulation, ensuring robust and stable performance on sustained non-stationary test distributions. Our theoretical analysis reveals key components that bound parameter variance and prevent model collapse, while our plug-in TTA module mitigates catastrophic forgetting of previously encountered domains. Extensive experiments on the classification corruption benchmarks, including ImageNet-C and CIFAR-10/100-C, as well as the Cityscapes→ACDC semantic segmentation task, covering recurring and continuously evolving domain shifts, demonstrate that ReservoirTTA significantly improves adaptation accuracy and maintains stable performance across prolonged, recurring shifts, outperforming state-of-the-art methods. Our code is publicly available at https://github.com/LTS5/ReservoirTTA.
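
A minimal sketch of the routing mechanism, assuming style features are channel-wise mean/std statistics of shallow activations and a new domain is spawned when a sample is far from every centroid; the distance threshold, centroid update, and per-domain model copies are illustrative simplifications of ReservoirTTA.

```python
import copy
import torch

class ModelReservoir:
    """Toy domain router: online clustering over style features, one
    specialized model per discovered domain. Thresholds, the style statistic,
    and the update rule are illustrative only."""

    def __init__(self, base_model, new_domain_dist=2.0, momentum=0.9):
        self.base_model = base_model
        self.centroids, self.models = [], []
        self.new_domain_dist, self.momentum = new_domain_dist, momentum

    @staticmethod
    def style_feature(feats):          # feats: (B, C, H, W) shallow activations
        mu = feats.mean(dim=(0, 2, 3))
        sigma = feats.std(dim=(0, 2, 3))
        return torch.cat([mu, sigma])  # (2C,)

    def route(self, feats):
        s = self.style_feature(feats)
        if not self.centroids:
            return self._spawn(s)
        dists = torch.stack([torch.norm(s - c) for c in self.centroids])
        k = int(torch.argmin(dists))
        if dists[k] > self.new_domain_dist:      # unseen style -> new domain
            return self._spawn(s)
        self.centroids[k] = self.momentum * self.centroids[k] + (1 - self.momentum) * s
        return self.models[k]

    def _spawn(self, s):
        self.centroids.append(s.clone())
        self.models.append(copy.deepcopy(self.base_model))
        return self.models[-1]
```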

[176] MR-EEGWaveNet: Multiresolutional EEGWaveNet for Seizure Detection from Long EEG Recordings

Kazi Mahmudul Hassan, Xuyang Zhao, Hidenori Sugano, Toshihisa Tanaka

Main category: cs.CV

TL;DR: Proposed MR-EEGWaveNet model for seizure detection that captures temporal and spatial EEG features through multiresolution analysis, significantly outperforming conventional methods with improved F1 scores and precision.

DetailsMotivation: Feature engineering for generalized seizure detection remains challenging, with existing models showing variable performance and inability to accurately distinguish artifacts from seizure data.

Method: End-to-end model with three modules: convolution (depth-wise and spatio-temporal convolution), feature extraction (dimensionality reduction of EEG segments and sub-segments), and predictor (fully connected classifier). Includes anomaly score-based post-processing to reduce false positives.

Result: Significantly outperformed conventional non-multiresolution approach, improving F1 scores from 0.177 to 0.336 on Siena dataset and 0.327 to 0.488 on Juntendo dataset, with precision gains of 15.9% and 20.62% respectively.

Conclusion: MR-EEGWaveNet effectively captures both temporal dependencies and spatial relationships in EEG data, providing superior seizure detection performance while reducing false positives through multiresolution analysis and post-processing techniques.

Abstract: Feature engineering for generalized seizure detection models remains a significant challenge. Recently proposed models show variable performance depending on the training data and remain ineffective at accurately distinguishing artifacts from seizure data. In this study, we propose a novel end-to-end model, “Multiresolutional EEGWaveNet (MR-EEGWaveNet),” which efficiently distinguishes seizure events from background electroencephalogram (EEG) and artifacts/noise by capturing both temporal dependencies across different time frames and spatial relationships between channels. The model has three modules: convolution, feature extraction, and predictor. The convolution module extracts features through depth-wise and spatio-temporal convolution. The feature extraction module individually reduces the feature dimension extracted from EEG segments and their sub-segments. Subsequently, the extracted features are concatenated into a single vector for classification using a fully connected classifier called the predictor module. In addition, an anomaly score-based post-classification processing technique is introduced to reduce the false-positive rates of the model. Experimental results are reported and analyzed using different parameter settings and datasets (Siena (public) and Juntendo (private)). The proposed MR-EEGWaveNet significantly outperformed the conventional non-multiresolution approach, improving the F1 scores from 0.177 to 0.336 on Siena and 0.327 to 0.488 on Juntendo, with precision gains of 15.9% and 20.62%, respectively.
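
The anomaly-score post-processing can be illustrated with a simple smoothing-plus-z-scoring pass over per-segment seizure probabilities; the window lengths and the z-threshold below are placeholders, not the paper's actual procedure.

```python
import numpy as np

def postprocess_scores(probs, baseline_win=30, smooth_win=5, z_thresh=3.0):
    """Illustrative post-classification step: smooth per-segment seizure
    probabilities, z-score them against a running background estimate, and
    keep only segments whose score clearly exceeds the background."""
    probs = np.asarray(probs, dtype=float)
    kernel = np.ones(smooth_win) / smooth_win
    smoothed = np.convolve(probs, kernel, mode="same")
    base_kernel = np.ones(baseline_win) / baseline_win
    mu = np.convolve(smoothed, base_kernel, mode="same")
    sigma = np.sqrt(np.convolve((smoothed - mu) ** 2, base_kernel, mode="same")) + 1e-8
    z = (smoothed - mu) / sigma
    return (z > z_thresh).astype(int)   # 1 = retained seizure segment
```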

[177] LoRA-Edit: Controllable First-Frame-Guided Video Editing via Mask-Aware LoRA Fine-Tuning

Chenjian Gao, Lihe Ding, Xin Cai, Zhanpeng Huang, Zibin Wang, Tianfan Xue

Main category: cs.CV

TL;DR: A mask-based LoRA tuning method that adapts pretrained I2V models for flexible video editing using spatiotemporal masks to guide content preservation and generation.

DetailsMotivation: Current video editing methods rely on large-scale pretraining and lack flexibility for specific edits. First-frame-guided editing provides limited control over subsequent frames.

Method: Proposes a mask-based LoRA (Low-Rank Adaptation) tuning method with spatiotemporal masks to guide fine-tuning. The method teaches the model to interpret masks for content preservation/generation and synthesize temporally consistent motion or novel appearances.

Result: Experimental results show superior video editing performance compared to baseline methods, enabling complex transformations like object rotation or flower blooming.

Conclusion: The dual-capability LoRA approach provides users with control over the entire temporal evolution of edits, achieving flexible and high-quality video editing without extensive pretraining.

Abstract: Video editing using diffusion models has achieved remarkable results in generating high-quality edits for videos. However, current methods often rely on large-scale pretraining, limiting flexibility for specific edits. First-frame-guided editing provides control over the first frame, but lacks flexibility over subsequent frames. To address this, we propose a mask-based LoRA (Low-Rank Adaptation) tuning method that adapts pretrained Image-to-Video (I2V) models for flexible video editing. Our key innovation is using a spatiotemporal mask to strategically guide the LoRA fine-tuning process. This teaches the model two distinct skills: first, to interpret the mask as a command to either preserve content from the source video or generate new content in designated regions. Second, for these generated regions, LoRA learns to synthesize either temporally consistent motion inherited from the video or novel appearances guided by user-provided reference frames. This dual-capability LoRA grants users control over the edit’s entire temporal evolution, allowing complex transformations like an object rotating or a flower blooming. Experimental results show our method achieves superior video editing performance compared to baseline methods. Project Page: https://cjeen.github.io/LoRAEdit
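
One way to picture the role of the spatiotemporal mask is as a region-weighted reconstruction objective: outside the mask the model is pushed to reproduce the source video, inside it is pushed toward the edited content. The pixel-space loss below is purely illustrative; the actual LoRA fine-tuning operates on a pretrained I2V diffusion model and its latent-space objective.

```python
import torch.nn.functional as F

def mask_aware_loss(pred_frames, source_frames, target_frames, mask):
    """Hypothetical objective showing the mask's role. mask: (B, T, 1, H, W)
    with 1 = region to edit, 0 = region to preserve from the source video."""
    preserve = F.mse_loss(pred_frames * (1 - mask), source_frames * (1 - mask))
    generate = F.mse_loss(pred_frames * mask, target_frames * mask)
    return preserve + generate
```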

[178] MCN-SLAM: Multi-Agent Collaborative Neural SLAM with Hybrid Implicit Neural Scene Representation

Tianchen Deng, Guole Shen, Xun Chen, Shenghai Yuan, Hongming Shen, Guohao Peng, Zhenyu Wu, Jingchuan Wang, Lihua Xie, Danwei Wang, Hesheng Wang, Weidong Chen

Main category: cs.CV

TL;DR: First distributed multi-agent collaborative neural SLAM framework with hybrid scene representation, intra-to-inter loop closure, and online distillation for submap fusion, plus a new real-world dataset with ground truth.

DetailsMotivation: Existing implicit SLAM algorithms are limited to single-agent scenarios and struggle with large-scale scenes and long sequences. Current NeRF-based multi-agent SLAM frameworks cannot meet communication bandwidth constraints.

Method: Proposes hybrid triplane-grid joint scene representation, distributed camera tracking, intra-to-inter loop closure for local/global consistency, and online distillation for submap fusion. Also creates a new real-world dataset with continuous-time trajectories and high-accuracy 3D meshes ground truth.

Result: Experiments demonstrate superiority in mapping, tracking, and communication efficiency compared to existing methods.

Conclusion: The proposed framework effectively addresses multi-agent SLAM challenges with novel representation and fusion techniques, and the new dataset advances research in SLAM, 3D reconstruction, and visual foundation models.

Abstract: Neural implicit scene representations have recently shown promising results in dense visual SLAM. However, existing implicit SLAM algorithms are constrained to single-agent scenarios and struggle with large-scale scenes and long sequences. Existing NeRF-based multi-agent SLAM frameworks cannot meet the constraints of communication bandwidth. To this end, we propose the first distributed multi-agent collaborative neural SLAM framework with hybrid scene representation, distributed camera tracking, intra-to-inter loop closure, and online distillation for multiple submap fusion. A novel triplane-grid joint scene representation method is proposed to improve scene reconstruction. A novel intra-to-inter loop closure method is designed to achieve local (single-agent) and global (multi-agent) consistency. We also design a novel online distillation method to fuse the information of different submaps to achieve global consistency. Furthermore, to the best of our knowledge, there is no real-world dataset for NeRF-based/GS-based SLAM that provides both continuous-time trajectory ground truth and high-accuracy 3D mesh ground truth. To this end, we propose the first real-world Dense SLAM (DES) dataset covering both single-agent and multi-agent scenarios, ranging from small rooms to large-scale outdoor scenes, with high-accuracy ground truth for both 3D mesh and continuous-time camera trajectory. This dataset can advance research in SLAM, 3D reconstruction, and visual foundation models. Experiments on various datasets demonstrate the superiority of the proposed method in mapping, tracking, and communication. The dataset and code will be open-sourced at https://github.com/dtc111111/mcnslam.

[179] Segment Anything in Pathology Images with Natural Language

Zhixuan Chen, Junlin Hou, Liqi Lin, Yihui Wang, Yequan Bie, Xi Wang, Yanning Zhou, Ronald Cheong Kin Chan, Hao Chen

Main category: cs.CV

TL;DR: PathSegmentor is a text-prompted segmentation foundation model for pathology images that outperforms existing methods and uses natural language instead of spatial inputs.

DetailsMotivation: Current pathology image segmentation methods face challenges due to limited annotated data and restricted category definitions, limiting clinical applications.

Method: Proposed PathSegmentor foundation model using text prompts for segmentation, built on PathSeg dataset with 275k image-mask-label triples across 160 categories from 21 public sources.

Result: Surpasses existing spatial- and text-prompted models by 0.145 and 0.429 in overall Dice scores, respectively, shows strong robustness for complex structures, and generalizes well to external datasets.

Conclusion: PathSegmentor advances explainable AI in precision oncology by providing evidence-based support for clinical decision-making through improved interpretability and biomarker discovery.

Abstract: Pathology image segmentation is crucial in computational pathology for analyzing histological features relevant to cancer diagnosis and prognosis. However, current methods face major challenges in clinical applications due to limited annotated data and restricted category definitions. To address these limitations, we propose PathSegmentor, the first text-prompted segmentation foundation model designed specifically for pathology images. We also introduce PathSeg, the largest and most comprehensive dataset for pathology segmentation, built from 21 public sources and containing 275k image-mask-label triples across 160 diverse categories. With PathSegmentor, users can perform semantic segmentation using natural language prompts, eliminating the need for laborious spatial inputs such as points or boxes. Extensive experiments demonstrate that PathSegmentor outperforms specialized models with higher accuracy and broader applicability, while maintaining a compact architecture. It significantly surpasses existing spatial- and text-prompted models by 0.145 and 0.429 in overall Dice scores, respectively, showing strong robustness in segmenting complex structures and generalizing to external datasets. Moreover, PathSegmentor’s outputs enhance the interpretability of diagnostic models through feature importance estimation and imaging biomarker discovery, offering pathologists evidence-based support for clinical decision-making. This work advances the development of explainable AI in precision oncology.

[180] FreqDGT: Frequency-Adaptive Dynamic Graph Networks with Transformer for Cross-subject EEG Emotion Recognition

Yueyang Li, Shengyu Gong, Weiming Zeng, Nizhuan Wang, Wai Ting Siok

Main category: cs.CV

TL;DR: FreqDGT is a frequency-adaptive dynamic graph transformer that improves cross-subject EEG emotion recognition by dynamically weighting emotion-relevant frequency bands, learning input-specific brain connectivity patterns, and capturing temporal dynamics with adversarial feature disentanglement.

DetailsMotivation: Cross-subject generalization remains a fundamental challenge in EEG-based emotion recognition due to individual variability, cognitive traits, and emotional responses. Current methods struggle with the high inter-subject variability in EEG signals.

Method: FreqDGT integrates three key components: frequency-adaptive processing (FAP) to dynamically weight emotion-relevant frequency bands, adaptive dynamic graph learning (ADGL) to learn input-specific brain connectivity patterns, and multi-scale temporal disentanglement network (MTDN) that combines hierarchical temporal transformers with adversarial feature disentanglement.

Result: Comprehensive experiments demonstrate that FreqDGT significantly improves cross-subject emotion recognition accuracy compared to existing methods.

Conclusion: The proposed framework effectively integrates frequency-adaptive, spatial-dynamic, and temporal-hierarchical modeling while ensuring robustness to individual differences, providing a comprehensive solution for cross-subject EEG emotion recognition.

Abstract: Electroencephalography (EEG) serves as a reliable and objective signal for emotion recognition in affective brain-computer interfaces, offering unique advantages through its high temporal resolution and ability to capture authentic emotional states that cannot be consciously controlled. However, cross-subject generalization remains a fundamental challenge due to individual variability, cognitive traits, and emotional responses. We propose FreqDGT, a frequency-adaptive dynamic graph transformer that systematically addresses these limitations through an integrated framework. FreqDGT introduces frequency-adaptive processing (FAP) to dynamically weight emotion-relevant frequency bands based on neuroscientific evidence, employs adaptive dynamic graph learning (ADGL) to learn input-specific brain connectivity patterns, and implements multi-scale temporal disentanglement network (MTDN) that combines hierarchical temporal transformers with adversarial feature disentanglement to capture both temporal dynamics and ensure cross-subject robustness. Comprehensive experiments demonstrate that FreqDGT significantly improves cross-subject emotion recognition accuracy, confirming the effectiveness of integrating frequency-adaptive, spatial-dynamic, and temporal-hierarchical modeling while ensuring robustness to individual differences. The code is available at https://github.com/NZWANG/FreqDGT.
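
The frequency-adaptive weighting component can be sketched as a learnable softmax over pre-filtered EEG bands; the band count and the simple multiplicative weighting are assumptions, and the paper's FAP module is more elaborate.

```python
import torch
import torch.nn as nn

class FrequencyAdaptiveWeighting(nn.Module):
    """Minimal sketch of frequency-adaptive processing: the input is assumed
    to be pre-filtered into canonical EEG bands (delta..gamma), and a softmax
    over learnable logits weights each band's contribution."""

    def __init__(self, n_bands=5):
        super().__init__()
        self.band_logits = nn.Parameter(torch.zeros(n_bands))

    def forward(self, band_feats):                            # (B, n_bands, C, T)
        w = torch.softmax(self.band_logits, dim=0)            # (n_bands,)
        return (band_feats * w.view(1, -1, 1, 1)).sum(dim=1)  # (B, C, T)
```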

[181] Stereo-based 3D Anomaly Object Detection for Autonomous Driving: A New Dataset and Baseline

Shiyi Mu, Zichong Gu, Hanqi Lyu, Yilin Gao, Shugong Xu

Main category: cs.CV

TL;DR: S3AD algorithm decouples 2D-3D training for better 3D anomaly detection, creates KITTI-AR dataset with 97 new categories to test generalization in autonomous driving scenarios.

DetailsMotivation: 3D detection models trained on closed sets often fail to detect rare anomaly objects on roads, posing risks for autonomous driving applications that need to handle arbitrary-shaped targets.

Method: Proposes S3AD algorithm with decoupled 2D-3D training strategy and anomaly scoring based on foreground confidence prediction. Creates KITTI-AR dataset with 97 new categories using 3D rendering for enhanced testing.

Result: Developed enhanced 3D anomaly detection capability with target-level anomaly scoring. Created comprehensive dataset (6k stereo image pairs) to validate generalization performance in zero-shot scenarios.

Conclusion: The proposed S3AD algorithm and KITTI-AR dataset effectively address 3D anomaly detection challenges, improving generalization for rare object categories in autonomous driving applications.

Abstract: 3D detection technology is widely used in the field of autonomous driving, with its application scenarios gradually expanding from enclosed highways to open conventional roads. For rare anomaly categories that appear on the road, 3D detection models trained on closed sets often misdetect or fail to detect anomaly objects. To address this risk, it is necessary to enhance the generalization ability of 3D detection models for targets of arbitrary shapes and to possess the capability to filter out anomalies. The generalization of 3D detection is limited by two factors: the coupled training of 2D and 3D, and the insufficient diversity in the scale distribution of training samples. This paper proposes a Stereo-based 3D Anomaly object Detection (S3AD) algorithm, which decouples the training strategy of 3D and 2D to release the generalization ability for arbitrary 3D foreground detection, and proposes an anomaly scoring algorithm based on foreground confidence prediction, achieving target-level anomaly scoring. In order to further verify and enhance the generalization of anomaly detection, we use a 3D rendering method to synthesize two augmented reality binocular stereo 3D detection datasets, named KITTI-AR. KITTI-AR extends upon KITTI by adding 97 new categories, totaling 6k pairs of stereo images. The KITTI-AR-ExD subset includes 39 common categories as extra training data to address the sparse sample distribution issue. Additionally, 58 rare categories form the KITTI-AR-OoD subset, which are not used in training to simulate zero-shot scenarios in real-world settings, solely for evaluating 3D anomaly detection. Finally, the performance of the algorithm and the dataset is verified in the experiments. (Code and dataset can be obtained at https://github.com/shiyi-mu/S3AD-Code).
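
The foreground-confidence-based anomaly score can be pictured as "confident foreground that no known class explains"; the exact combination below is an assumption for illustration, not necessarily the paper's formula.

```python
def anomaly_score(foreground_conf, class_probs):
    """Sketch of target-level anomaly scoring: a box that the class-agnostic
    foreground head is confident about, but that no known category explains
    well, receives a high anomaly score.

    foreground_conf: (N,) probability each box is a real 3D foreground object
    class_probs:     (N, K) probabilities over the K known training classes
    """
    known_conf = class_probs.max(dim=-1).values
    return foreground_conf * (1.0 - known_conf)
```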

[182] Slot Attention with Re-Initialization and Self-Distillation

Rongzhen Zhao, Yi Zhao, Juho Kannala, Joni Pajarinen

Main category: cs.CV

TL;DR: DIAS improves Slot Attention by reducing slot redundancy through re-initialization and adding self-distillation from attention maps, achieving SOTA performance in object discovery and recognition.

DetailsMotivation: Standard Slot Attention reuses slots naively, causing redundant slots to compete with informative ones and resulting in erroneous object segmentation. Current methods also lack supervision from internal information beyond reconstruction.

Method: Slot Attention with re-Initialization and self-Distillation (DIAS): 1) Reduces redundancy in aggregated slots and re-initializes extra aggregation to update remaining slots 2) Drives bad attention maps from first iteration to approximate good ones from last iteration for self-distillation

Result: DIAS achieves state-of-the-art performance on object discovery and recognition tasks, and also improves advanced visual prediction and reasoning capabilities.

Conclusion: The proposed DIAS framework effectively addresses slot redundancy and lack of internal supervision in Object-Centric Learning, demonstrating superior performance across multiple visual tasks.

Abstract: Unlike popular solutions based on dense feature maps, Object-Centric Learning (OCL) represents visual scenes as sub-symbolic object-level feature vectors, termed slots, which are highly versatile for tasks involving visual modalities. OCL typically aggregates object superpixels into slots by iteratively applying competitive cross attention, known as Slot Attention, with the slots as the query. However, once initialized, these slots are reused naively, causing redundant slots to compete with informative ones for representing objects. This often results in objects being erroneously segmented into parts. Additionally, mainstream methods derive supervision signals solely from decoding slots into the input’s reconstruction, overlooking potential supervision based on internal information. To address these issues, we propose Slot Attention with re-Initialization and self-Distillation (DIAS): (i) we reduce redundancy in the aggregated slots and re-initialize extra aggregation to update the remaining slots; (ii) we drive the bad attention map at the first aggregation iteration to approximate the good one at the last iteration to enable self-distillation. Experiments demonstrate that DIAS achieves state-of-the-art performance on OCL tasks such as object discovery and recognition, while also improving advanced visual prediction and reasoning. Our source code and model checkpoints are available on https://github.com/Genera1Z/DIAS.
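
The self-distillation term can be sketched as pushing the first-iteration attention map toward a detached copy of the last-iteration map; the KL objective and the tensor layout are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def attention_self_distillation(attn_first, attn_last):
    """Drive the attention map from the first Slot Attention iteration toward
    the (detached) map from the last iteration.

    attn_*: (B, num_slots, num_tokens), assumed to be normalized distributions.
    """
    log_p = torch.log(attn_first.clamp_min(1e-8))
    target = attn_last.detach()
    return F.kl_div(log_p, target, reduction="batchmean")
```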

[183] Dataset Condensation with Color Compensation

Huyu Wu, Duo Su, Junjie Hou, Guang Li

Main category: cs.CV

TL;DR: DC3 is a dataset condensation framework that addresses color diversity issues in compressed datasets using color compensation via latent diffusion models, achieving state-of-the-art performance without semantic distortion.

DetailsMotivation: Existing dataset condensation methods suffer from inefficiency (image-level selection) or semantic distortion (pixel-level optimization), with a critical oversight of color's dual role as both information carrier and semantic representation unit.

Method: DC3 uses a calibrated selection strategy followed by latent diffusion model to enhance color diversity of images rather than creating new ones, focusing on color compensation.

Result: Outperforms SOTA methods across multiple benchmarks, demonstrates superior performance and generalization, and enables fine-tuning pre-trained diffusion models with condensed datasets without model collapse.

Conclusion: Improving colorfulness in condensed images benefits representation learning, and DC3 provides high-quality datasets feasible for training networks without degradation issues.

Abstract: Dataset condensation always faces a constitutive trade-off: balancing performance and fidelity under extreme compression. Existing methods struggle with two bottlenecks: image-level selection methods (Coreset Selection, Dataset Quantization) suffer from inefficient condensation, while pixel-level optimization (Dataset Distillation) introduces semantic distortion due to over-parameterization. With empirical observations, we find that a critical problem in dataset condensation is the oversight of color’s dual role as an information carrier and a basic semantic representation unit. We argue that improving the colorfulness of condensed images is beneficial for representation learning. Motivated by this, we propose DC3: a Dataset Condensation framework with Color Compensation. After a calibrated selection strategy, DC3 utilizes the latent diffusion model to enhance the color diversity of an image rather than creating a brand-new one. Extensive experiments demonstrate the superior performance and generalization of DC3 that outperforms SOTA methods across multiple benchmarks. To the best of our knowledge, besides focusing on downstream tasks, DC3 is the first research to fine-tune pre-trained diffusion models with condensed datasets. The FID results prove that training networks with our high-quality datasets is feasible without model collapse or other degradation issues. Code and generated data are available at https://github.com/528why/Dataset-Condensation-with-Color-Compensation.

[184] SBP-YOLO:A Lightweight Real-Time Model for Detecting Speed Bumps and Potholes

Chuanqi Liang, Jie Fu, Miao Yu, Lei Luo

Main category: cs.CV

TL;DR: SBP-YOLO is a lightweight detection framework that achieves 87.0% mAP and 139.5 FPS on embedded platforms for road bump and pothole detection, outperforming YOLOv11n by 5.8% with optimized architecture and training strategies.

DetailsMotivation: Reliable real-time detection of road speed bumps and potholes is crucial for advanced suspension systems, but challenging due to limited computational resources on embedded platforms and small target sizes.

Method: Based on YOLOv11n, integrates GhostConv and VoVGSCSPC modules, adds P2-level branch with lightweight detection head (LEDH), and uses hybrid training with NWD loss, knowledge distillation, and Albumentations augmentation.

Result: Achieves 87.0% mAP (5.8% improvement over baseline), runs at 139.5 FPS on Jetson AGX Xavier after TensorRT FP16 quantization, with 12.4% speedup over P2-enhanced YOLOv11.

Conclusion: SBP-YOLO effectively enables fast, low-latency road condition perception for embedded suspension control systems with high accuracy and efficiency.

Abstract: Reliable and real-time detection of road speed bumps and potholes is crucial for anticipatory perception in advanced suspension systems, enabling timely and adaptive damping control. Achieving high accuracy and efficiency on embedded platforms remains challenging due to limited computational resources and the small scale of distant targets. This paper presents SBP-YOLO, a lightweight and high-speed detection framework tailored for bump and pothole recognition. Based on YOLOv11n, the model integrates GhostConv and VoVGSCSPC modules into the backbone and neck to reduce computation while enhancing multi-scale semantic features. To improve small-object detection, a P2-level branch is introduced with a lightweight and efficient detection head (LEDH), mitigating the added computational overhead without compromising accuracy. A hybrid training strategy combining NWD loss, backbone-level knowledge distillation, and Albumentations-driven augmentation further enhances localization precision and robustness. Experiments show that SBP-YOLO achieves 87.0 percent mAP, outperforming the YOLOv11n baseline by 5.8 percent. After TensorRT FP16 quantization, it runs at 139.5 FPS on Jetson AGX Xavier, delivering a 12.4 percent speedup over the P2-enhanced YOLOv11. These results validate the effectiveness of the proposed method for fast and low-latency road condition perception in embedded suspension control systems.

[185] SlotMatch: Distilling Temporally Consistent Object-Centric Representations for Unsupervised Video Segmentation

Diana-Nicoleta Grigore, Neelu Madan, Andreas Mogelmose, Thomas B. Moeslund, Radu Tudor Ionescu

Main category: cs.CV

TL;DR: SlotMatch is a simple knowledge distillation framework that transfers object-centric representations from large teacher models to lightweight students for unsupervised video segmentation, achieving better performance with fewer parameters and faster speed.

DetailsMotivation: Unsupervised video segmentation is challenging due to lack of supervision and complex scenes. Current state-of-the-art models require large, computationally expensive architectures, creating a need for more efficient solutions.

Method: Proposes SlotMatch framework that aligns teacher and student slots via cosine similarity without additional distillation objectives or auxiliary supervision. Uses knowledge distillation to transfer representations from large teacher (SlotContrast) to lightweight student.

Result: Student model matches and even outperforms teacher (SlotContrast) while using 3.6x less parameters and running 1.9x faster. Also surpasses previous unsupervised video segmentation models on two datasets.

Conclusion: SlotMatch provides an effective and simple knowledge distillation approach for unsupervised video segmentation that eliminates the need for additional losses while achieving superior efficiency and performance.

Abstract: Unsupervised video segmentation is a challenging computer vision task, especially due to the lack of supervisory signals coupled with the complexity of visual scenes. To overcome this challenge, state-of-the-art models based on slot attention often have to rely on large and computationally expensive neural architectures. To this end, we propose a simple knowledge distillation framework that effectively transfers object-centric representations to a lightweight student. The proposed framework, called SlotMatch, aligns corresponding teacher and student slots via the cosine similarity, requiring no additional distillation objectives or auxiliary supervision. The simplicity of SlotMatch is confirmed via theoretical and empirical evidence, both indicating that integrating additional losses is redundant. We conduct experiments on two datasets to compare the state-of-the-art teacher model, SlotContrast, with our distilled student. The results show that our student based on SlotMatch matches and even outperforms its teacher, while using 3.6x less parameters and running 1.9x faster. Moreover, our student surpasses previous unsupervised video segmentation models.
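
Because the stated objective is simply cosine alignment of corresponding slots, the core loss fits in a few lines; the assumption below is that teacher and student slots are already index-aligned, with any matching step omitted.

```python
import torch.nn.functional as F

def slotmatch_loss(student_slots, teacher_slots):
    """Align corresponding student and teacher slots via cosine similarity
    (maximize it, i.e. minimize 1 - cos). slots: (B, num_slots, slot_dim)."""
    cos = F.cosine_similarity(student_slots, teacher_slots.detach(), dim=-1)
    return (1.0 - cos).mean()
```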

[186] Unlocking the Potential of MLLMs in Referring Expression Segmentation via a Light-weight Mask Decoder

Jingchao Wang, Zhijian Wu, Dingjiang Huang, Yefeng Zheng, Hong Wang

Main category: cs.CV

TL;DR: MLLMSeg is a lightweight RES framework that leverages MLLM’s vision encoder features without SAM, achieving better performance-cost balance than existing methods.

DetailsMotivation: Address the trade-off between performance and computational cost in Referring Expression Segmentation, as current methods either use the parameter-heavy SAM (632M params) or sacrifice accuracy with lightweight approaches.

Method: Proposes MLLMSeg framework that exploits MLLM vision encoder features, uses detail-enhanced semantic-consistent feature fusion (DSFF), and implements lightweight mask decoder (34M params).

Result: Extensive experiments show MLLMSeg surpasses both SAM-based and SAM-free competitors while maintaining better performance-cost balance.

Conclusion: The method successfully demonstrates that MLLM’s inherent visual features can be effectively utilized for dense prediction tasks without additional visual encoders, achieving state-of-the-art results with significantly reduced parameters.

Abstract: Reference Expression Segmentation (RES) aims to segment image regions specified by referring expressions and has become popular with the rise of multimodal large models (MLLMs). While MLLMs excel in semantic understanding, their token-generation paradigm struggles with pixel-level dense prediction. Existing RES methods either couple MLLMs with the parameter-heavy Segment Anything Model (SAM) with 632M network parameters or adopt SAM-free lightweight pipelines that sacrifice accuracy. To address the trade-off between performance and cost, we specifically propose MLLMSeg, a novel framework that fully exploits the inherent visual detail features encoded in the MLLM vision encoder without introducing an extra visual encoder. Besides, we propose a detail-enhanced and semantic-consistent feature fusion module (DSFF) that fully integrates the detail-related visual feature with the semantic-related feature output by the large language model (LLM) of MLLM. Finally, we establish a light-weight mask decoder with only 34M network parameters that optimally leverages detailed spatial features from the visual encoder and semantic features from the LLM to achieve precise mask prediction. Extensive experiments demonstrate that our method generally surpasses both SAM-based and SAM-free competitors, striking a better balance between performance and cost. Code is available at https://github.com/jcwang0602/MLLMSeg.
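
A toy stand-in for the light-weight decoder: project encoder detail features and an LLM-derived semantic embedding to a shared width, modulate one by the other, and predict mask logits. Layer sizes, the multiplicative fusion, and the omission of upsampling are all assumptions, not the DSFF/decoder design from the paper.

```python
import torch.nn as nn

class LightMaskDecoder(nn.Module):
    """Illustrative mask decoder fusing visual detail features with a
    semantic query vector from the LLM; all dimensions are placeholders."""

    def __init__(self, vis_dim=1024, llm_dim=4096, hidden=256):
        super().__init__()
        self.vis_proj = nn.Conv2d(vis_dim, hidden, 1)
        self.txt_proj = nn.Linear(llm_dim, hidden)
        self.head = nn.Sequential(
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.GELU(),
            nn.Conv2d(hidden, 1, 1))

    def forward(self, vis_feats, llm_embed):   # (B, vis_dim, H, W), (B, llm_dim)
        v = self.vis_proj(vis_feats)
        t = self.txt_proj(llm_embed).unsqueeze(-1).unsqueeze(-1)
        return self.head(v * t)                # (B, 1, H, W) mask logits
```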

[187] MA-CBP: A Criminal Behavior Prediction Framework Based on Multi-Agent Asynchronous Collaboration

Cheng Liu, Daou Zhang, Tingxu Liu, Yuhan Wang, Jinyang Chen, Yuexuan Li, Xinying Xiao, Chenbo Xin, Ziru Wang, Weichao Wu

Main category: cs.CV

TL;DR: MA-CBP is a multi-agent framework for real-time criminal behavior prediction that converts video streams to semantic descriptions and uses joint reasoning over long/short-term contexts for early warning.

DetailsMotivation: Traditional anomaly detection methods fail to capture high-level behavioral semantics from historical data, while LLM-based approaches cannot meet real-time requirements for urban public safety.

Method: Transforms video streams into frame-level semantic descriptions, constructs causally consistent historical summaries, fuses adjacent frames for joint reasoning over long/short-term contexts, and generates behavioral decisions with key elements.

Result: Achieves superior performance on multiple datasets and provides effective risk warning in urban public safety scenarios.

Conclusion: MA-CBP offers a promising solution for real-time criminal behavior prediction and early warning in urban environments, addressing limitations of both traditional feature-based and LLM-based approaches.

Abstract: With the acceleration of urbanization, criminal behavior in public scenes poses an increasingly serious threat to social security. Traditional anomaly detection methods based on feature recognition struggle to capture high-level behavioral semantics from historical information, while generative approaches based on Large Language Models (LLMs) often fail to meet real-time requirements. To address these challenges, we propose MA-CBP, a criminal behavior prediction framework based on multi-agent asynchronous collaboration. This framework transforms real-time video streams into frame-level semantic descriptions, constructs causally consistent historical summaries, and fuses adjacent image frames to perform joint reasoning over long- and short-term contexts. The resulting behavioral decisions include key elements such as event subjects, locations, and causes, enabling early warning of potential criminal activity. In addition, we construct a high-quality criminal behavior dataset that provides multi-scale language supervision, including frame-level, summary-level, and event-level semantic annotations. Experimental results demonstrate that our method achieves superior performance on multiple datasets and offers a promising solution for risk warning in urban public safety scenarios.

[188] Assessment of Using Synthetic Data in Brain Tumor Segmentation

Aditi Jahagirdar, Sameer Joshi

Main category: cs.CV

TL;DR: Synthetic MRI data from GANs can augment brain tumor segmentation training, improving boundary delineation but not solving class imbalance issues.

DetailsMotivation: Address challenges in brain tumor segmentation including tumor heterogeneity, scarce annotated data, and class imbalance in medical imaging datasets.

Method: Used pre-trained GAN model to generate synthetic MRI data, combined with real BraTS 2020 data in varying proportions to train U-Net segmentation networks.

Result: Hybrid datasets (40% real + 60% synthetic) improved whole tumor boundary delineation, but overall quantitative performance was comparable to real-only training, with persistent class imbalance in tumor core and enhancing tumor regions.

Conclusion: Synthetic data is feasible for augmentation in brain tumor segmentation, but requires larger-scale experiments, volumetric consistency, and better class imbalance mitigation strategies.

Abstract: Manual brain tumor segmentation from MRI scans is challenging due to tumor heterogeneity, scarcity of annotated data, and class imbalance in medical imaging datasets. Synthetic data generated by generative models has the potential to mitigate these issues by improving dataset diversity. This study investigates, as a proof of concept, the impact of incorporating synthetic MRI data, generated using a pre-trained GAN model, into training a U-Net segmentation network. Experiments were conducted using real data from the BraTS 2020 dataset, synthetic data generated with the medigan library, and hybrid datasets combining real and synthetic samples in varying proportions. While overall quantitative performance (Dice coefficient, IoU, precision, recall, accuracy) was comparable between real-only and hybrid-trained models, qualitative inspection suggested that hybrid datasets, particularly with 40% real and 60% synthetic data, improved whole tumor boundary delineation. However, region-wise accuracy for the tumor core and the enhancing tumor remained lower, indicating a persistent class imbalance. The findings support the feasibility of synthetic data as an augmentation strategy for brain tumor segmentation, while highlighting the need for larger-scale experiments, volumetric data consistency, and mitigating class imbalance in future work.
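
Building the hybrid training sets is straightforward to sketch: sample the desired proportions from the real and synthetic pools and concatenate them. PyTorch dataset utilities are shown; the sizes and seed are placeholders.

```python
import random
from torch.utils.data import ConcatDataset, Subset

def make_hybrid_dataset(real_ds, synth_ds, real_frac=0.4, total=None, seed=0):
    """Build a hybrid training set with a given real/synthetic proportion,
    e.g. 40% real + 60% synthetic (the best-performing mix reported).
    Samples without replacement from each pool."""
    rng = random.Random(seed)
    total = total or min(len(real_ds) / real_frac, len(synth_ds) / (1 - real_frac))
    n_real = int(total * real_frac)
    n_synth = int(total * (1 - real_frac))
    real_idx = rng.sample(range(len(real_ds)), n_real)
    synth_idx = rng.sample(range(len(synth_ds)), n_synth)
    return ConcatDataset([Subset(real_ds, real_idx), Subset(synth_ds, synth_idx)])
```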

[189] C2PSA-Enhanced YOLOv11 Architecture: A Novel Approach for Small Target Detection in Cotton Disease Diagnosis

Kaiyuan Wang, Jixing Liu, Xiaobo Cai

Main category: cs.CV

TL;DR: Optimized YOLOv11 for cotton disease detection with improved small-target feature extraction, dynamic category weighting, and enhanced data augmentation, achieving 8-10.5% mAP improvement and 158 FPS inference speed for real-time agricultural monitoring.

DetailsMotivation: Address three key challenges in cotton disease detection: low precision in early spot detection (35% leakage rate for sub-5mm² spots), performance degradation in field conditions (25% accuracy drop), and high error rates (34.7%) in multi-disease scenarios.

Method: Proposed C2PSA module for enhanced small-target feature extraction, dynamic category weighting to handle sample imbalance, and improved data augmentation via Mosaic-MixUp scaling. Based on YOLOv11 architecture.

Result: Experimental results on 4,078-image dataset show: mAP50: 0.820 (+8.0% improvement); mAP50-95: 0.705 (+10.5% improvement); Inference speed: 158 FPS. Mobile-deployed system enables real-time monitoring.

Conclusion: The optimized YOLOv11 system successfully addresses key challenges in cotton disease detection, providing high-precision real-time monitoring capability suitable for precision treatment applications in agriculture.

Abstract: This study presents a deep learning-based optimization of YOLOv11 for cotton disease detection, developing an intelligent monitoring system. Three key challenges are addressed: (1) low precision in early spot detection (35% leakage rate for sub-5mm² spots), (2) performance degradation in field conditions (25% accuracy drop), and (3) high error rates (34.7%) in multi-disease scenarios. The proposed solutions include: C2PSA module for enhanced small-target feature extraction; Dynamic category weighting to handle sample imbalance; Improved data augmentation via Mosaic-MixUp scaling. Experimental results on a 4,078-image dataset show: mAP50: 0.820 (+8.0% improvement); mAP50-95: 0.705 (+10.5% improvement); Inference speed: 158 FPS. The mobile-deployed system enables real-time disease monitoring and precision treatment in agricultural applications.
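
The paper does not spell out its dynamic category weighting, so the sketch below uses one common reweighting recipe (effective-number weights) purely as an illustration of how per-class loss weights can be derived from sample counts.

```python
import numpy as np

def dynamic_class_weights(class_counts, beta=0.999):
    """Illustrative class reweighting for imbalanced categories, based on the
    effective-number-of-samples idea; whether the paper uses this exact
    scheme is an assumption."""
    counts = np.asarray(class_counts, dtype=float)
    effective_num = np.maximum(1.0 - np.power(beta, counts), 1e-8)
    weights = (1.0 - beta) / effective_num
    return weights / weights.sum() * len(counts)   # normalize to mean ~1
```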

[190] SRMA-Mamba: Spatial Reverse Mamba Attention Network for Pathological Liver Segmentation in MRI Volumes

Jun Zeng, Yannan Huang, Elif Keles, Halil Ertugrul Aktas, Gorkem Durak, Nikhil Kumar Tomar, Quoc-Huy Trinh, Deepak Ranjan Nayak, Ulas Bagci, Debesh Jha

Main category: cs.CV

TL;DR: SRMA-Mamba is a novel Mamba-based network for 3D liver cirrhosis segmentation in MRI volumes that integrates spatial anatomical relationships and uses reverse attention to refine segmentation details, outperforming state-of-the-art methods.

DetailsMotivation: Early detection of liver cirrhosis is critical for reducing mortality, but existing methods underutilize spatial anatomical details in volumetric MRI data, limiting clinical effectiveness and explainability.

Method: Proposes SRMA-Mamba with Spatial Anatomy-Based Mamba module (SABMamba) that performs selective scans within cirrhotic tissues and combines anatomical information from sagittal, coronal, and axial planes. Also introduces Spatial Reverse Attention module (SRMA) to progressively refine segmentation details using coarse maps and hierarchical encoding features.

Result: Extensive experiments demonstrate that SRMA-Mamba surpasses state-of-the-art methods, delivering exceptional performance in 3D pathological liver segmentation.

Conclusion: The proposed SRMA-Mamba network effectively addresses the challenge of modeling spatial relationships in complex liver anatomical structures, providing superior volumetric segmentation performance for liver cirrhosis detection.

Abstract: Liver Cirrhosis plays a critical role in the prognosis of chronic liver disease. Early detection and timely intervention are critical in significantly reducing mortality rates. However, the intricate anatomical architecture and diverse pathological changes of liver tissue complicate the accurate detection and characterization of lesions in clinical settings. Existing methods underutilize the spatial anatomical details in volumetric MRI data, thereby hindering their clinical effectiveness and explainability. To address this challenge, we introduce a novel Mamba-based network, SRMA-Mamba, designed to model the spatial relationships within the complex anatomical structures of MRI volumes. By integrating the Spatial Anatomy-Based Mamba module (SABMamba), SRMA-Mamba performs selective Mamba scans within liver cirrhotic tissues and combines anatomical information from the sagittal, coronal, and axial planes to construct a global spatial context representation, enabling efficient volumetric segmentation of pathological liver structures. Furthermore, we introduce the Spatial Reverse Attention module (SRMA), designed to progressively refine cirrhotic details in the segmentation map, utilizing both the coarse segmentation map and hierarchical encoding features. Extensive experiments demonstrate that SRMA-Mamba surpasses state-of-the-art methods, delivering exceptional performance in 3D pathological liver segmentation. Our code is publicly available at https://github.com/JunZengz/SRMA-Mamba.

[191] WIPES: Wavelet-based Visual Primitives

Wenhao Zhang, Hao Zhu, Delong Wu, Di Kang, Linchao Bao, Xun Cao, Zhan Ma

Main category: cs.CV

TL;DR: WIPES is a wavelet-based visual primitive that achieves high-quality rendering with fast inference by leveraging wavelet spatial-frequency localization, outperforming both INR-based and Gaussian-based methods.

DetailsMotivation: Existing visual representations suffer from spectrum loss due to frequency guidance or slow rendering from complex neural network decoding, limiting their practical application in 3D vision and graphics.

Method: Proposes WIPES, a universal wavelet-based visual primitive that captures both low and high frequency details using wavelet spatial-frequency localization, and develops a wavelet-based differentiable rasterizer for fast rendering.

Result: Experimental results across 2D image representation, 5D static and 6D dynamic novel view synthesis show WIPES provides higher rendering quality and faster inference than INR-based methods, and better rendering quality than Gaussian-based representations.

Conclusion: WIPES demonstrates that wavelet-based visual primitives offer an effective solution for high-quality, fast-rendering visual representations across multiple dimensions, addressing key limitations of existing approaches.

Abstract: Pursuing a continuous visual representation that offers flexible frequency modulation and fast rendering speed has recently garnered increasing attention in the fields of 3D vision and graphics. However, existing representations often rely on frequency guidance or complex neural network decoding, leading to spectrum loss or slow rendering. To address these limitations, we propose WIPES, a universal Wavelet-based vIsual PrimitivES for representing multi-dimensional visual signals. Building on the spatial-frequency localization advantages of wavelets, WIPES effectively captures both the low-frequency “forest” and the high-frequency “trees.” Additionally, we develop a wavelet-based differentiable rasterizer to achieve fast visual rendering. Experimental results on various visual tasks, including 2D image representation, 5D static and 6D dynamic novel view synthesis, demonstrate that WIPES, as a visual primitive, offers higher rendering quality and faster inference than INR-based methods, and outperforms Gaussian-based representations in rendering quality.

[192] Vehicle detection from GSV imagery: Predicting travel behaviour for cycling and motorcycling using Computer Vision

Kyriaki Kokka, Rahul Goel, Ali Abbas, Kerry A. Nice, Luca Martial, SM Labib, Rihuan Ke, Carola Bibiane Schönlieb, James Woodcock

Main category: cs.CV

TL;DR: Deep learning on Google Street View images effectively estimates global cycling and motorcycling mode shares using YOLOv4 detection and beta regression models.

DetailsMotivation: Transportation impacts health through physical activity, pollution, and injury risks, but comparative global data on cycling and motorcycling behaviors is scarce.

Method: Used YOLOv4 model fine-tuned on 6 cities to detect cycles/motorcycles in 8000 GSV images per city across 185 global cities, then applied beta regression with population density controls.

Result: Strong correlation for motorcycles (0.78) and moderate for cycling (0.51); models achieved R² of 0.614/0.612 with median absolute errors of 1.3%/1.4%.

Conclusion: GSV imagery combined with computer vision provides valuable travel mode data complementing traditional sources, enabling global-scale transportation behavior analysis.

Abstract: Transportation influences health by shaping exposure to physical activity, air pollution, and injury risk. Comparative data on cycling and motorcycling behaviours is scarce, particularly at a global scale. Street view imagery, such as Google Street View (GSV), combined with computer vision, is a valuable resource for efficiently capturing travel behaviour data. This study demonstrates a novel approach using deep learning on street view images to estimate cycling and motorcycling levels across diverse cities worldwide. We utilized data from 185 global cities. Mode shares of cycling and motorcycling were estimated using travel surveys or censuses. We used GSV images to detect cycles and motorcycles in sampled locations, using 8000 images per city. The YOLOv4 model, fine-tuned using images from six cities, achieved a mean average precision of 89% for detecting cycles and motorcycles. A global prediction model was developed using beta regression with city-level mode shares as the outcome and log-transformed counts of GSV-detected images with cycles and motorcycles as explanatory variables, while controlling for population density. We found strong correlations between GSV motorcycle counts and motorcycle mode share (0.78) and moderate correlations between GSV cycle counts and cycling mode share (0.51). Beta regression models predicted mode shares with R² values of 0.614 for cycling and 0.612 for motorcycling, achieving median absolute errors (MDAE) of 1.3% and 1.4%, respectively. Scatterplots demonstrated consistent prediction accuracy, though cities like Utrecht and Cali were outliers. The model was applied to 60 cities globally for which we didn’t have recent mode share data. We provided estimates for some cities in the Middle East, Latin America and East Asia. With computer vision, GSV images capture travel modes and activity, providing insights alongside traditional data sources.
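
As a rough stand-in for the beta-regression model (a proper beta GLM would replace the OLS step), one can regress the logit of mode share on log-transformed detection counts while controlling for population density; the variable names and the logit/OLS approximation are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_mode_share_model(gsv_counts, pop_density, mode_share):
    """Illustrative approximation of the paper's beta regression.

    gsv_counts:  per-city counts of GSV images with detected cycles/motorcycles
    pop_density: per-city population density (control variable)
    mode_share:  per-city mode share as a fraction in (0, 1)
    """
    X = np.column_stack([np.log1p(gsv_counts), np.log1p(pop_density)])
    eps = 1e-4
    y = np.clip(np.asarray(mode_share, dtype=float), eps, 1 - eps)
    y_logit = np.log(y / (1 - y))
    model = LinearRegression().fit(X, y_logit)
    preds = 1.0 / (1.0 + np.exp(-model.predict(X)))   # back to the share scale
    return model, preds
```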

[193] Fully Automated Segmentation of Fiber Bundles in Anatomic Tracing Data

Kyriaki-Margarita Bintsi, Yaël Balbastre, Jingjing Wu, Julia F. Lehman, Suzanne N. Haber, Anastasia Yendiki

Main category: cs.CV

TL;DR: Automated U-Net framework for fiber bundle segmentation in macaque tracer data with improved sparse bundle detection and reduced false discovery rates.

DetailsMotivation: Manual annotation of fiber bundles on histological slides is labor-intensive, and existing automated methods often miss sparse bundles or require complex post-processing, limiting large-scale analysis of anatomic tracer studies.

Method: U-Net architecture with large patch sizes, foreground aware sampling, and semisupervised pre-training for fully automated fiber bundle segmentation in standalone histological slices.

Result: Eliminates mislabeling errors, improves sparse bundle detection by over 20%, reduces False Discovery Rate by 40% compared to state-of-the-art methods.

Conclusion: This framework enables automated large-scale analysis of anatomic tracing data, generating more ground-truth data to validate and optimize dMRI tractography methods.

Abstract: Anatomic tracer studies are critical for validating and improving diffusion MRI (dMRI) tractography. However, large-scale analysis of data from such studies is hampered by the labor-intensive process of annotating fiber bundles manually on histological slides. Existing automated methods often miss sparse bundles or require complex post-processing across consecutive sections, limiting their flexibility and generalizability. We present a streamlined, fully automated framework for fiber bundle segmentation in macaque tracer data, based on a U-Net architecture with large patch sizes, foreground aware sampling, and semisupervised pre-training. Our approach eliminates common errors such as mislabeling terminals as bundles, improves detection of sparse bundles by over 20% and reduces the False Discovery Rate (FDR) by 40% compared to the state-of-the-art, all while enabling analysis of standalone slices. This new framework will facilitate the automated analysis of anatomic tracing data at a large scale, generating more ground-truth data that can be used to validate and optimize dMRI tractography methods.
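
The foreground-aware sampling idea can be illustrated with a short sketch: patches containing at least one labeled fiber-bundle pixel are drawn preferentially, so sparse bundles are not drowned out by background. The patch size, retry budget, and 70/30 split below are hypothetical choices for illustration, not values from the paper.

```python
import numpy as np

def sample_patches(image, label, patch=1024, n=32, fg_ratio=0.7, rng=None):
    """Foreground-aware patch sampling (illustrative sketch).

    With probability `fg_ratio`, a patch is re-drawn until it contains at
    least one labeled fiber-bundle pixel; otherwise it is drawn uniformly.
    Assumes 2D `image` and `label` arrays larger than the patch size.
    """
    rng = rng or np.random.default_rng()
    H, W = label.shape
    patches = []
    for _ in range(n):
        want_fg = rng.random() < fg_ratio
        for _ in range(100):                      # retry budget
            y = rng.integers(0, H - patch)
            x = rng.integers(0, W - patch)
            lab = label[y:y + patch, x:x + patch]
            if not want_fg or lab.any():          # accept background or a hit
                break
        patches.append((image[y:y + patch, x:x + patch], lab))
    return patches
```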

cs.AI

[194] Chain-of-Agents: End-to-End Agent Foundation Models via Multi-Agent Distillation and Agentic RL

Weizhen Li, Jianbo Lin, Zhuosong Jiang, Jingyi Cao, Xinpeng Liu, Jiayu Zhang, Zhenqiang Huang, Qianben Chen, Weichen Sun, Qiexiang Wang, Hongxuan Lu, Tianrui Qin, Chenghao Zhu, Yi Yao, Shuying Fan, Xiaowan Li, Tiannan Wang, Pai Liu, King Zhu, He Zhu, Dingfeng Shi, Piaohong Wang, Yeyi Guan, Xiangru Tang, Minghao Liu, Yuchen Eleanor Jiang, Jian Yang, Jiaheng Liu, Ge Zhang, Wangchunshu Zhou

Main category: cs.AI

TL;DR: Chain-of-Agents (CoA) is a new LLM reasoning paradigm that enables end-to-end complex problem-solving within one model, simulating multi-agent collaboration through dynamic tool and role-playing agent activation, trained via multi-agent distillation and agentic reinforcement learning.

DetailsMotivation: Existing multi-agent systems rely on manual prompt/workflow engineering, making them computationally inefficient, less capable, and unable to benefit from data-centric learning. The authors aim to create a more efficient and capable end-to-end solution.

Method: Introduces Chain-of-Agents paradigm with multi-agent distillation framework to distill state-of-the-art multi-agent systems into CoA trajectories for supervised fine-tuning, followed by agentic reinforcement learning on verifiable tasks to create Agent Foundation Models (AFMs).

Result: AFM establishes new state-of-the-art performance across diverse benchmarks in both web agent and code agent settings.

Conclusion: The approach offers a solid foundation for future research on agent models and agentic RL, with full open-sourcing of model weights, code, and training data.

Abstract: Recent advances in large language models (LLMs) and multi-agent systems have demonstrated remarkable capabilities in complex problem-solving tasks such as deep research, vibe coding, and mathematical reasoning. However, most existing multi-agent systems are built upon manual prompt/workflow engineering with sophisticated agent frameworks, making them computationally inefficient, less capable, and unable to benefit from data-centric learning. In this work, we introduce Chain-of-Agents (CoA), a novel paradigm of LLM reasoning that enables native end-to-end complex problem-solving in the same way as a multi-agent system (i.e., multi-turn problem solving with multiple tools and multiple agents) within one model. In chain-of-agents problem-solving, the model dynamically activates different tool agents and role-playing agents to simulate multi-agent collaboration in an end-to-end fashion. To elicit end-to-end chain-of-agents problem-solving abilities in LLMs, we introduce a multi-agent distillation framework to distill state-of-the-art multi-agent systems into chain-of-agents trajectories for agentic supervised fine-tuning. We then use agentic reinforcement learning on verifiable agentic tasks to further improve the models’ capabilities on chain-of-agents problem solving. We call the resulting models Agent Foundation Models (AFMs). Our empirical studies demonstrate that AFM establishes new state-of-the-art performance across diverse benchmarks in both web agent and code agent settings. We make the entire research, including the model weights, code for training and evaluation, and the training data, fully open-sourced, which offers a solid starting point for future research on agent models and agentic RL.
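
One way to picture multi-agent distillation is as flattening a multi-agent system's run into a single training sequence for one model. The record below is a hypothetical chain-of-agents trajectory format for such supervised fine-tuning; the field names and agent roles are illustrative and are not the paper's schema.

```python
# Hypothetical chain-of-agents trajectory record for supervised fine-tuning.
# All keys and agent names below are invented for illustration.
trajectory = {
    "task": "Find the release year of the library used in the linked repo.",
    "steps": [
        {"agent": "planner", "content": "Split into web search and code inspection."},
        {"agent": "web_tool",
         "tool_call": {"name": "search", "args": {"q": "repo dependency list"}},
         "observation": "requirements.txt mentions foo==2.1"},
        {"agent": "code_tool",
         "tool_call": {"name": "python", "args": {"code": "print('check changelog')"}},
         "observation": "foo 2.1 released 2021"},
        {"agent": "verifier", "content": "Answer is consistent with both sources."},
    ],
    "final_answer": "2021",
}
```

During fine-tuning, each trajectory would be serialized into one token sequence so a single model learns to emit the tool calls and role switches end to end.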

[195] Cognitive Workspace: Active Memory Management for LLMs – An Empirical Study of Functional Infinite Context

Tao An

Main category: cs.AI

TL;DR: Cognitive Workspace is a novel paradigm that emulates human cognitive mechanisms to overcome LLM context limitations, achieving 58.6% memory reuse vs 0% for traditional RAG with significant efficiency gains.

DetailsMotivation: Current LLMs face fundamental limitations in context management despite extended context windows. Traditional retrieval systems fail to capture the dynamic, task-driven nature of human memory management, lacking metacognitive awareness and active planning capabilities.

Method: Three core innovations: (1) active memory management with deliberate information curation, (2) hierarchical cognitive buffers enabling persistent working states, and (3) task-driven context optimization that dynamically adapts to cognitive demands. Based on cognitive science foundations including Baddeley’s working memory model and Clark’s extended mind thesis.

Result: Achieves 58.6% average memory reuse rate (54-60% across tasks) vs 0% for traditional RAG, with 17-18% net efficiency gain despite 3.3x higher operation counts. Statistical significance confirmed with p < 0.001 and Cohen’s d > 23 across multiple task types.

Conclusion: Cognitive Workspace represents a fundamental shift from information retrieval to genuine cognitive augmentation, establishing the first quantitative evidence for active memory superiority in LLM systems with comprehensive theoretical framework synthesis.

Abstract: Large Language Models (LLMs) face fundamental limitations in context management despite recent advances extending context windows to millions of tokens. We propose Cognitive Workspace, a novel paradigm that transcends traditional Retrieval-Augmented Generation (RAG) by emulating human cognitive mechanisms of external memory use. Drawing from cognitive science foundations including Baddeley’s working memory model, Clark’s extended mind thesis, and Hutchins’ distributed cognition framework, we demonstrate that current passive retrieval systems fail to capture the dynamic, task-driven nature of human memory management. Our analysis of 2024-2025 developments reveals that while techniques like Infini-attention and StreamingLLM achieve impressive context lengths, they lack the metacognitive awareness and active planning capabilities essential for true cognitive extension. Cognitive Workspace addresses these limitations through three core innovations: (1) active memory management with deliberate information curation, (2) hierarchical cognitive buffers enabling persistent working states, and (3) task-driven context optimization that dynamically adapts to cognitive demands. Empirical validation demonstrates Cognitive Workspace achieves an average 58.6% memory reuse rate (ranging from 54-60% across different tasks) compared to 0% for traditional RAG, with 17-18% net efficiency gain despite 3.3x higher operation counts. Statistical analysis confirms these advantages with p < 0.001 and Cohen’s d > 23 across multiple task types, establishing the first quantitative evidence for active memory superiority in LLM systems. We present a comprehensive theoretical framework synthesizing insights from 50+ recent papers, positioning Cognitive Workspace as a fundamental shift from information retrieval to genuine cognitive augmentation.
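
The reported memory reuse rate can be read as the fraction of retrievals served from an actively managed working buffer rather than re-fetched from the external store. The class below is a minimal bookkeeping sketch under that assumed semantics; it is not the Cognitive Workspace implementation.

```python
class CognitiveBufferSketch:
    """Minimal sketch of an actively managed working buffer (illustrative only).

    Lookups served from the working buffer count as "reuse"; misses fall back
    to the external store and are promoted into the buffer.
    """
    def __init__(self, capacity=8):
        self.working, self.store = {}, {}
        self.capacity = capacity
        self.hits = self.lookups = 0

    def put(self, key, value):
        self.store[key] = value

    def get(self, key):
        self.lookups += 1
        if key in self.working:
            self.hits += 1                              # reused from working memory
        else:
            self.working[key] = self.store[key]         # promote on miss
            if len(self.working) > self.capacity:
                self.working.pop(next(iter(self.working)))  # evict oldest entry
        return self.working[key]

    @property
    def reuse_rate(self):
        return self.hits / max(self.lookups, 1)
```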

[196] AlphaEval: A Comprehensive and Efficient Evaluation Framework for Formula Alpha Mining

Hongjun Ding, Binqi Chen, Jinsheng Huang, Taian Guo, Zhengyang Mao, Guoyi Shao, Lutong Zou, Luchen Liu, Ming Zhang

Main category: cs.AI

TL;DR: AlphaEval is a unified evaluation framework for automated alpha mining that assesses alphas across five dimensions (predictive power, stability, robustness, financial logic, diversity) without backtesting, offering comprehensive insights and higher efficiency.

DetailsMotivation: Existing alpha evaluation methods have limitations - backtesting is computationally intensive and sequential, while correlation metrics overlook crucial properties like stability, robustness, diversity, and interpretability. The closed-source nature of most models also hinders reproducibility.

Method: Proposed AlphaEval framework that evaluates generated alphas along five complementary dimensions: predictive power, stability, robustness to market perturbations, financial logic, and diversity. The framework is parallelizable and backtest-free.

Result: Extensive experiments show AlphaEval achieves evaluation consistency comparable to comprehensive backtesting while providing more comprehensive insights and higher efficiency. It effectively identifies superior alphas compared to traditional single-metric screening approaches.

Conclusion: AlphaEval addresses key challenges in alpha mining evaluation by offering a unified, efficient framework that assesses multiple quality dimensions. The open-sourced implementation promotes reproducibility and community engagement in quantitative finance research.

Abstract: Formula alpha mining, which generates predictive signals from financial data, is critical for quantitative investment. Although various algorithmic approaches-such as genetic programming, reinforcement learning, and large language models-have significantly expanded the capacity for alpha discovery, systematic evaluation remains a key challenge. Existing evaluation metrics predominantly include backtesting and correlation-based measures. Backtesting is computationally intensive, inherently sequential, and sensitive to specific strategy parameters. Correlation-based metrics, though efficient, assess only predictive ability and overlook other crucial properties such as temporal stability, robustness, diversity, and interpretability. Additionally, the closed-source nature of most existing alpha mining models hinders reproducibility and slows progress in this field. To address these issues, we propose AlphaEval, a unified, parallelizable, and backtest-free evaluation framework for automated alpha mining models. AlphaEval assesses the overall quality of generated alphas along five complementary dimensions: predictive power, stability, robustness to market perturbations, financial logic, and diversity. Extensive experiments across representative alpha mining algorithms demonstrate that AlphaEval achieves evaluation consistency comparable to comprehensive backtesting, while providing more comprehensive insights and higher efficiency. Furthermore, AlphaEval effectively identifies superior alphas compared to traditional single-metric screening approaches. All implementations and evaluation tools are open-sourced to promote reproducibility and community engagement.
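
A backtest-free evaluation of the kind described can be approximated with cross-sectional statistics over a panel of alpha values and forward returns. The sketch below scores three of the five dimensions (predictive power, stability, diversity); the metric definitions are assumptions for illustration, not AlphaEval's exact formulas.

```python
import numpy as np
import pandas as pd

def rank_ic(alpha: pd.DataFrame, fwd_ret: pd.DataFrame) -> pd.Series:
    """Per-date Spearman rank IC between alpha values and forward returns."""
    return alpha.corrwith(fwd_ret, axis=1, method="spearman")

def evaluate_alpha(alpha, fwd_ret, pool):
    """Illustrative scores for three of the five dimensions.

    `alpha`, `fwd_ret`: DataFrames indexed by date, columns = assets.
    `pool`: list of previously accepted alpha DataFrames.
    """
    ic = rank_ic(alpha, fwd_ret)
    predictive = ic.mean()                              # predictive power
    stability = ic.mean() / (ic.std() + 1e-9)           # IC information ratio
    diversity = 1 - max(
        (alpha.corrwith(a, axis=1).mean() for a in pool), default=0.0
    )                                                    # distance from existing pool
    return {"predictive": predictive, "stability": stability, "diversity": diversity}
```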

[197] Fitting Ontologies and Constraints to Relational Structures

Simon Hosemann, Jean Christoph Jung, Carsten Lutz, Sebastian Rudolph

Main category: cs.AI

TL;DR: The paper analyzes the computational complexity and algorithms for fitting ontologies and constraints to relational examples, focusing on description logics EL/ELI and various TGD classes.

DetailsMotivation: To develop methods for automatically constructing ontologies and constraints from positive and negative examples in the form of finite relational structures, which is important for knowledge base construction and refinement.

Method: The study examines description logics EL and ELI, and several classes of tuple-generating dependencies (TGDs) including full, guarded, frontier-guarded, frontier-one, and unrestricted TGDs, as well as inclusion dependencies. It analyzes computational complexity, designs algorithms, and investigates the size of fitting ontologies.

Result: The paper pinpoints exact computational complexity for different ontology and constraint languages. It shows that finite bases exist for EL, ELI, guarded TGDs, and inclusion dependencies, but generally do not exist for full, frontier-guarded, and frontier-one TGDs.

Conclusion: The research provides a comprehensive complexity analysis and algorithmic solutions for ontology fitting problems, revealing fundamental limitations on the existence of finite bases for certain constraint languages while establishing positive results for others.

Abstract: We study the problem of fitting ontologies and constraints to positive and negative examples that take the form of a finite relational structure. As ontology and constraint languages, we consider the description logics $\mathcal{E\mkern-2mu L}$ and $\mathcal{E\mkern-2mu LI}$ as well as several classes of tuple-generating dependencies (TGDs): full, guarded, frontier-guarded, frontier-one, and unrestricted TGDs as well as inclusion dependencies. We pinpoint the exact computational complexity, design algorithms, and analyze the size of fitting ontologies and TGDs. We also investigate the related problem of constructing a finite basis of concept inclusions / TGDs for a given set of finite structures. While finite bases exist for $\mathcal{E\mkern-2mu L}$, $\mathcal{E\mkern-2mu LI}$, guarded TGDs, and inclusion dependencies, they in general do not exist for full, frontier-guarded and frontier-one TGDs.
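
For readers unfamiliar with the constraint languages involved, a tuple-generating dependency has the standard shape shown below. The concrete example is invented for illustration; it is an inclusion dependency and corresponds to the $\mathcal{E\mkern-2mu L}$ concept inclusion $\mathsf{Employee} \sqsubseteq \exists \mathsf{worksIn}.\top$.

```latex
% General shape of a TGD: a conjunctive body implies an existentially
% quantified conjunctive head.
\forall \bar{x}\,\bigl(\varphi(\bar{x}) \rightarrow \exists \bar{y}\,\psi(\bar{x},\bar{y})\bigr)

% Invented example (an inclusion dependency): every employee works in
% some department.
\forall x\,\bigl(\mathsf{Employee}(x) \rightarrow \exists y\,\mathsf{worksIn}(x,y)\bigr)
```

Fitting then asks for a set of such dependencies that hold in all positive example structures and fail in at least one way on each negative example.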

[198] A Hardware-oriented Approach for Efficient Active Inference Computation and Deployment

Nikola Pižurica, Nikola Milović, Igor Jovančević, Conor Heins, Miguel de Prado

Main category: cs.AI

TL;DR: A methodology that integrates pymdp with a sparse computational graph to reduce Active Inference deployment latency by 2x and memory by 35% for resource-constrained environments.

DetailsMotivation: Active Inference (AIF) offers robust decision-making but faces computational and memory challenges in resource-constrained environments, limiting its deployment potential.

Method: Integrates pymdp’s flexibility and efficiency with a unified, sparse computational graph specifically designed for hardware-efficient execution.

Result: Achieves over 2x reduction in latency and up to 35% memory reduction compared to traditional AIF implementations.

Conclusion: The approach successfully advances deployment of efficient AIF agents for real-time and embedded applications by significantly reducing computational and memory demands.

Abstract: Active Inference (AIF) offers a robust framework for decision-making, yet its computational and memory demands pose challenges for deployment, especially in resource-constrained environments. This work presents a methodology that facilitates AIF’s deployment by integrating pymdp’s flexibility and efficiency with a unified, sparse computational graph tailored for hardware-efficient execution. Our approach reduces latency by over 2x and memory by up to 35%, advancing the deployment of efficient AIF agents for real-time and embedded applications.
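
The efficiency gains come from replacing dense operations with a sparse computational graph. The snippet below shows a single, generic hidden-state inference step over a sparse likelihood matrix, meant only to illustrate why sparsity helps; it is neither the paper's pipeline nor pymdp's API.

```python
import numpy as np
from scipy import sparse

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Sparse likelihood matrix A (observations x hidden states) and uniform prior D.
n_obs, n_states = 64, 256
A_dense = np.random.rand(n_obs, n_states) * (np.random.rand(n_obs, n_states) < 0.05)
A_dense /= np.maximum(A_dense.sum(axis=0, keepdims=True), 1e-12)   # P(o|s) columns
A = sparse.csr_matrix(A_dense)

D = np.full(n_states, 1.0 / n_states)
obs = np.zeros(n_obs)
obs[3] = 1.0                                    # one-hot observation

# Posterior over hidden states: only the non-zero entries of A contribute.
log_likelihood = np.log(A.T @ obs + 1e-16)
qs = softmax(log_likelihood + np.log(D))
```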

[199] The Interpretability Analysis of the Model Can Bring Improvements to the Text-to-SQL Task

Cong Zhang

Main category: cs.AI

TL;DR: CESQL model integrates interpretability analysis with execution-guided strategy for WHERE clause parsing, using filtering adjustments and model fusion to enhance text-to-SQL performance on WikiSQL dataset.

DetailsMotivation: To improve foundational capabilities and generalization of text-to-SQL models for real-world applications by reducing dependence on condition column data and manual training labels.

Method: Combines model interpretability analysis with execution-guided strategy for semantic parsing, augmented with filtering adjustments, logical correlation refinements, and model fusion in the CESQL model.

Result: Excels on WikiSQL dataset for single-table queries, significantly boosting prediction accuracy while minimizing dependence on condition column data and avoiding impact of manually labeled training data.

Conclusion: Provides fresh perspectives for handling complex queries and irregular data scenarios in real-world database environments through enhanced basic query processing accuracy.

Abstract: To elevate the foundational capabilities and generalization prowess of the text-to-SQL model in real-world applications, we integrate model interpretability analysis with execution-guided strategy for semantic parsing of WHERE clauses in SQL queries. Furthermore, we augment this approach with filtering adjustments, logical correlation refinements, and model fusion, culminating in the design of the CESQL model that facilitates conditional enhancement. Our model excels on the WikiSQL dataset, which is emblematic of single-table database query tasks, markedly boosting the accuracy of prediction outcomes. When predicting conditional values in WHERE clauses, we have not only minimized our dependence on data within the condition columns of tables but also circumvented the impact of manually labeled training data. Our hope is that this endeavor to enhance accuracy in processing basic database queries will offer fresh perspectives for research into handling complex queries and scenarios featuring irregular data in real-world database environments.
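
The execution-guided part of the strategy can be pictured as pruning candidate WHERE clauses by actually running them against the database. The sketch below is an illustrative proxy for that idea, using sqlite3 on a toy table; the paper's actual procedure and filtering adjustments may differ.

```python
import sqlite3

def execution_guided_filter(conn, table, candidate_wheres):
    """Keep only candidate WHERE clauses that parse, execute, and return rows."""
    survivors = []
    for where in candidate_wheres:
        try:
            rows = conn.execute(
                f"SELECT * FROM {table} WHERE {where} LIMIT 1"
            ).fetchall()
        except sqlite3.Error:
            continue                      # unparsable or invalid column: drop
        if rows:
            survivors.append(where)
    return survivors

# Usage sketch with a toy table:
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE players (name TEXT, team TEXT, points INTEGER)")
conn.execute("INSERT INTO players VALUES ('Ann', 'Blue', 31)")
print(execution_guided_filter(conn, "players",
                              ["points > 30", "team = 'Blue'", "tem = 'Blue'"]))
# -> ["points > 30", "team = 'Blue'"]
```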

[200] CardAIc-Agents: A Multimodal Framework with Hierarchical Adaptation for Cardiac Care Support

Yuting Zhang, Karina V. Bunting, Asgher Champsi, Xiaoxia Wang, Wenqi Lu, Alexander Thorley, Sandeep S Hothi, Zhaowen Qiu, Dipak Kotecha, Jinming Duan

Main category: cs.AI

TL;DR: CardAIc-Agents is a multimodal AI framework that addresses limitations in current AI systems for cardiovascular disease detection by providing adaptive reasoning, tool integration, continuous learning, and visual outputs to support clinical decision-making.

DetailsMotivation: Cardiovascular diseases are the leading cause of death worldwide, exacerbated by healthcare worker shortages. Current AI systems have limitations including rigid workflows, lack of domain-specific tools, static knowledge bases, and limited multimodal capabilities that hinder clinical application.

Method: Proposed CardAIc-Agents framework with: 1) CardiacRAG agent for generating plans from updatable cardiac knowledge, 2) chief agent with tool integration for autonomous execution, 3) stepwise update strategy for dynamic plan refinement, 4) multidisciplinary discussion tool for complex cases, and 5) visual review panels for clinician validation.

Result: Experiments across three datasets demonstrated that CardAIc-Agents outperformed mainstream Vision-Language Models, state-of-the-art agentic systems, and fine-tuned VLMs in efficiency and performance.

Conclusion: The framework successfully addresses key limitations in AI clinical applications by providing adaptive reasoning, continuous learning, multimodal capabilities, and clinician collaboration tools, showing promise for practical cardiovascular disease screening and detection.

Abstract: Cardiovascular diseases (CVDs) remain the foremost cause of mortality worldwide, a burden worsened by a severe deficit of healthcare workers. Artificial intelligence (AI) agents have shown potential to alleviate this gap via automated early detection and proactive screening, yet their clinical application remains limited by: 1) prompt-based clinical role assignment that relies on intrinsic model capabilities without domain-specific tool support; or 2) rigid sequential workflows, whereas clinical care often requires adaptive reasoning that orders specific tests and, based on their results, guides personalised next steps; 3) general and static knowledge bases without continuous learning capability; and 4) fixed unimodal or bimodal inputs and lack of on-demand visual outputs when further clarification is needed. In response, a multimodal framework, CardAIc-Agents, was proposed to augment models with external tools and adaptively support diverse cardiac tasks. Specifically, a CardiacRAG agent generated general plans from updatable cardiac knowledge, while the chief agent integrated tools to autonomously execute these plans and deliver decisions. To enable adaptive and case-specific customization, a stepwise update strategy was proposed to dynamically refine plans based on preceding execution results, once the task was assessed as complex. In addition, a multidisciplinary discussion tool was introduced to interpret challenging cases, thereby supporting further adaptation. When clinicians raised concerns, visual review panels were provided to assist final validation. Experiments across three datasets showed the efficiency of CardAIc-Agents compared to mainstream Vision-Language Models (VLMs), state-of-the-art agentic systems, and fine-tuned VLMs.

[201] Search-Time Data Contamination

Ziwen Han, Meher Mankikar, Julian Michael, Zifan Wang

Main category: cs.AI

TL;DR: Search-time contamination (STC) is a new form of data leakage where search-based LLM agents find evaluation datasets with ground truth answers on platforms like HuggingFace, enabling them to copy answers instead of reasoning, which undermines benchmark integrity.

DetailsMotivation: The paper aims to identify and address a novel contamination issue in evaluating search-based LLM agents, where retrieval tools can access evaluation datasets containing test questions and answers, compromising the validity of benchmark results.

Method: The researchers analyzed search-based agent logs and found that HuggingFace appears among retrieved sources. They conducted experiments on three benchmarks (HLE, SimpleQA, GPQA), measured contamination rates, performed ablation studies by blocking HuggingFace, and analyzed accuracy drops on contaminated subsets.

Result: Approximately 3% of questions were directly contaminated through HuggingFace datasets. When HuggingFace was blocked, accuracy dropped by about 15% on the contaminated subset. The study also found that publicly accessible evaluation datasets may not be the sole source of STC.

Conclusion: The paper proposes best practices for benchmark design and result reporting to address search-time contamination, and releases complete experiment logs to facilitate auditing of evaluation results for search-based LLM agents.

Abstract: Data contamination refers to the leakage of evaluation data into model training data, resulting in overfitting to supposedly held-out test sets and compromising test validity. We identify an analogous issue, search-time contamination (STC), in evaluating search-based LLM agents which use tools to gather information from online sources when answering user queries. STC occurs when the retrieval step surfaces a source containing the test question (or a near-duplicate) alongside its answer, enabling agents to copy rather than genuinely infer or reason, undermining benchmark integrity. We find that HuggingFace, an online platform hosting evaluation datasets, appears among retrieved sources in search based agent logs. Consequently, agents often explicitly acknowledge discovering question answer pairs from HuggingFace within their reasoning chains. On three commonly used capability benchmarks: Humanity’s Last Exam (HLE), SimpleQA, and GPQA, we demonstrate that for approximately 3% of questions, search-based agents directly find the datasets with ground truth labels on HuggingFace. When millions of evaluation queries target the same benchmark, even small, repeated leaks can accelerate the benchmark’s obsolescence, shortening its intended lifecycle. After HuggingFace is blocked, we observe a drop in accuracy on the contaminated subset of approximately 15%. We further show through ablation experiments that publicly accessible evaluation datasets on HuggingFace may not be the sole source of STC. To this end, we conclude by proposing best practices for benchmark design and result reporting to address this novel form of leakage and ensure trustworthy evaluation of search-based LLM agents. To facilitate the auditing of evaluation results, we also publicly release the complete logs from our experiments.
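
The ablation the authors describe, blocking HuggingFace from the retrieval step, amounts to a domain filter applied before the agent reads its search results. A minimal sketch of such a filter is shown below; the `results` structure is an assumption, not the authors' tooling.

```python
from urllib.parse import urlparse

BLOCKED_HOSTS = {"huggingface.co"}   # extend with other dataset-hosting domains

def filter_search_results(results):
    """Drop retrieved sources hosted on blocked domains (illustrative sketch).

    Assumes `results` is an iterable of dicts with a "url" key.
    """
    kept = []
    for r in results:
        host = urlparse(r["url"]).netloc.lower()
        if any(host == b or host.endswith("." + b) for b in BLOCKED_HOSTS):
            continue
        kept.append(r)
    return kept
```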

[202] QuickMerge++: Fast Token Merging with Autoregressive Prior

Dong Liu, Yanxuan Yu

Main category: cs.AI

TL;DR: QuickMerge is a lightweight token merging framework that reduces computational costs in autoregressive generation by dynamically selecting fewer tokens based on attention norms and using an entropy-based budget estimator.

DetailsMotivation: As generative models scale to larger inputs across language, vision, and video domains, token-level computation cost has become a key bottleneck. Most existing token selection methods are static, modality-specific, or incompatible with autoregressive generation.

Method: QuickMerge dynamically selects reduced tokens based on attention norm magnitude guided by entropy-based budget estimator. It uses a lightweight transformer prior trained over merged token sequence to preserve autoregressive compatibility.

Result: QuickMerge demonstrates consistent improvements in compute-accuracy tradeoffs across multi-modality domains, substantially reducing token counts while matching or exceeding performance of learned tokenizers and fixed-patch baselines.

Conclusion: QuickMerge enables accurate generation with fewer tokens through semantic salience estimation, flexible token budgets, and AR alignment, effectively addressing the computational bottleneck in large-scale generative models.

Abstract: As generative models scale to larger inputs across language, vision, and video domains, the cost of token-level computation has become a key bottleneck. While prior work suggests that only a subset of tokens significantly influences downstream predictions, most token selection methods are static, modality-specific, or incompatible with autoregressive generation. In this paper, we propose QuickMerge, a lightweight token merging framework designed for efficient next-token prediction. QuickMerge dynamically selects a reduced number of tokens based on attention norm magnitude, guided by an entropy-based budget estimator. To preserve autoregressive compatibility, we introduce a lightweight transformer prior trained over the merged token sequence. By combining semantic salience estimation, flexible token budgets, and AR alignment, QuickMerge enables accurate generation with fewer tokens. We evaluate QuickMerge across multi-modality domains, demonstrating consistent improvements in compute-accuracy tradeoffs. Specifically, QuickMerge reduces token counts substantially while matching or exceeding the performance of learned tokenizers and fixed-patch baselines.
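
The core selection step, ranking tokens by attention-norm magnitude under an entropy-based budget, can be sketched as follows. This is an illustration of the idea rather than the paper's algorithm: a peaked norm distribution (low entropy) suggests few tokens matter, so the budget shrinks, while a flat distribution keeps more tokens.

```python
import numpy as np

def quickmerge_style_select(token_states, attn_norms, min_keep=8):
    """Select tokens by attention-norm magnitude with an entropy-based budget."""
    p = attn_norms / attn_norms.sum()
    entropy = -(p * np.log(p + 1e-12)).sum()
    max_entropy = np.log(len(p))
    budget = int(max(min_keep, round(len(p) * entropy / max_entropy)))
    keep = np.sort(np.argsort(attn_norms)[-budget:])   # highest-norm tokens, in order
    return keep, token_states[keep]

# Usage: 128 tokens with 16-dim states and synthetic, peaked attention norms.
states = np.random.randn(128, 16)
norms = np.random.rand(128) ** 4
idx, merged = quickmerge_style_select(states, norms)
```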

[203] AI sustains higher strategic tension than humans in chess

Adamo Cerioli, Edward D. Lee, Vito D. P. Servedio

Main category: cs.AI

TL;DR: AI chess players sustain higher strategic tension longer than humans, with tension levels varying by algorithmic complexity and human expertise levels, revealing different strategic approaches between AI and human players.

DetailsMotivation: To study the trade-off between immediate opportunities and long-term objectives in strategic decision-making by comparing human vs human and AI vs AI chess games.

Method: Proposed a network-based metric of piece-to-piece interaction to quantify strategic tension on the chess board and analyzed its evolution in games between different player types.

Result: AI players maintain higher levels of strategic tension for longer durations than elite human players. Cumulative tension increases with algorithmic complexity in AI and shows abrupt increases at 1600 and 2300 Elo ratings in human games.

Conclusion: AI tolerates interconnected positions with balanced offensive/defensive tactics over long periods, while humans limit tension and complexity, possibly due to cognitive limitations. This difference has implications for AI usage in complex strategic environments.

Abstract: Strategic decision-making involves managing the tension between immediate opportunities and long-term objectives. We study this trade-off in chess by characterizing and comparing dynamics between human vs human and AI vs AI games. We propose a network-based metric of piece-to-piece interaction to quantify the ongoing strategic tension on the board. Its evolution in games reveals that the most competitive AI players sustain higher levels of strategic tension for longer durations than elite human players. Cumulative tension varies with algorithmic complexity for AI and correspondingly in human-played games increases abruptly with expertise at about 1600 Elo and again at 2300 Elo. The profiles reveal different approaches. Highly competitive AI tolerates interconnected positions balanced between offensive and defensive tactics over long periods. Human play, in contrast, limits tension and game complexity, which may reflect cognitive limitations and adaptive strategies. The difference may have implications for AI usage in complex, strategic environments.
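
A rough proxy for the piece-to-piece interaction metric is to count attack edges between pieces of opposite colour on the current board. The sketch below uses the python-chess package; the authors' exact definition (weighting, defensive links, normalization) may differ.

```python
import chess

def tension_edges(board: chess.Board) -> int:
    """Count attack edges between pieces of opposite colour (illustrative proxy)."""
    edges = 0
    for sq in chess.SQUARES:
        piece = board.piece_at(sq)
        if piece is None:
            continue
        for target in board.attacks(sq):
            other = board.piece_at(target)
            if other is not None and other.color != piece.color:
                edges += 1
    return edges

board = chess.Board()
board.push_san("e4"); board.push_san("e5"); board.push_san("Nf3")
print(tension_edges(board))   # grows as pieces come into contact
```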

[204] Explicit v.s. Implicit Memory: Exploring Multi-hop Complex Reasoning Over Personalized Information

Zeyu Zhang, Yang Zhang, Haoran Tan, Rui Li, Xu Chen

Main category: cs.AI

TL;DR: The paper proposes a multi-hop personalized reasoning task for LLM agents, creates a dataset and evaluation framework, tests various memory methods, and introduces HybridMem to combine explicit and implicit memory approaches.

DetailsMotivation: Current memory approaches in LLM agents focus mainly on preference alignment and simple QA, but real-world complex tasks require multi-hop reasoning over large amounts of user information, which existing methods struggle with.

Method: The authors define a multi-hop personalized reasoning task, construct a dataset with evaluation framework, implement various explicit and implicit memory methods, conduct comprehensive experiments, and propose HybridMem to combine both memory paradigms.

Result: The paper demonstrates the effectiveness of their proposed HybridMem method through extensive experiments, showing improved performance in multi-hop reasoning over personalized information compared to individual memory approaches.

Conclusion: The research addresses limitations of current memory mechanisms in complex reasoning tasks and provides a valuable framework and dataset for future work on personalized multi-hop reasoning in LLM agents.

Abstract: In large language model-based agents, memory serves as a critical capability for achieving personalization by storing and utilizing users’ information. Although some previous studies have adopted memory to implement user personalization, they typically focus on preference alignment and simple question-answering. However, in the real world, complex tasks often require multi-hop reasoning on a large amount of user information, which poses significant challenges for current memory approaches. To address this limitation, we propose the multi-hop personalized reasoning task to explore how different memory mechanisms perform in multi-hop reasoning over personalized information. We explicitly define this task and construct a dataset along with a unified evaluation framework. Then, we implement various explicit and implicit memory methods and conduct comprehensive experiments. We evaluate their performance on this task from multiple perspectives and analyze their strengths and weaknesses. Besides, we explore hybrid approaches that combine both paradigms and propose the HybridMem method to address their limitations. We demonstrate the effectiveness of our proposed model through extensive experiments. To benefit the research community, we release this project at https://github.com/nuster1128/MPR.

[205] “DIVE” into Hydrogen Storage Materials Discovery with AI Agents

Di Zhang, Xue Jia, Tran Ba Hung, Seong Hoon Jang, Linda Zhang, Ryuhei Sato, Yusuke Hashimoto, Toyoto Sato, Kiyoe Konno, Shin-ichi Orimo, Hao Li

Main category: cs.AI

TL;DR: DIVE multi-agent workflow extracts and organizes experimental data from scientific figures/tables, improving data extraction accuracy by 10-30% compared to existing models, enabling rapid inverse design of hydrogen storage materials.

DetailsMotivation: Much materials data remains trapped in unstructured figures and tables in scientific literature, hindering AI-driven materials discovery and automated design workflows.

Method: Descriptive Interpretation of Visual Expression (DIVE) multi-agent workflow that systematically reads and organizes experimental data from graphical elements in scientific publications.

Result: DIVE improves data extraction accuracy by 10-15% over commercial models and over 30% relative to open-source models. Built a curated database of 30,000+ entries from 4,000 publications, enabling rapid inverse design that identifies new hydrogen storage compositions in 2 minutes.

Conclusion: The AI workflow and agent design are broadly transferable across diverse materials, providing a paradigm for AI-driven materials discovery by effectively unlocking trapped data in scientific literature.

Abstract: Data-driven artificial intelligence (AI) approaches are fundamentally transforming the discovery of new materials. Despite the unprecedented availability of materials data in the scientific literature, much of this information remains trapped in unstructured figures and tables, hindering the construction of large language model (LLM)-based AI agents for automated materials design. Here, we present the Descriptive Interpretation of Visual Expression (DIVE) multi-agent workflow, which systematically reads and organizes experimental data from graphical elements in the scientific literature. We focus on solid-state hydrogen storage materials, a class of materials central to future clean-energy technologies, and demonstrate that DIVE markedly improves the accuracy and coverage of data extraction compared to the direct extraction by multimodal models, with gains of 10-15% over commercial models and over 30% relative to open-source models. Building on a curated database of over 30,000 entries from 4,000 publications, we establish a rapid inverse design workflow capable of identifying previously unreported hydrogen storage compositions in two minutes. The proposed AI workflow and agent design are broadly transferable across diverse materials, providing a paradigm for AI-driven materials discovery.

[206] Towards Unified Multimodal Financial Forecasting: Integrating Sentiment Embeddings and Market Indicators via Cross-Modal Attention

Sarthak Khanna, Armin Berger, David Berghaus, Tobias Deusser, Lorenz Sparrenberg, Rafet Sifa

Main category: cs.AI

TL;DR: STONK is a multimodal framework that combines numerical market data with sentiment-enriched news embeddings to improve daily stock movement prediction through feature concatenation and cross-modal attention.

DetailsMotivation: To address limitations of isolated analyses by creating a unified pipeline that integrates both numerical market indicators and textual news sentiment for more accurate stock forecasting.

Method: Combines numerical market indicators with sentiment-enriched news embeddings using feature concatenation and cross-modal attention mechanisms in a multimodal framework.

Result: Backtesting shows STONK outperforms numeric-only baselines, demonstrating improved performance in daily stock-movement prediction.

Conclusion: The framework provides evidence-based guidance for scalable multimodal financial forecasting, with source code made available on GitHub for further development and application.

Abstract: We propose STONK (Stock Optimization using News Knowledge), a multimodal framework integrating numerical market indicators with sentiment-enriched news embeddings to improve daily stock-movement prediction. By combining numerical & textual embeddings via feature concatenation and cross-modal attention, our unified pipeline addresses limitations of isolated analyses. Backtesting shows STONK outperforms numeric-only baselines. A comprehensive evaluation of fusion strategies and model configurations offers evidence-based guidance for scalable multimodal financial forecasting. Source code is available on GitHub
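
The fusion step described above, market features attending over sentiment-enriched news embeddings followed by feature concatenation, can be sketched in a few lines of PyTorch. Layer sizes and the single-query formulation are assumptions for illustration, not STONK's actual configuration.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Minimal cross-modal attention fusion sketch (illustrative sizes)."""
    def __init__(self, d_market=32, d_news=768, d_model=128, n_heads=4):
        super().__init__()
        self.proj_m = nn.Linear(d_market, d_model)
        self.proj_n = nn.Linear(d_news, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.head = nn.Linear(2 * d_model, 1)            # up/down logit

    def forward(self, market, news):
        q = self.proj_m(market)                          # (B, 1, d_model)
        kv = self.proj_n(news)                           # (B, N_news, d_model)
        fused, _ = self.attn(q, kv, kv)                  # cross-modal attention
        x = torch.cat([q, fused], dim=-1).squeeze(1)     # feature concatenation
        return self.head(x)

model = CrossModalFusion()
logit = model(torch.randn(8, 1, 32), torch.randn(8, 5, 768))   # batch of 8 days
```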

[207] Trust, but verify

Michael J. Yuan, Carlos Lospoy, Sydney Lai, James Snewin, Ju Long

Main category: cs.AI

TL;DR: Social consensus-based detection of unauthorized LLMs in decentralized AI networks with financial incentives via EigenLayer AVS

DetailsMotivation: Decentralized AI networks need to verify that individual nodes are running designated LLMs to maintain service quality and prevent malicious actors from providing incorrect or unauthorized services

Method: Using social consensus among mostly honest peer nodes to detect unauthorized or incorrect LLMs, combined with an intersubjective validation system implemented as an EigenLayer AVS that introduces financial incentives and penalties

Result: Experimental data from the Gaia network demonstrates successful detection of nodes running unauthorized or incorrect LLMs through peer consensus mechanisms

Conclusion: Social consensus combined with financial incentives through EigenLayer AVS provides an effective mechanism for maintaining service quality and honest behavior in decentralized AI agent networks

Abstract: Decentralized AI agent networks, such as Gaia, allow individuals to run customized LLMs on their own computers and then provide services to the public. However, in order to maintain service quality, the network must verify that individual nodes are running their designated LLMs. In this paper, we demonstrate that in a cluster of mostly honest nodes, we can detect nodes that run an unauthorized or incorrect LLM through the social consensus of their peers. We will discuss the algorithm and experimental data from the Gaia network. We will also discuss the intersubjective validation system, implemented as an EigenLayer AVS to introduce financial incentives and penalties to encourage honest behavior from LLM nodes.
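
The peer-consensus idea can be illustrated with a toy majority check over node responses to a shared probe prompt. The Gaia network's actual validation and the EigenLayer AVS incentive logic are far richer; this sketch only shows the basic outlier-flagging step, and the data format is an assumption.

```python
from collections import Counter

def flag_outlier_nodes(responses, agreement=0.6):
    """Flag nodes whose answer disagrees with a sufficiently strong peer majority.

    `responses` maps node_id -> canonicalised answer string (assumed format).
    """
    counts = Counter(responses.values())
    majority_answer, majority_count = counts.most_common(1)[0]
    if majority_count / len(responses) < agreement:
        return []                     # no strong consensus; defer judgement
    return [node for node, ans in responses.items() if ans != majority_answer]

print(flag_outlier_nodes({"n1": "Paris", "n2": "Paris", "n3": "Paris", "n4": "Lyon"}))
# -> ["n4"]
```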

[208] HiFo-Prompt: Prompting with Hindsight and Foresight for LLM-based Automatic Heuristic Design

Chentong Chen, Mengyuan Zhong, Jianyong Sun, Ye Fan, Jialong Shi

Main category: cs.AI

TL;DR: HiFo-Prompt is a novel LLM-based framework that uses Foresight and Hindsight prompting strategies to improve automatic heuristic design in evolutionary computation, achieving better performance and faster convergence than existing methods.

DetailsMotivation: Current LLM-based automatic heuristic design in evolutionary computation suffers from static operators and lacks knowledge accumulation mechanisms, limiting its effectiveness.

Method: The framework uses two synergistic prompting strategies: Foresight-based prompts that adaptively steer search based on population dynamics, and Hindsight-based prompts that distill successful heuristics into reusable design principles, creating a persistent knowledge base.

Result: HiFo-Prompt significantly outperforms state-of-the-art LLM-based AHD methods, generating higher-quality heuristics with substantially faster convergence and superior query efficiency.

Conclusion: The dual Foresight-Hindsight mechanism successfully transforms transient discoveries into persistent knowledge, enabling LLMs to learn from their own experience and achieve better performance in automatic heuristic design.

Abstract: LLM-based Automatic Heuristic Design (AHD) within Evolutionary Computation (EC) frameworks has shown promising results. However, its effectiveness is hindered by the use of static operators and the lack of knowledge accumulation mechanisms. We introduce HiFo-Prompt, a framework that guides LLMs with two synergistic prompting strategies: Foresight and Hindsight. Foresight-based prompts adaptively steer the search based on population dynamics, managing the exploration-exploitation trade-off. In addition, hindsight-based prompts mimic human expertise by distilling successful heuristics from past generations into fundamental, reusable design principles. This dual mechanism transforms transient discoveries into a persistent knowledge base, enabling the LLM to learn from its own experience. Empirical results demonstrate that HiFo-Prompt significantly outperforms state-of-the-art LLM-based AHD methods, generating higher-quality heuristics while achieving substantially faster convergence and superior query efficiency.

[209] Language-Guided Multi-Agent Learning in Simulations: A Unified Framework and Evaluation

Zhengyang Li

Main category: cs.AI

TL;DR: LLM-MARL integrates large language models into multi-agent reinforcement learning to improve coordination, communication, and generalization in game environments through modular components for subgoal generation, symbolic messaging, and memory.

DetailsMotivation: To enhance multi-agent coordination, communication, and generalization capabilities in simulated environments by leveraging the reasoning and language capabilities of large language models within reinforcement learning frameworks.

Method: A unified framework with three modular components (Coordinator, Communicator, Memory) that uses PPO training with language-conditioned loss and LLM query gating. Evaluated in Google Research Football, MAgent Battle, and StarCraft II environments.

Result: Consistent improvements over MAPPO and QMIX in win rate, coordination score, and zero-shot generalization. Ablation studies show significant contributions from subgoal generation and language-based messaging. Emergent behaviors include role specialization and communication-driven tactics.

Conclusion: LLM-MARL successfully bridges language modeling and policy learning, providing a path forward for leveraging LLMs in multi-agent systems for training, games, and human-AI collaboration.

Abstract: This paper introduces LLM-MARL, a unified framework that incorporates large language models (LLMs) into multi-agent reinforcement learning (MARL) to enhance coordination, communication, and generalization in simulated game environments. The framework features three modular components of Coordinator, Communicator, and Memory, which dynamically generate subgoals, facilitate symbolic inter-agent messaging, and support episodic recall. Training combines PPO with a language-conditioned loss and LLM query gating. LLM-MARL is evaluated in Google Research Football, MAgent Battle, and StarCraft II. Results show consistent improvements over MAPPO and QMIX in win rate, coordination score, and zero-shot generalization. Ablation studies demonstrate that subgoal generation and language-based messaging each contribute significantly to performance gains. Qualitative analysis reveals emergent behaviors such as role specialization and communication-driven tactics. By bridging language modeling and policy learning, this work contributes to the design of intelligent, cooperative agents in interactive simulations. It offers a path forward for leveraging LLMs in multi-agent systems used for training, games, and human-AI collaboration.

[210] LOOP: A Plug-and-Play Neuro-Symbolic Framework for Enhancing Planning in Autonomous Systems

Ronit Virwani, Ruchika Suryawanshi

Main category: cs.AI

TL;DR: LOOP is a neuro-symbolic planning framework that enables iterative conversation between neural and symbolic components, achieving 85.8% success rate on IPC benchmarks compared to 55.0% for LLM+P.

DetailsMotivation: Current neural planning approaches struggle with complex domains, producing plans with errors, while classical planners lack flexibility and natural language understanding. Existing neuro-symbolic methods use one-shot translation, missing the opportunity for iterative refinement.

Method: LOOP integrates 13 coordinated neural features including graph neural networks, multi-agent validation, hierarchical decomposition, and causal memory. It generates PDDL specifications, refines them iteratively based on symbolic feedback, and builds a causal knowledge base from execution traces.

Result: LOOP achieved 85.8% success rate on six standard IPC benchmark domains, significantly outperforming LLM+P (55.0%), LLM-as-Planner (19.2%), and Tree-of-Thoughts (3.3%).

Conclusion: The key to reliable planning is making neural networks and symbolic reasoners actually “talk” to each other during the entire process, not choosing between them. LOOP provides a blueprint for building trustworthy autonomous systems for critical real-world applications.

Abstract: Planning is one of the most critical tasks in autonomous systems, where even a small error can lead to major failures or million-dollar losses. Current state-of-the-art neural planning approaches struggle with complex domains, producing plans with missing preconditions, inconsistent goals, and hallucinations. While classical planners provide logical guarantees, they lack the flexibility and natural language understanding capabilities needed for modern autonomous systems. Existing neuro-symbolic approaches use one-shot translation from natural language to formal plans, missing the opportunity for neural and symbolic components to work and refine solutions together. To address this gap, we develop LOOP – a novel neuro-symbolic planning framework that treats planning as an iterative conversation between neural and symbolic components rather than simple translation. LOOP integrates 13 coordinated neural features including graph neural networks for spatial relationships, multi-agent validation for consensus-based correctness, hierarchical decomposition for complex task management, and causal memory that learns from both successes and failures. Unlike existing approaches, LOOP generates PDDL specifications, refines them iteratively based on symbolic feedback, and builds a causal knowledge base from execution traces. LOOP was evaluated on six standard IPC benchmark domains, where it achieved 85.8% success rate compared to LLM+P (55.0%), LLM-as-Planner (19.2%), and Tree-of-Thoughts (3.3%). This work shows that the key to reliable planning lies not in choosing between neural networks and symbolic reasoners but in making them actually “talk” to each other during the entire process. LOOP provides a thorough blueprint for building autonomous systems that can finally be trusted with critical real-world applications.
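
The refine-on-feedback structure the abstract describes can be sketched as a small loop. The hooks `llm_generate` and `validate_pddl` below are hypothetical placeholders for a neural generator and a classical planner/validator; LOOP's 13 coordinated neural features are not reproduced here.

```python
def neuro_symbolic_loop(task_text, llm_generate, validate_pddl, max_rounds=5):
    """Sketch of an iterative neural-symbolic planning conversation.

    `llm_generate(prompt)` returns a candidate PDDL specification (assumed hook);
    `validate_pddl(spec)` returns (plan, errors) from a symbolic planner (assumed hook).
    """
    feedback = ""
    for _ in range(max_rounds):
        spec = llm_generate(f"Task:\n{task_text}\n\nValidator feedback:\n{feedback}")
        plan, errors = validate_pddl(spec)
        if not errors:
            return plan                      # symbolically verified plan
        feedback = "\n".join(errors)         # e.g. missing preconditions, bad goals
    return None                              # give up after max_rounds
```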

[211] SPANER: Shared Prompt Aligner for Multimodal Semantic Representation

Thye Shan Ng, Caren Soyeon Han, Eun-Jung Holden

Main category: cs.AI

TL;DR: SPANER is a modality-agnostic PEFT framework that uses shared prompts to embed diverse modalities into a unified semantic space, improving cross-modal generalization and semantic coherence.

DetailsMotivation: Existing multimodal PEFT approaches focus on task-specific gains but neglect embedding space structure, leaving modality-specific representations isolated and limiting cross-modal generalization.

Method: SPANER employs a shared prompt mechanism that acts as a conceptual anchor, enabling semantically related instances to converge spatially regardless of modality. The framework is extensible to support additional modalities without architectural changes.

Result: Comprehensive experiments across vision-language and audio-visual benchmarks show competitive few-shot retrieval performance while preserving high semantic coherence in the learned embedding space.

Conclusion: Aligning embedding structures (rather than just tuning adapter weights) is crucial for scalable multimodal learning, and SPANER’s shared prompt approach effectively achieves this across diverse modalities.

Abstract: Recent advances in multimodal Parameter-Efficient Fine-Tuning (PEFT) have significantly improved performance on downstream tasks such as few-shot retrieval. However, most existing approaches focus on task-specific gains while neglecting the structure of the multimodal embedding space. As a result, modality-specific representations often remain isolated, limiting cross-modal generalisation. In this work, we introduce Shared Prompt AligNER (SPANER), a modality-agnostic PEFT framework designed to embed inputs from diverse modalities into a unified semantic space. At its core, SPANER employs a shared prompt mechanism that acts as a conceptual anchor, enabling semantically related instances to converge spatially regardless of modality. This shared prompt design is inherently extensible, supporting the seamless integration of additional modalities, such as audio, without altering the core architecture. Through comprehensive experiments across vision-language and audio-visual benchmarks, SPANER demonstrates competitive few-shot retrieval performance while preserving high semantic coherence in the learned embedding space. Our results highlight the importance of aligning embedding structures, rather than merely tuning adapter weights, for scalable multimodal learning.

[212] TASER: Table Agents for Schema-guided Extraction and Recommendation

Nicole Cho, Kirsty Fielding, William Watson, Sumitra Ganesh, Manuela Veloso

Main category: cs.AI

TL;DR: TASER is an agentic table extraction system that handles messy, multi-page financial tables through continuous learning and schema-guided extraction, outperforming existing models by 10.1% and increasing extracted holdings by 9.8%.

DetailsMotivation: Real-world financial documents contain essential information buried in messy, fragmented tables across multiple pages, with 99.4% of tables lacking bounding boxes and some spanning up to 426 rows across 44 pages, presenting unique extraction challenges.

Method: TASER uses table agents for detection, classification, extraction, and recommendations leveraging an initial schema, with a Recommender Agent that reviews outputs, recommends schema revisions, and makes final decisions in a continuous learning process.

Result: TASER outperforms Table Transformer by 10.1%, larger batch sizes yield 104.3% increase in actionable schema recommendations, and 9.8% increase in extracted holdings. The system was trained on 22,584 pages with 3,213 tables representing $731B+ holdings.

Conclusion: Agentic, schema-guided extraction systems show promise for robust understanding of real-world financial tables, and the TASERTab dataset is released to enable further research on real financial table extraction.

Abstract: Real-world financial documents report essential information about an entity’s financial holdings that can span millions of different financial instrument types. Yet, these details are often buried in messy, multi-page, fragmented tables - for example, 99.4% of the tables in our dataset have no bounding boxes with the maximum number of rows amounting to 426 per table across 44 pages. To tackle these unique challenges from real-world tables, we present a continuously learning, agentic table extraction system, TASER (Table Agents for Schema-guided Extraction and Recommendation) that extracts highly unstructured, multi-page, heterogeneous tables into normalized, schema-conforming outputs. Our table agents execute on table detection, classification, extraction, and recommendations by leveraging an initial schema. Then, our Recommender Agent reviews the outputs, recommends schema revisions, and decides on the final recommendations, enabling TASER to outperform existing table detection models such as Table Transformer by 10.1%. Within this continuous learning process, we highlight that larger batch sizes result in a 104.3% increase in schema recommendations that are actionable and utilized, resulting in a 9.8% increase in extracted holdings - highlighting the importance of a continuous learning process. To train TASER, we have manually labeled 22,584 pages (28,150,449 tokens), 3,213 tables for $731,685,511,687 of holdings culminating in one of the first real financial table datasets. We release our dataset TASERTab to enable the research community to access real-world financial tables and outputs. Our results highlight the promise of agentic, schema-guided extraction systems for robust understanding of real-world financial tables.

[213] Virtuous Machines: Towards Artificial General Science

Gabrielle Wehr, Reuben Rideaux, Amaya J. Fox, David R. Lightfoot, Jason Tangen, Jason B. Mattingley, Shane E. Ehrhardt

Main category: cs.AI

TL;DR: AI system autonomously conducted three psychological studies from hypothesis to manuscript preparation, demonstrating capability for scientific research with limitations in theoretical interpretation.

DetailsMotivation: Address the challenge of synthesizing knowledge across disciplines and developing unifying theories due to exponential growth of scientific literature and domain specialization, by exploring general-purpose AI systems for science.

Method: Domain-agnostic, agentic AI system that autonomously navigates the entire scientific workflow - hypothesis generation, data collection (online study with 288 participants), analysis pipeline development through 8+ hour coding sessions, and manuscript preparation.

Result: Successfully designed and executed three psychological studies on visual working memory, mental rotation, and imagery vividness, producing completed manuscripts with theoretical reasoning and methodological rigor comparable to experienced researchers.

Conclusion: This represents a step toward embodied AI that can test hypotheses through real-world experiments and explore scientific spaces beyond human cognitive and resource constraints, while raising important questions about scientific understanding and credit attribution.

Abstract: Artificial intelligence systems are transforming scientific discovery by accelerating specific research tasks, from protein structure prediction to materials design, yet remain confined to narrow domains requiring substantial human oversight. The exponential growth of scientific literature and increasing domain specialisation constrain researchers’ capacity to synthesise knowledge across disciplines and develop unifying theories, motivating exploration of more general-purpose AI systems for science. Here we show that a domain-agnostic, agentic AI system can independently navigate the scientific workflow - from hypothesis generation through data collection to manuscript preparation. The system autonomously designed and executed three psychological studies on visual working memory, mental rotation, and imagery vividness, executed one new online data collection with 288 participants, developed analysis pipelines through 8-hour+ continuous coding sessions, and produced completed manuscripts. The results demonstrate the capability of AI scientific discovery pipelines to conduct non-trivial research with theoretical reasoning and methodological rigour comparable to experienced researchers, though with limitations in conceptual nuance and theoretical interpretation. This is a step toward embodied AI that can test hypotheses through real-world experiments, accelerating discovery by autonomously exploring regions of scientific space that human cognitive and resource constraints might otherwise leave unexplored. It raises important questions about the nature of scientific understanding and the attribution of scientific credit.

[214] STPFormer: A State-of-the-Art Pattern-Aware Spatio-Temporal Transformer for Traffic Forecasting

Jiayu Fang, Zhiqi Shao, S T Boris Choy, Junbin Gao

Main category: cs.AI

TL;DR: STPFormer is a novel spatio-temporal Transformer that achieves state-of-the-art traffic forecasting through pattern-aware temporal encoding, sequential spatial learning, cross-domain alignment, and multi-scale fusion.

DetailsMotivation: Existing Transformer-based models struggle with rigid temporal encoding and weak space-time fusion in complex spatio-temporal traffic forecasting tasks with diverse input formats.

Method: Proposes STPFormer with four key modules: Temporal Position Aggregator (TPA) for pattern-aware temporal encoding, Spatial Sequence Aggregator (SSA) for sequential spatial learning, Spatial-Temporal Graph Matching (STGM) for cross-domain alignment, and Attention Mixer for multi-scale fusion.

Result: Experiments on five real-world datasets show STPFormer consistently achieves state-of-the-art performance, with ablation studies and visualizations confirming its effectiveness and generalizability.

Conclusion: STPFormer provides a unified and interpretable representation learning framework that successfully addresses the challenges of complex temporal patterns, dynamic spatial structures, and diverse input formats in spatio-temporal traffic forecasting.

Abstract: Spatio-temporal traffic forecasting is challenging due to complex temporal patterns, dynamic spatial structures, and diverse input formats. Although Transformer-based models offer strong global modeling, they often struggle with rigid temporal encoding and weak space-time fusion. We propose STPFormer, a Spatio-Temporal Pattern-Aware Transformer that achieves state-of-the-art performance via unified and interpretable representation learning. It integrates four modules: Temporal Position Aggregator (TPA) for pattern-aware temporal encoding, Spatial Sequence Aggregator (SSA) for sequential spatial learning, Spatial-Temporal Graph Matching (STGM) for cross-domain alignment, and an Attention Mixer for multi-scale fusion. Experiments on five real-world datasets show that STPFormer consistently sets new SOTA results, with ablation and visualizations confirming its effectiveness and generalizability.

[215] Discrete Optimization of Min-Max Violation and its Applications Across Computational Sciences

Cheikh Ahmed, Mahdi Mostajabdaveh, Samin Aref, Zirui Zhou

Main category: cs.AI

TL;DR: The paper introduces Discrete Min-Max Violation (DMMV) as a general optimization problem for minimizing worst-case constraint violations, develops a GPU-accelerated heuristic solver, and demonstrates significant improvements across three applications: language model quantization, discrete tomography, and FIR filter design.

DetailsMotivation: Many real-world optimization problems require minimizing worst-case constraint violations across various domains, but existing methods are often domain-specific. The authors aim to create a general, context-free mathematical formulation that can handle diverse use cases with worst-case performance requirements.

Method: The authors mathematically define the DMMV problem, explore its properties, and develop a GPU-accelerated heuristic that leverages the mathematical structure of DMMV for efficient computation. The method is validated on three distinct applications.

Result: The GPU-accelerated heuristic achieved: 14% average improvement in language model quantization, 16% reduction in reconstruction error for discrete tomography with 6x GPU speedup, and nearly 50% ripple reduction in FIR filter design compared to commercial solver Gurobi.

Conclusion: DMMV provides an effective context-free optimization framework for worst-case constraint minimization problems. The proposed GPU-accelerated heuristic demonstrates superior performance across diverse applications, and making it open-source will facilitate further research and applications.

Abstract: We introduce the Discrete Min-Max Violation (DMMV) as a general optimization problem which seeks an assignment of discrete values to variables that minimizes the largest constraint violation. This context-free mathematical formulation is applicable to a wide range of use cases that have worst-case performance requirements. After defining the DMMV problem mathematically, we explore its properties to establish a foundational understanding. To tackle DMMV instance sizes of practical relevance, we develop a GPU-accelerated heuristic that takes advantage of the mathematical properties of DMMV for speeding up the solution process. We demonstrate the versatile applicability of our heuristic by solving three optimization problems as use cases: (1) post-training quantization of language models, (2) discrete tomography, and (3) Finite Impulse Response (FIR) filter design. In quantization without outlier separation, our heuristic achieves 14% improvement on average over existing methods. In discrete tomography, it reduces reconstruction error by 16% under uniform noise and accelerates computations by a factor of 6 on GPU. For FIR filter design, it nearly achieves 50% ripple reduction compared to using the commercial integer optimization solver, Gurobi. Our comparative results point to the benefits of studying DMMV as a context-free optimization problem and the advantages that our proposed heuristic offers on three distinct problems. Our GPU-accelerated heuristic will be made open-source to further stimulate research on DMMV and its other applications. The code is available at https://anonymous.4open.science/r/AMVM-5F3E/
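
The abstract defines DMMV only informally. As a point of reference, a generic min-max violation formulation consistent with that description (an assumption here, not the paper’s exact definition) can be written as:

```latex
\min_{x \in D_1 \times \cdots \times D_n} \; \max_{i \in \{1,\dots,m\}} \; v_i(x),
\qquad v_i(x) = \max\bigl(0,\; g_i(x)\bigr)
```

where each variable $x_j$ ranges over a finite discrete domain $D_j$, each constraint is written as $g_i(x) \le 0$, and $v_i(x)$ measures how much assignment $x$ violates constraint $i$; the objective minimizes the single worst violation rather than their sum.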

[216] LM Agents May Fail to Act on Their Own Risk Knowledge

Yuzhi Tang, Tianxiao Li, Elizabeth Li, Chris J. Maddison, Honghua Dong, Yangjun Ruan

Main category: cs.AI

TL;DR: LM agents show strong risk knowledge but fail to apply it in practice, with significant gaps between awareness and execution safety. A risk verifier system reduces risky actions by 55.3%.

DetailsMotivation: Language model agents pose severe risks in safety-critical scenarios despite having theoretical risk knowledge, creating a dangerous gap between awareness and actual safety execution.

Method: Developed a comprehensive evaluation framework across three dimensions: risk knowledge, risk identification in trajectories, and actual behavior. Created a risk verifier with abstractor to critique agent actions by converting specific trajectories into abstract descriptions.

Result: Agents show near-perfect risk knowledge (>98% pass rates), but performance drops by >23% when identifying risks in actual scenarios, and pass rates fall below 26% when it comes to avoiding risky actions. The risk verifier system reduces risky action execution by 55.3% compared to vanilla agents.

Conclusion: Simply scaling model capabilities or inference compute doesn’t resolve safety concerns. The risk verifier approach effectively bridges the gap between risk knowledge and safe execution, significantly improving agent safety.

Abstract: Language model (LM) agents have demonstrated significant potential for automating real-world tasks, yet they pose a diverse array of potential, severe risks in safety-critical scenarios. In this work, we identify a significant gap between LM agents’ risk awareness and safety execution abilities: while they often answer “Yes” to queries like “Is executing 'sudo rm -rf /*' dangerous?”, they will likely fail to identify such risks in instantiated trajectories or even directly perform these risky actions when acting as agents. To systematically investigate this, we develop a comprehensive evaluation framework to examine agents’ safety across three progressive dimensions: 1) their knowledge about potential risks, 2) their ability to identify corresponding risks in execution trajectories, and 3) their actual behaviors to avoid executing these risky actions. Our evaluation reveals two critical performance gaps that resemble the generator-validator gaps observed in LMs: while agents demonstrate near-perfect risk knowledge (>98% pass rates), they fail to apply this knowledge when identifying risks in actual scenarios (with performance dropping by >23%) and often still execute risky actions (<26% pass rates). Notably, this trend persists across more capable LMs as well as in specialized reasoning models like DeepSeek-R1, indicating that simply scaling model capabilities or inference compute does not inherently resolve safety concerns. Instead, we take advantage of these observed gaps to develop a risk verifier that independently critiques the proposed actions by agents, with an abstractor that converts specific execution trajectories into abstract descriptions where LMs can more effectively identify the risks. Our overall system achieves a significant reduction of risky action execution by 55.3% over vanilla-prompted agents.
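
The abstract describes the verifier-plus-abstractor loop only at a high level. The following is a minimal sketch of that idea, assuming a generic chat-model client; `call_lm`, the prompts, and the ABSTAIN fallback are illustrative placeholders, not the paper’s implementation.

```python
# Minimal sketch of the verifier-plus-abstractor idea described above.
# `call_lm` is a placeholder for any chat-completion client; the prompts and
# the two-step flow are assumptions, not the paper's exact implementation.

def call_lm(prompt: str) -> str:
    """Placeholder for a language-model call (e.g. an API client)."""
    raise NotImplementedError

def abstract_trajectory(trajectory: list[str]) -> str:
    """Compress a concrete execution trajectory into an abstract description."""
    joined = "\n".join(trajectory)
    return call_lm(
        "Summarize the following agent trajectory as an abstract description of "
        f"what the agent is about to do and why:\n{joined}"
    )

def verify_action(trajectory: list[str], proposed_action: str) -> bool:
    """Return True if the proposed action is judged safe to execute."""
    abstract = abstract_trajectory(trajectory)
    verdict = call_lm(
        "You are a safety verifier. Given this abstract situation:\n"
        f"{abstract}\nIs executing the action '{proposed_action}' risky? "
        "Answer RISKY or SAFE."
    )
    return "SAFE" in verdict.upper()

def guarded_step(trajectory: list[str], proposed_action: str) -> str:
    # Only execute the agent's proposal if the independent verifier approves it.
    if verify_action(trajectory, proposed_action):
        return proposed_action
    return "ABSTAIN"  # fall back to a safe no-op and defer to the user
```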

[217] CrafterDojo: A Suite of Foundation Models for Building Open-Ended Embodied Agents in Crafter

Junyeong Park, Hyeonseo Cho, Sungjin Ahn

Main category: cs.AI

TL;DR: CrafterDojo introduces foundation models and tools to make Crafter environment a lightweight alternative to Minecraft for embodied AI research.

DetailsMotivation: Minecraft is too slow and complex for rapid prototyping, while Crafter lacks foundation models that have driven progress in Minecraft research.

Method: Developed CrafterVPT for behavior priors, CrafterCLIP for vision-language grounding, CrafterSteve-1 for instruction following, plus dataset generation tools and evaluation benchmarks.

Result: Created a complete suite of foundation models and tools that unlock Crafter as a prototyping-friendly testbed for embodied agent research.

Conclusion: CrafterDojo provides a lightweight, Minecraft-like environment with necessary foundation models to accelerate research in general-purpose embodied agents.

Abstract: Developing general-purpose embodied agents is a core challenge in AI. Minecraft provides rich complexity and internet-scale data, but its slow speed and engineering overhead make it unsuitable for rapid prototyping. Crafter offers a lightweight alternative that retains key challenges from Minecraft, yet its use has remained limited to narrow tasks due to the absence of foundation models that have driven progress in the Minecraft setting. In this paper, we present CrafterDojo, a suite of foundation models and tools that unlock the Crafter environment as a lightweight, prototyping-friendly, and Minecraft-like testbed for general-purpose embodied agent research. CrafterDojo addresses this by introducing CrafterVPT, CrafterCLIP, and CrafterSteve-1 for behavior priors, vision-language grounding, and instruction following, respectively. In addition, we provide toolkits for generating behavior and caption datasets (CrafterPlay and CrafterCaption), reference agent implementations, benchmark evaluations, and a complete open-source codebase.

[218] Toward Better EHR Reasoning in LLMs: Reinforcement Learning with Expert Attention Guidance

Yue Fang, Yuxin Guo, Jiaran Gao, Hongxin Ding, Xinke Jiang, Weibin Liao, Yongxin Xu, Yinghao Zhu, Zhibang Yang, Liantao Ma, Junfeng Zhao, Yasha Wang

Main category: cs.AI

TL;DR: EAG-RL is a novel two-stage training framework that enhances LLMs’ EHR reasoning ability through expert attention guidance and reinforcement learning, achieving 14.62% average improvement in clinical prediction tasks.

DetailsMotivation: LLMs underperform on EHR-based prediction tasks due to challenges in modeling temporally structured, high-dimensional data, and existing hybrid approaches fail to improve LLMs' intrinsic reasoning capacity while inheriting DL models' generalization limitations.

Method: Two-stage framework: 1) constructs stepwise reasoning trajectories using expert-guided Monte Carlo Tree Search to initialize LLM policy, 2) optimizes policy via reinforcement learning by aligning LLM’s attention with clinically salient features identified by expert EHR models.

Result: EAG-RL improves LLMs’ intrinsic EHR reasoning ability by an average of 14.62% on two real-world EHR datasets, enhances robustness to feature perturbations, and improves generalization to unseen clinical domains.

Conclusion: EAG-RL demonstrates practical potential for real-world deployment in clinical prediction tasks by effectively enhancing LLMs’ EHR reasoning capabilities through expert guidance and reinforcement learning.

Abstract: Improving large language models (LLMs) for electronic health record (EHR) reasoning is essential for enabling accurate and generalizable clinical predictions. While LLMs excel at medical text understanding, they underperform on EHR-based prediction tasks due to challenges in modeling temporally structured, high-dimensional data. Existing approaches often rely on hybrid paradigms, where LLMs serve merely as frozen prior retrievers while downstream deep learning (DL) models handle prediction, failing to improve the LLM’s intrinsic reasoning capacity and inheriting the generalization limitations of DL models. To this end, we propose EAG-RL, a novel two-stage training framework designed to intrinsically enhance LLMs’ EHR reasoning ability through expert attention guidance, where expert EHR models refer to task-specific DL models trained on EHR data. Concretely, EAG-RL first constructs high-quality, stepwise reasoning trajectories using expert-guided Monte Carlo Tree Search to effectively initialize the LLM’s policy. Then, EAG-RL further optimizes the policy via reinforcement learning by aligning the LLM’s attention with clinically salient features identified by expert EHR models. Extensive experiments on two real-world EHR datasets show that EAG-RL improves the intrinsic EHR reasoning ability of LLMs by an average of 14.62%, while also enhancing robustness to feature perturbations and generalization to unseen clinical domains. These results demonstrate the practical potential of EAG-RL for real-world deployment in clinical prediction tasks. Our code is available at https://github.com/devilran6/EAG-RL.
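
The abstract does not spell out how attention is aligned with expert-identified features. One plausible realization, sketched here purely as an assumption, is a KL-style penalty between the LLM’s attention mass over EHR features and the expert model’s feature-importance distribution.

```python
import torch

def attention_alignment_loss(attn_over_features: torch.Tensor,
                             expert_importance: torch.Tensor,
                             eps: float = 1e-8) -> torch.Tensor:
    """KL divergence between an expert EHR model's feature-importance distribution
    and the LLM's normalized attention mass over the same features.

    attn_over_features: (batch, n_features) non-negative attention mass per feature
    expert_importance:  (batch, n_features) non-negative saliency from the expert model
    """
    p = expert_importance / (expert_importance.sum(dim=-1, keepdim=True) + eps)
    q = attn_over_features / (attn_over_features.sum(dim=-1, keepdim=True) + eps)
    # KL(p || q): penalize attention that ignores features the expert deems salient.
    return (p * (torch.log(p + eps) - torch.log(q + eps))).sum(dim=-1).mean()

# Example: such a term could be added as a shaping penalty in the RL objective.
attn = torch.rand(4, 12)
saliency = torch.rand(4, 12)
loss = attention_alignment_loss(attn, saliency)
```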

[219] Breaking the SFT Plateau: Multimodal Structured Reinforcement Learning for Chart-to-Code Generation

Lei Chen, Xuanle Zhao, Zhixiong Zeng, Jing Huang, Liming Zheng, Yufeng Zhong, Lin Ma

Main category: cs.AI

TL;DR: MSRL (Multimodal Structured Reinforcement Learning) breaks through SFT performance plateaus in chart-to-code generation using multimodal rewards and achieves state-of-the-art results.

DetailsMotivation: RL is underutilized for complex vision-language tasks requiring structured outputs. SFT alone hits performance plateaus in chart-to-code generation, necessitating effective RL strategies with proper reward systems.

Method: Multimodal structured reward system with textual rule-based rewards for code details and visual model-based rewards for structural similarity. Two-stage curriculum training on 3M real-world chart-code pairs.

Result: MSRL improves high-level metrics by 6.2% on ChartMimic and 9.9% on ReachQA benchmarks, achieving competitive performance with advanced closed-source models.

Conclusion: MSRL effectively breaks SFT performance plateaus through multimodal structured rewards, demonstrating RL’s potential for complex structured output generation tasks.

Abstract: While reinforcement learning (RL) has proven highly effective for general reasoning in vision-language models, its application to tasks requiring in-depth understanding of information-rich images and generation of structured outputs remains underexplored. Chart-to-code generation exemplifies this challenge, demanding complex reasoning over visual charts to generate structured code. Supervised fine-tuning (SFT) alone is often insufficient, highlighting the need for effective RL strategies that appropriately reward structured outputs. We systematically investigate the performance plateau in SFT through large-scale experiments and propose Multimodal Structured Reinforcement Learning (MSRL) for chart-to-code generation, which substantially breaks through this plateau. We construct the largest training corpus to date, containing 3 million chart-code pairs from real-world arXiv tables to mitigate simplistic patterns of prior synthetic data. Despite reaching state-of-the-art performance, our experiments show that scaling SFT data eventually hits a plateau where further increases yield negligible improvements. Our MSRL method leverages a multi-granularity structured reward system using multimodal textual and visual feedback. At the textual level, rule-based rewards validate fine-grained code details. At the visual level, model-based rewards assess structural similarity by rendering generated code into images and employing an evaluator model. We implement this within a two-stage curriculum for training stability. Results demonstrate that MSRL significantly breaks the SFT plateau, improving high-level metrics by 6.2% and 9.9% on ChartMimic and ReachQA benchmarks respectively, achieving competitive performance with advanced closed-source models.
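
A minimal sketch of the two-level reward described above, assuming hypothetical `render` and `evaluator` callables for the visual side; the actual rule set and evaluator model used in MSRL are not specified in the abstract.

```python
def rule_based_text_reward(generated_code: str, reference_code: str) -> float:
    """Toy textual reward: fraction of reference code lines reproduced verbatim.
    Stands in for the fine-grained rule checks described in the abstract."""
    ref_lines = {line.strip() for line in reference_code.splitlines() if line.strip()}
    gen_lines = {line.strip() for line in generated_code.splitlines() if line.strip()}
    return len(ref_lines & gen_lines) / max(len(ref_lines), 1)

def visual_model_reward(generated_code: str, reference_image, render, evaluator) -> float:
    """Render the generated code to an image and score structural similarity
    with a learned evaluator model (`render` and `evaluator` are placeholders)."""
    try:
        image = render(generated_code)
    except Exception:
        return 0.0  # non-executable code earns no visual reward
    return float(evaluator(image, reference_image))

def msrl_style_reward(generated_code, reference_code, reference_image,
                      render, evaluator, w_text=0.5, w_visual=0.5) -> float:
    # Combine the textual and visual reward levels with fixed illustrative weights.
    return (w_text * rule_based_text_reward(generated_code, reference_code)
            + w_visual * visual_model_reward(generated_code, reference_image,
                                             render, evaluator))
```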

[220] V2P: From Background Suppression to Center Peaking for Robust GUI Grounding Task

Jikai Chen, Long Chen, Dong Wang, Leilei Gan, Chenyi Zhuang, Jinjie Gu

Main category: cs.AI

TL;DR: V2P method improves GUI element localization using suppression attention to reduce background distractions and Gaussian heatmaps for center-edge distinction, achieving 92.3% and 50.5% performance on benchmarks.

DetailsMotivation: Traditional GUI localization methods neglect spatial interaction uncertainty and visual-semantic hierarchies, while current attention-based approaches suffer from background distractions and fail to distinguish between center and edges of UI elements, leading to click imprecision.

Method: Proposes Valley-to-Peak (V2P) method with: 1) suppression attention mechanism to minimize focus on irrelevant background regions, and 2) Fitts’ Law-inspired 2D Gaussian heatmaps where weight decreases from center to edges with variance determined by target size.

Result: Achieves 92.3% performance on ScreenSpot-v2 benchmark and 50.5% on ScreenSpot-Pro benchmark. Ablation studies confirm each component’s contribution to the method’s effectiveness.

Conclusion: V2P effectively isolates target areas and teaches models to concentrate on the most essential points of UI elements, demonstrating strong generalizability for precise GUI grounding tasks.

Abstract: Precise localization of GUI elements is crucial for the development of GUI agents. Traditional methods rely on bounding box or center-point regression, neglecting spatial interaction uncertainty and visual-semantic hierarchies. Recent methods incorporate attention mechanisms but still face two key issues: (1) failing to account for background regions causes attention to drift from the desired area, and (2) uniform labeling fails to distinguish between center and edges of the target UI element, leading to click imprecision. Inspired by how humans visually process and interact with GUI elements, we propose the Valley-to-Peak (V2P) method to address these issues. To mitigate background distractions, V2P introduces a suppression attention mechanism that minimizes the model’s focus on irrelevant regions to highlight the intended region. For the issue of center-edge distinction, V2P applies a Fitts’ Law-inspired approach by modeling GUI interactions as 2D Gaussian heatmaps where the weight gradually decreases from the center towards the edges. The weight distribution follows a Gaussian function, with the variance determined by the target’s size. Consequently, V2P effectively isolates the target area and teaches the model to concentrate on the most essential point of the UI element. The model trained with V2P achieves 92.3% and 50.5% on the ScreenSpot-v2 and ScreenSpot-Pro benchmarks, respectively. Ablations further confirm each component’s contribution, highlighting V2P’s generalizability for precise GUI grounding tasks.
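
The Gaussian labeling scheme is concrete enough to illustrate directly. The sketch below builds such a target heatmap with NumPy; the `size_to_sigma` factor mapping element size to variance is an illustrative choice, not the paper’s calibrated value.

```python
import numpy as np

def gaussian_target_heatmap(h: int, w: int, box: tuple[int, int, int, int],
                            size_to_sigma: float = 0.25) -> np.ndarray:
    """2D Gaussian label for a UI element.

    box: (x_min, y_min, x_max, y_max) of the element in pixel coordinates.
    size_to_sigma: how element size maps to the Gaussian spread; 0.25 is an
    illustrative value, not the paper's calibrated one.
    """
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    sigma_x = max((x1 - x0) * size_to_sigma, 1.0)
    sigma_y = max((y1 - y0) * size_to_sigma, 1.0)
    ys, xs = np.mgrid[0:h, 0:w]
    heat = np.exp(-(((xs - cx) ** 2) / (2 * sigma_x ** 2)
                    + ((ys - cy) ** 2) / (2 * sigma_y ** 2)))
    return heat  # 1.0 at the center, decaying toward the edges and background

# Example: a 120x40 button centered in a 1280x720 screenshot.
label = gaussian_target_heatmap(720, 1280, (580, 340, 700, 380))
```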

[221] Interactive Query Answering on Knowledge Graphs with Soft Entity Constraints

Daniel Daza, Alberto Bernardi, Luca Costabello, Christophe Gueret, Masoud Mansoury, Michael Cochez, Martijn Schut

Main category: cs.AI

TL;DR: Neural Query Reranker (NQR) for answering queries with soft constraints over incomplete knowledge graphs, using interactive refinement with preference examples.

DetailsMotivation: Existing query answering methods focus on first-order-logic queries but real-world queries often involve vague or context-dependent constraints like preferences for attributes or categories.

Method: Propose Neural Query Reranker (NQR) that adjusts query answer scores by incorporating soft constraints interactively using incremental examples of preferred and non-preferred entities.

Result: Experiments on extended QA benchmarks with soft constraints show NQR can effectively capture soft constraints while maintaining robust query answering performance.

Conclusion: NQR successfully addresses the gap in handling soft constraints for query answering over incomplete knowledge graphs without disrupting original query answers.

Abstract: Methods for query answering over incomplete knowledge graphs retrieve entities that are likely to be answers, which is particularly useful when such answers cannot be reached by direct graph traversal due to missing edges. However, existing approaches have focused on queries formalized using first-order-logic. In practice, many real-world queries involve constraints that are inherently vague or context-dependent, such as preferences for attributes or related categories. Addressing this gap, we introduce the problem of query answering with soft constraints. We propose a Neural Query Reranker (NQR) designed to adjust query answer scores by incorporating soft constraints without disrupting the original answers to a query. NQR operates interactively, refining answers based on incremental examples of preferred and non-preferred entities. We extend existing QA benchmarks by generating datasets with soft constraints. Our experiments demonstrate that NQR can capture soft constraints while maintaining robust query answering performance.
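
NQR itself is a learned neural reranker; the toy sketch below only illustrates the interface: base answer scores are nudged toward entities similar to preferred examples and away from non-preferred ones. The embedding-similarity adjustment and the `weight` parameter are stand-in assumptions.

```python
import numpy as np

def rerank_with_soft_constraints(base_scores: dict[str, float],
                                 entity_vecs: dict[str, np.ndarray],
                                 preferred: list[str],
                                 non_preferred: list[str],
                                 weight: float = 0.5) -> list[tuple[str, float]]:
    """Adjust query-answer scores using incremental preference examples.

    base_scores: entity -> score from the underlying query-answering model.
    entity_vecs: entity -> embedding (any pretrained KG embedding would do).
    The similarity-based adjustment is an illustrative stand-in for NQR's
    learned reranker.
    """
    def centroid(names):
        return np.mean([entity_vecs[n] for n in names], axis=0) if names else None

    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    pos, neg = centroid(preferred), centroid(non_preferred)
    adjusted = {}
    for ent, score in base_scores.items():
        bonus = 0.0
        if pos is not None:
            bonus += cos(entity_vecs[ent], pos)   # pull toward preferred examples
        if neg is not None:
            bonus -= cos(entity_vecs[ent], neg)   # push away from non-preferred ones
        adjusted[ent] = score + weight * bonus
    return sorted(adjusted.items(), key=lambda kv: kv[1], reverse=True)
```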

[222] ITL-LIME: Instance-Based Transfer Learning for Enhancing Local Explanations in Low-Resource Data Settings

Rehan Raza, Guanjin Wang, Kevin Wong, Hamid Laga, Marco Fisichella

Main category: cs.AI

TL;DR: ITL-LIME enhances LIME by using instance transfer learning with real source domain instances instead of random perturbations, improving explanation fidelity and stability in data-scarce environments.

DetailsMotivation: LIME's randomness in perturbation and sampling causes locality and instability issues, especially with limited training data, leading to unrealistic variations and poor approximation of complex decision boundaries.

Method: Proposes ITL-LIME framework that uses clustering to partition source domain, retrieves relevant real source instances instead of random perturbations, employs contrastive learning-based encoder for weighting, and trains surrogate model with weighted instances.

Result: The method improves explanation fidelity and stability by leveraging real instances from related source domains and defining compact locality through intelligent instance selection and weighting.

Conclusion: ITL-LIME effectively addresses LIME’s limitations in data-constrained environments by incorporating instance transfer learning and contrastive learning, providing more reliable and stable explanations for black-box models.

Abstract: Explainable Artificial Intelligence (XAI) methods, such as Local Interpretable Model-Agnostic Explanations (LIME), have advanced the interpretability of black-box machine learning models by approximating their behavior locally using interpretable surrogate models. However, LIME’s inherent randomness in perturbation and sampling can lead to locality and instability issues, especially in scenarios with limited training data. In such cases, data scarcity can result in the generation of unrealistic variations and samples that deviate from the true data manifold. Consequently, the surrogate model may fail to accurately approximate the complex decision boundary of the original model. To address these challenges, we propose a novel Instance-based Transfer Learning LIME framework (ITL-LIME) that enhances explanation fidelity and stability in data-constrained environments. ITL-LIME introduces instance transfer learning into the LIME framework by leveraging relevant real instances from a related source domain to aid the explanation process in the target domain. Specifically, we employ clustering to partition the source domain into clusters with representative prototypes. Instead of generating random perturbations, our method retrieves pertinent real source instances from the source cluster whose prototype is most similar to the target instance. These are then combined with the target instance’s neighboring real instances. To define a compact locality, we further construct a contrastive learning-based encoder as a weighting mechanism to assign weights to the instances from the combined set based on their proximity to the target instance. Finally, these weighted source and target instances are used to train the surrogate model for explanation purposes.
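
A simplified sketch of the pipeline described above, with k-means prototypes and a plain RBF kernel standing in for the contrastive encoder; the function names and the ridge surrogate follow standard LIME practice and are assumptions rather than the authors’ exact code.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import Ridge

def itl_lime_sketch(target_x, target_neighbors, source_X, black_box_predict,
                    n_clusters=5, kernel_width=1.0):
    """Simplified ITL-LIME-style local explanation.

    Instead of random perturbations, reuse real source-domain instances from the
    cluster whose prototype is closest to the target instance, weight them by
    proximity (an RBF kernel here stands in for the contrastive encoder), and
    fit a linear surrogate on the black-box predictions.
    """
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(source_X)
    nearest = np.argmin(np.linalg.norm(km.cluster_centers_ - target_x, axis=1))
    retrieved = source_X[km.labels_ == nearest]          # real source instances

    X_local = np.vstack([retrieved, target_neighbors, target_x[None, :]])
    dists = np.linalg.norm(X_local - target_x, axis=1)
    weights = np.exp(-(dists ** 2) / (kernel_width ** 2))

    y_local = black_box_predict(X_local)                 # black-box model outputs
    surrogate = Ridge(alpha=1.0).fit(X_local, y_local, sample_weight=weights)
    return surrogate.coef_                               # local feature attributions
```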

[223] Knowledge Graph Completion for Action Prediction on Situational Graphs – A Case Study on Household Tasks

Mariam Arustashvili, Jörg Deigmöller, Heiko Paulheim

Main category: cs.AI

TL;DR: Standard link prediction algorithms perform poorly on situational knowledge graphs for household actions, failing to outperform simple baselines due to unique characteristics of this domain.

DetailsMotivation: Knowledge graphs for household actions are crucial for controlling robots and analyzing video footage, but video-extracted information is often incomplete, requiring effective knowledge graph completion methods.

Method: The paper investigates standard link prediction approaches applied to situational knowledge graphs describing household actions, comparing their performance against simple baselines.

Result: Many existing link prediction algorithms are not suitable for situational knowledge graphs and cannot outperform even simple baseline methods.

Conclusion: Situational knowledge graphs have special characteristics that require specialized link prediction approaches rather than standard algorithms.

Abstract: Knowledge Graphs are used for various purposes, including business applications, biomedical analyses, and digital twins in Industry 4.0. In this paper, we investigate knowledge graphs describing household actions, which are beneficial for controlling household robots and analyzing video footage. In the latter case, the information extracted from videos is notoriously incomplete, and completing the knowledge graph to enhance the situational picture is essential. We show that, although this is formally a standard link prediction problem, situational knowledge graphs have special characteristics that render many link prediction algorithms unfit for the job and unable to outperform even simple baselines.

[224] MHSNet:An MoE-based Hierarchical Semantic Representation Network for Accurate Duplicate Resume Detection with Large Language Model

Yu Li, Zulong Chen, Wenjian Xu, Hong Wen, Yipeng Yu, Man Lung Yiu, Yuyu Yin

Main category: cs.AI

TL;DR: MHSNet is a multi-level identity verification framework that uses fine-tuned BGE-M3 with contrastive learning and Mixture-of-Experts to detect duplicate resumes from third-party websites, addressing challenges of semantic complexity, structural heterogeneity, and information incompleteness.

DetailsMotivation: To improve the quality of third-party resumes and enrich company talent pools by detecting duplicates between fetched resumes and existing ones, overcoming the challenges of incomplete and inaccurate resume data from external sources.

Method: Fine-tunes BGE-M3 using contrastive learning, employs Mixture-of-Experts (MoE) to generate multi-level sparse and dense representations for resumes, and computes multi-level semantic similarities with state-aware MoE to handle incomplete resumes.

Result: Experimental results verify the effectiveness of MHSNet in resume duplication detection.

Conclusion: MHSNet provides an effective solution for resume duplication detection that handles the complex challenges of third-party resume data through multi-level semantic analysis and specialized MoE architecture.

Abstract: To maintain the company’s talent pool, recruiters need to continuously search for resumes from third-party websites (e.g., LinkedIn, Indeed). However, fetched resumes are often incomplete and inaccurate. To improve the quality of third-party resumes and enrich the company’s talent pool, it is essential to conduct duplication detection between the fetched resumes and those already in the company’s talent pool. Such duplication detection is challenging due to the semantic complexity, structural heterogeneity, and information incompleteness of resume texts. To this end, we propose MHSNet, a multi-level identity verification framework that fine-tunes BGE-M3 using contrastive learning. With the fine-tuned BGE-M3, a Mixture-of-Experts (MoE) module generates multi-level sparse and dense representations for resumes, enabling the computation of corresponding multi-level semantic similarities. Moreover, a state-aware Mixture-of-Experts (MoE) is employed in MHSNet to handle diverse incomplete resumes. Experimental results verify the effectiveness of MHSNet.

[225] Neuro-Symbolic Artificial Intelligence: Towards Improving the Reasoning Abilities of Large Language Models

Xiao-Wen Yang, Jie-Jing Shao, Lan-Zhe Guo, Bo-Wen Zhang, Zhi Zhou, Lin-Han Jia, Wang-Zhou Dai, Yu-Feng Li

Main category: cs.AI

TL;DR: This paper provides a comprehensive survey of neuro-symbolic approaches for enhancing reasoning capabilities in Large Language Models, categorizing methods into three perspectives and discussing future challenges.

DetailsMotivation: LLMs show promise but reasoning remains a fundamental challenge crucial for AGI development. Neuro-symbolic approaches offer a promising way to enhance LLM reasoning capabilities.

Method: The paper reviews recent neuro-symbolic developments through formalization of reasoning tasks, introduction to neurosymbolic learning paradigm, and analysis of three methodological perspectives: Symbolic->LLM, LLM->Symbolic, and LLM+Symbolic.

Result: The survey provides a structured framework for understanding and categorizing neuro-symbolic approaches to LLM reasoning enhancement, along with a released GitHub repository containing related papers and resources.

Conclusion: Neuro-symbolic methods represent a promising direction for improving LLM reasoning, though several key challenges remain that require further research and development in this emerging field.

Abstract: Large Language Models (LLMs) have shown promising results across various tasks, yet their reasoning capabilities remain a fundamental challenge. Developing AI systems with strong reasoning capabilities is regarded as a crucial milestone in the pursuit of Artificial General Intelligence (AGI) and has garnered considerable attention from both academia and industry. Various techniques have been explored to enhance the reasoning capabilities of LLMs, with neuro-symbolic approaches being a particularly promising way. This paper comprehensively reviews recent developments in neuro-symbolic approaches for enhancing LLM reasoning. We first present a formalization of reasoning tasks and give a brief introduction to the neurosymbolic learning paradigm. Then, we discuss neuro-symbolic methods for improving the reasoning capabilities of LLMs from three perspectives: Symbolic->LLM, LLM->Symbolic, and LLM+Symbolic. Finally, we discuss several key challenges and promising future directions. We have also released a GitHub repository including papers and resources related to this survey: https://github.com/LAMDASZ-ML/Awesome-LLM-Reasoning-with-NeSy.

[226] The DeepLog Neurosymbolic Machine

Vincent Derkinderen, Robin Manhaeve, Rik Adriaensen, Lucas Van Praet, Lennert De Smet, Giuseppe Marra, Luc De Raedt

Main category: cs.AI

TL;DR: DeepLog is a theoretical and operational framework for neurosymbolic AI that provides building blocks and primitives, featuring a language for specifying models and extended algebraic circuits for computation, enabling efficient GPU-based implementation and comparison of different neurosymbolic approaches.

DetailsMotivation: To create a unified framework that abstracts commonly used representations and computational mechanisms in neurosymbolic AI, allowing for representation and emulation of various neurosymbolic systems with different logics and implementation choices.

Method: Developed DeepLog with two key components: 1) DeepLog language - an annotated neural extension of grounded first-order logic for specifying models and inference tasks, and 2) Extended algebraic circuits as computational graphs. Implemented as a neurosymbolic abstract machine with software implementation leveraging GPU acceleration.

Result: The framework demonstrates generality and efficiency through experimental comparisons between different fuzzy and probabilistic logics, different uses of logic (architecture vs loss function), and performance comparisons between CPU-based standalone implementations and GPU-based DeepLog implementations.

Conclusion: DeepLog provides a comprehensive and efficient framework for neurosymbolic AI that enables easy experimentation with different algebraic structures and logics, offering both theoretical foundations and practical computational efficiency through GPU acceleration.

Abstract: We contribute a theoretical and operational framework for neurosymbolic AI called DeepLog. DeepLog introduces building blocks and primitives for neurosymbolic AI that make abstraction of commonly used representations and computational mechanisms used in neurosymbolic AI. DeepLog can represent and emulate a wide range of neurosymbolic systems. It consists of two key components. The first is the DeepLog language for specifying neurosymbolic models and inference tasks. This language consists of an annotated neural extension of grounded first-order logic, and makes abstraction of the type of logic, e.g. boolean, fuzzy or probabilistic, and whether logic is used in the architecture or in the loss function. The second DeepLog component is situated at the computational level and uses extended algebraic circuits as computational graphs. Together these two components are to be considered as a neurosymbolic abstract machine, with the DeepLog language as the intermediate level of abstraction and the circuits level as the computational one. DeepLog is implemented in software, relies on the latest insights in implementing algebraic circuits on GPUs, and is declarative in that it is easy to obtain different neurosymbolic models by making different choices for the underlying algebraic structures and logics. The generality and efficiency of the DeepLog neurosymbolic machine is demonstrated through an experimental comparison between 1) different fuzzy and probabilistic logics, 2) between using logic in the architecture or in the loss function, and 3) between a standalone CPU-based implementation of a neurosymbolic AI system and a DeepLog GPU-based one.

[227] CausalPlan: Empowering Efficient LLM Multi-Agent Collaboration Through Causality-Driven Planning

Minh Hoang Nguyen, Van Dai Do, Dung Nguyen, Thin Nguyen, Hung Le

Main category: cs.AI

TL;DR: CausalPlan framework integrates causal reasoning into LLM planning to reduce invalid actions and improve multi-agent coordination by learning causal relationships from agent trajectories.

DetailsMotivation: LLM agents, especially smaller open-source models, often produce causally invalid actions due to relying on surface-level correlations rather than grounded causal reasoning, which undermines their coordination and planning capabilities.

Method: Two-phase framework with Structural Causal Action (SCA) model that learns causal graphs from agent trajectories to capture how prior actions and environment states influence future decisions, then uses causal scores to guide action selection.

Result: CausalPlan consistently reduces invalid actions and improves collaboration in both AI-AI and human-AI settings across five multi-agent coordination tasks, outperforming reinforcement learning baselines on four LLMs of varying sizes.

Conclusion: Embedding causal knowledge directly into the decision loop enables intervention-consistent behaviors without LLM fine-tuning, demonstrating the value of causality-driven planning for efficient, interpretable, and generalizable multi-agent LLM systems.

Abstract: Large language model (LLM) agents-especially smaller, open-source models-often produce causally invalid or incoherent actions in collaborative tasks due to their reliance on surface-level correlations rather than grounded causal reasoning. This limitation undermines their performance in terms of coordination and planning in dynamic environments. We address this challenge with CausalPlan, a two-phase framework that integrates explicit structural causal reasoning into the LLM planning process. At the core of CausalPlan is the Structural Causal Action (SCA) model, which learns a causal graph from agent trajectories to capture how prior actions and current environment states influence future decisions. This structure is then used to guide action selection by assigning causal scores to LLM-generated proposals, reweighting them accordingly, or falling back to causally grounded alternatives when needed. By embedding this causal knowledge directly into the decision loop, CausalPlan constrains planning to intervention-consistent behaviours without requiring fine-tuning of the LLM itself. We evaluate CausalPlan on the Overcooked-AI benchmark across five multi-agent coordination tasks and four LLMs of varying sizes: Gemma-7B, Llama-8B, Qwen-14B, and Llama-70B. Experimental results show that CausalPlan consistently reduces invalid actions and improves collaboration in both AI-AI and human-AI settings, outperforming strong reinforcement learning baselines. Our findings highlight the value of causality-driven planning for deploying efficient, interpretable, and generalisable multi-agent LLM systems.
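
The abstract describes reweighting LLM proposals with causal scores and falling back to grounded alternatives. Below is a minimal sketch of that decision rule, with the SCA model abstracted as a lookup table of learned edge weights; the multiplicative combination and the fallback threshold are assumptions, not the paper’s exact scheme.

```python
def reweight_proposals(proposal_probs: dict[str, float],
                       causal_score: dict[tuple[str, str], float],
                       prev_action: str,
                       fallback_action: str,
                       min_mass: float = 0.05) -> str:
    """Pick an action by combining LLM proposal probabilities with causal scores.

    causal_score maps (prev_action, candidate_action) to a weight learned by the
    Structural Causal Action model from agent trajectories.
    """
    combined = {
        action: prob * causal_score.get((prev_action, action), 0.0)
        for action, prob in proposal_probs.items()
    }
    if sum(combined.values()) < min_mass:
        # No proposal is causally supported: fall back to a grounded alternative.
        return fallback_action
    return max(combined, key=combined.get)

# Example: the LLM favors "chop_onion", but after "pick_up_plate" the causal
# graph supports "deliver_dish", so the reweighted choice changes.
choice = reweight_proposals(
    {"chop_onion": 0.6, "deliver_dish": 0.4},
    {("pick_up_plate", "deliver_dish"): 0.9, ("pick_up_plate", "chop_onion"): 0.1},
    prev_action="pick_up_plate",
    fallback_action="wait",
)
```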

[228] Expertise-aware Multi-LLM Recruitment and Collaboration for Medical Decision-Making

Liuxin Bao, Zhihao Peng, Xiaofei Zhou, Runmin Cong, Jiyong Zhang, Yixuan Yuan

Main category: cs.AI

TL;DR: EMRC framework uses expertise-aware multi-LLM collaboration to improve medical decision-making accuracy by dynamically selecting optimal LLMs based on medical expertise and integrating their responses with confidence scoring.

DetailsMotivation: Single LLMs have limitations in medical decision-making due to parametric knowledge constraints and static training, failing to robustly integrate complex clinical information.

Method: Two-stage framework: (1) expertise-aware agent recruitment using LLM expertise table for dynamic selection, (2) confidence- and adversarial-driven multi-agent collaboration with confidence fusion and adversarial validation.

Result: Achieves 74.45% accuracy on MMLU-Pro-Health dataset, 2.69% improvement over GPT-4-0613, outperforms state-of-the-art single- and multi-LLM methods across three public MDM datasets.

Conclusion: EMRC demonstrates effectiveness of expertise-aware agent recruitment and agent complementarity in leveraging specialized LLM capabilities for superior medical diagnostic performance.

Abstract: Medical Decision-Making (MDM) is a complex process requiring substantial domain-specific expertise to effectively synthesize heterogeneous and complicated clinical information. While recent advancements in Large Language Models (LLMs) show promise in supporting MDM, single-LLM approaches are limited by their parametric knowledge constraints and static training corpora, failing to robustly integrate the clinical information. To address this challenge, we propose the Expertise-aware Multi-LLM Recruitment and Collaboration (EMRC) framework to enhance the accuracy and reliability of MDM systems. It operates in two stages: (i) expertise-aware agent recruitment and (ii) confidence- and adversarial-driven multi-agent collaboration. Specifically, in the first stage, we use a publicly available corpus to construct an LLM expertise table for capturing expertise-specific strengths of multiple LLMs across medical department categories and query difficulty levels. This table enables the subsequent dynamic selection of the optimal LLMs to act as medical expert agents for each medical query during the inference phase. In the second stage, we employ selected agents to generate responses with self-assessed confidence scores, which are then integrated through confidence fusion and adversarial validation to improve diagnostic reliability. We evaluate our EMRC framework on three public MDM datasets, where the results demonstrate that our EMRC outperforms state-of-the-art single- and multi-LLM methods, achieving superior diagnostic performance. For instance, on the MMLU-Pro-Health dataset, our EMRC achieves 74.45% accuracy, representing a 2.69% improvement over the best-performing closed-source model GPT-4-0613, which demonstrates the effectiveness of our expertise-aware agent recruitment strategy and the agent complementarity in leveraging each LLM’s specialized capabilities.
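
A toy sketch of the two stages described above, recruitment from an expertise table followed by confidence-weighted fusion; the adversarial-validation step is omitted, and all data structures are illustrative assumptions.

```python
from collections import defaultdict

def recruit_and_fuse(expertise_table: dict[tuple[str, str], dict[str, float]],
                     department: str, difficulty: str,
                     agent_answers: dict[str, tuple[str, float]],
                     top_k: int = 3) -> str:
    """Expertise-aware recruitment plus confidence fusion, in miniature.

    expertise_table: (department, difficulty) -> {llm_name: expertise score}.
    agent_answers:   llm_name -> (answer, self-assessed confidence in [0, 1]).
    """
    scores = expertise_table.get((department, difficulty), {})
    recruited = sorted(scores, key=scores.get, reverse=True)[:top_k]

    votes = defaultdict(float)
    for name in recruited:
        if name in agent_answers:
            answer, confidence = agent_answers[name]
            votes[answer] += confidence * scores[name]   # confidence-weighted fusion
    if not votes:
        raise ValueError("No recruited agent produced an answer.")
    return max(votes, key=votes.get)
```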

[229] Quantifier Instantiations: To Mimic or To Revolt?

Jan Jakubův, Mikoláš Janota

Main category: cs.AI

TL;DR: Novel instantiation approach using probabilistic context-free grammars to learn from existing techniques and generate similar terms for quantified formulas in SMT solving.

DetailsMotivation: Quantified formulas are challenging for SMT solvers due to undecidability, and existing instantiation techniques often complement each other but could benefit from dynamic learning.

Method: Treat observed instantiations as samples from a latent language and use probabilistic context-free grammars to generate new similar terms, optionally inverting learned probabilities for diversity.

Result: The method can mimic successful past instantiations while also exploring diversity through probability inversion.

Conclusion: This approach provides a way to balance exploitation and exploration in quantifier reasoning for SMT solvers.

Abstract: Quantified formulas pose a significant challenge for Satisfiability Modulo Theories (SMT) solvers due to their inherent undecidability. Existing instantiation techniques, such as e-matching, syntax-guided, model-based, conflict-based, and enumerative methods, often complement each other. This paper introduces a novel instantiation approach that dynamically learns from these techniques during solving. By treating observed instantiations as samples from a latent language, we use probabilistic context-free grammars to generate new, similar terms. Our method not only mimics successful past instantiations but also explores diversity by optionally inverting learned term probabilities, aiming to balance exploitation and exploration in quantifier reasoning.
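
A small, self-contained illustration of the mimic-or-revolt idea: sample terms from a probabilistic grammar, optionally inverting the production weights to favor previously rare shapes. The grammar and weights below are toy assumptions; in the described approach they would be estimated from instantiations observed during solving.

```python
import random

# Toy probabilistic grammar over first-order terms (weight, right-hand side).
GRAMMAR = {
    "T": [(0.5, ["f(", "T", ")"]),
          (0.3, ["x"]),
          (0.2, ["0"])],
}

def sample_term(symbol="T", grammar=GRAMMAR, invert=False, depth=0, max_depth=6):
    """Sample a term top-down; invert=True prefers previously rare productions."""
    if symbol not in grammar:                 # terminal symbol: emit as-is
        return symbol
    rules = grammar[symbol]
    if depth >= max_depth:                    # force termination near the depth limit
        rules = [r for r in rules if symbol not in r[1]] or rules
    weights = [1.0 / w if invert else w for w, _ in rules]
    _, rhs = random.choices(rules, weights=weights, k=1)[0]
    return "".join(sample_term(s, grammar, invert, depth + 1, max_depth) for s in rhs)

print(sample_term())              # mimic: terms shaped like past instantiations
print(sample_term(invert=True))   # revolt: favor shapes that were rare before
```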

[230] Revisiting RAG Ensemble: A Theoretical and Mechanistic Analysis of Multi-RAG System Collaboration

Yifei Chen, Guanting Dong, Yutao Zhu, Zhicheng Dou

Main category: cs.AI

TL;DR: This paper explores ensemble methods for Retrieval-Augmented Generation (RAG) systems, providing theoretical analysis from information entropy perspective and experimental validation across multiple pipeline and module configurations.

DetailsMotivation: Single RAG frameworks cannot adapt well to diverse downstream tasks, so leveraging multiple RAG systems through ensemble methods becomes necessary to improve performance and robustness.

Method: Comprehensive investigation of RAG ensemble framework through theoretical analysis (information entropy perspective) and mechanistic analysis (pipeline and module levels). Selected four pipelines (Branching, Iterative, Loop, Agentic) and three modules (Generator, Retriever, Reranker) to address seven research questions.

Result: Experiments demonstrate that aggregating multiple RAG systems is both generalizable and robust at both pipeline and module levels, showing improved performance across various configurations.

Conclusion: The work establishes a foundation for multi-RAG system ensemble research, proving that ensemble approaches enhance adaptability and performance of RAG technology across diverse applications.

Abstract: Retrieval-Augmented Generation (RAG) technology has been widely applied in recent years. However, despite the emergence of various RAG frameworks, a single RAG framework still cannot adapt well to a broad range of downstream tasks. Therefore, how to leverage the advantages of multiple RAG systems has become an area worth exploring. To address this issue, we have conducted a comprehensive and systematic investigation into ensemble methods based on RAG systems. Specifically, we have analyzed the RAG ensemble framework from both theoretical and mechanistic analysis perspectives. From the theoretical analysis, we provide the first explanation of the RAG ensemble framework from the perspective of information entropy. In terms of mechanism analysis, we have explored the RAG ensemble framework from both the pipeline and module levels. We carefully select four different pipelines (Branching, Iterative, Loop, and Agentic) and three different modules (Generator, Retriever, and Reranker) to solve seven different research questions. The experiments show that aggregating multiple RAG systems is both generalizable and robust, whether at the pipeline level or the module level. Our work lays the foundation for similar research on the multi-RAG system ensemble.

[231] Improved Generalized Planning with LLMs through Strategy Refinement and Reflection

Katharina Stein, Nils Hodel, Daniel Fišer, Jörg Hoffmann, Michael Katz, Alexander Koller

Main category: cs.AI

TL;DR: Improved LLM-based generalized planning using pseudocode debugging, reflection steps, and program variants to generate more reliable Python programs for PDDL domains.

DetailsMotivation: Previous LLM approaches generated only one strategy directly to Python, leading to incorrect generalized plans if the strategy was flawed. The paper aims to improve reliability by catching errors earlier in the process.

Method: Introduces pseudocode generation with automatic debugging, adds reflection steps to pinpoint failures, and generates multiple program variants to select the best one.

Result: Substantially improved generalized plan quality across 17 benchmark domains, with 12 domains achieving perfect solutions for all generated tasks.

Conclusion: The proposed extensions significantly enhance LLM-based generalized planning without deterioration, demonstrating robust performance across diverse PDDL domains.

Abstract: LLMs have recently been used to generate Python programs representing generalized plans in PDDL planning, i.e., plans that generalize across the tasks of a given PDDL domain. Previous work proposed a framework consisting of three steps: the LLM first generates a summary and then a strategy for the domain, both in natural language, and then implements that strategy as a Python program, that gets debugged on example planning tasks. In that work, only one strategy is generated and passed directly to the program generation. If the strategy is incorrect, its implementation will therefore result in an incorrect generalized plan. Here, we introduce an approach that generates the strategy in the form of pseudocode and enables automatic debugging of the pseudocode, hence allowing us to identify and fix errors prior to the generation of the generalized plan itself. Additionally, we extend the Python debugging phase with a reflection step prompting the LLM to pinpoint the reason for the observed plan failure. Finally, we take inspiration from LLM code generation to produce several program variants and pick the best one. Running experiments on 17 benchmark domains, we show that these extensions substantially improve (and never deteriorate) the quality of the generalized plans. In 12 of the domains, our best Python programs solve all tasks that can be generated with the respective instance generator.
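
The final generate-and-select step lends itself to a short sketch: produce several program variants, run each on example tasks, validate the resulting plans, and keep the variant that solves the most tasks. `run_program` and `validate_plan` are placeholders for the paper’s actual harness (e.g., a PDDL plan validator).

```python
def pick_best_generalized_plan(candidate_programs: list[str],
                               example_tasks: list[object],
                               run_program,
                               validate_plan):
    """Select the program variant that solves the most example tasks.

    run_program(program, task) -> plan (or raises on failure);
    validate_plan(task, plan) -> bool, e.g. backed by a plan validator.
    """
    best_program, best_solved = None, -1
    for program in candidate_programs:
        solved = 0
        for task in example_tasks:
            try:
                plan = run_program(program, task)
            except Exception:
                continue                      # a crash counts as an unsolved task
            if validate_plan(task, plan):
                solved += 1
        if solved > best_solved:
            best_program, best_solved = program, solved
    return best_program
```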

[232] Structured Agentic Workflows for Financial Time-Series Modeling with LLMs and Reflective Feedback

Yihao Ang, Yifan Bao, Lei Jiang, Jiajie Tao, Anthony K. H. Tung, Lukasz Szpruch, Hao Ni

Main category: cs.AI

TL;DR: TS-Agent is a modular agentic framework that automates time-series modeling for financial applications using LLMs for reasoning and code generation, outperforming traditional AutoML methods.

DetailsMotivation: Existing AutoML frameworks lack adaptability to domain-specific needs in financial time-series modeling, while LLMs offer potential for more flexible workflow automation but need structured guidance.

Method: A three-stage iterative process: model selection, code refinement, and fine-tuning, guided by contextual reasoning and experimental feedback using a planner agent with structured knowledge banks and curated libraries.

Result: Empirical evaluations show TS-Agent consistently outperforms state-of-the-art AutoML and agentic baselines in financial forecasting and synthetic data generation tasks, achieving superior accuracy, robustness, and traceability.

Conclusion: TS-Agent provides an effective framework for automated time-series modeling that combines the flexibility of LLMs with structured decision-making, meeting the high-stakes requirements of financial services through adaptive learning and transparent auditing.

Abstract: Time-series data is central to decision-making in financial markets, yet building high-performing, interpretable, and auditable models remains a major challenge. While Automated Machine Learning (AutoML) frameworks streamline model development, they often lack adaptability and responsiveness to domain-specific needs and evolving objectives. Concurrently, Large Language Models (LLMs) have enabled agentic systems capable of reasoning, memory management, and dynamic code generation, offering a path toward more flexible workflow automation. In this paper, we introduce TS-Agent, a modular agentic framework designed to automate and enhance time-series modeling workflows for financial applications. The agent formalizes the pipeline as a structured, iterative decision process across three stages: model selection, code refinement, and fine-tuning, guided by contextual reasoning and experimental feedback. Central to our architecture is a planner agent equipped with structured knowledge banks, curated libraries of models and refinement strategies, which guide exploration, while improving interpretability and reducing error propagation. TS-Agent supports adaptive learning, robust debugging, and transparent auditing, key requirements for high-stakes environments such as financial services. Empirical evaluations on diverse financial forecasting and synthetic data generation tasks demonstrate that TS-Agent consistently outperforms state-of-the-art AutoML and agentic baselines, achieving superior accuracy, robustness, and decision traceability.

[233] The Collaboration Paradox: Why Generative AI Requires Both Strategic Intelligence and Operational Stability in Supply Chain Management

Soumyadeep Dhar

Main category: cs.AI

TL;DR: AI agents in supply chains can paradoxically perform worse than traditional systems due to inventory hoarding, requiring a dual-layer framework combining AI policy-setting with collaborative execution for stability.

DetailsMotivation: To understand emergent strategic behaviors of AI-driven agents in economic settings, particularly in multi-echelon supply chains prone to instabilities like the bullwhip effect.

Method: Computational experiments with generative AI agents (LLMs) in controlled supply chain simulations, testing collaborative agents designed with Vendor-Managed Inventory principles against non-AI baselines.

Result: Discovered the “collaboration paradox” - collaborative AI agents perform worse than baselines due to inventory hoarding. Resilience requires combining AI-driven policy-setting with collaborative execution protocols.

Conclusion: Provides crucial insights into AI agent emergent behaviors and offers a blueprint for designing stable AI-driven business systems through a synthesis of high-level policy and low-level execution.

Abstract: The rise of autonomous, AI-driven agents in economic settings raises critical questions about their emergent strategic behavior. This paper investigates these dynamics in the cooperative context of a multi-echelon supply chain, a system famously prone to instabilities like the bullwhip effect. We conduct computational experiments with generative AI agents, powered by Large Language Models (LLMs), within a controlled supply chain simulation designed to isolate their behavioral tendencies. Our central finding is the “collaboration paradox”: a novel, catastrophic failure mode where theoretically superior collaborative AI agents, designed with Vendor-Managed Inventory (VMI) principles, perform even worse than non-AI baselines. We demonstrate that this paradox arises from an operational flaw where agents hoard inventory, starving the system. We then show that resilience is only achieved through a synthesis of two distinct layers: high-level, AI-driven proactive policy-setting to establish robust operational targets, and a low-level, collaborative execution protocol with proactive downstream replenishment to maintain stability. Our final framework, which implements this synthesis, can autonomously generate, evaluate, and quantify a portfolio of viable strategic choices. The work provides a crucial insight into the emergent behaviors of collaborative AI agents and offers a blueprint for designing stable, effective AI-driven systems for business analytics.

[234] ChronoLLM: Customizing Language Models for Physics-Based Simulation Code Generation

Jingquan Wang, Andrew Negrut, Harry Zhang, Khailanii Slaton, Shu Wang, Radu Serban, Jinlong Wu, Dan Negrut

Main category: cs.AI

TL;DR: LLMs can be refined to serve as virtual assistants for PyChrono simulation tool, generating scripts that serve as strong starting points for experts despite not being perfect.

DetailsMotivation: To investigate whether pretrained large language models can be customized to help experts effectively use simulation tools like PyChrono by generating simulation scripts and answering API questions.

Method: A framework for refining and customizing both open- and closed-source LLMs through a process that quantifiably improves the quality of generated PyChrono simulation scripts, ranging from simple to complex experiments.

Result: The refined LLMs generate PyChrono scripts that serve as strong starting points for users to modify and improve, and can answer specific API questions and recommend modeling approaches.

Conclusion: The framework is generalizable and can lower the entry barrier for simulation tools in other application domains by leveraging AI to assist experts with simulation script generation.

Abstract: This contribution is concerned with the following issue: can pretrained large language models (LLMs) be refined and customized to the point where they become virtual assistants helping experts with the effective use of a simulation tool? In this case study, the “simulation tool” considered is PyChrono, an open source multi-physics dynamics engine for multibody systems. We present a framework for refining and customizing both open- and closed-source LLMs to harness the power of AI in generating scripts that perform PyChrono virtual experiments. We refine and customize several classes of LLMs through a process that leads to a quantifiable improvement in the quality of the generated PyChrono simulation scripts. These scripts can range from simple single-pendulum simulations to complex virtual experiments involving full vehicles on deformable terrain. While the generated scripts are rarely perfect, they often serve as strong starting points for the user to modify and improve on. Additionally, the LLM can answer specific API questions about the simulator, or recommend modeling approaches. The framework discussed is general and can be applied to lower the entry barrier for simulation tools associated with other application domains.

[235] A Biased Random Key Genetic Algorithm for Solving the Longest Run Subsequence Problem

Christian Blum, Pedro Pinacho-Davidson

Main category: cs.AI

TL;DR: A Biased Random Key Genetic Algorithm (BRKGA) is proposed for solving the NP-hard Longest Run Subsequence problem, showing state-of-the-art performance compared to Max-Min Ant System and CPLEX solver.

DetailsMotivation: The LRS problem is an NP-hard combinatorial optimization problem from bioinformatics that plays a role in genome reassembly, requiring efficient computational solutions.

Method: Developed a BRKGA approach with focus on computational efficiency of evaluating individuals by converting vectors of gray values into valid solutions. Also implemented Max-Min Ant System and used CPLEX solver for comparison.

Result: The proposed BRKGA demonstrates state-of-the-art performance for the LRS problem, though results indicate room for improvement particularly with large alphabet sizes.

Conclusion: BRKGA is currently the best technique for LRS problem, but further improvements are needed especially for input strings with large alphabet sizes.

Abstract: The longest run subsequence (LRS) problem is an NP-hard combinatorial optimization problem belonging to the class of subsequence problems from bioinformatics. In particular, the problem plays a role in genome reassembly. In this paper, we present a solution to the LRS problem using a Biased Random Key Genetic Algorithm (BRKGA). Our approach places particular focus on the computational efficiency of evaluating individuals, which involves converting vectors of gray values into valid solutions to the problem. For comparison purposes, a Max-Min Ant System is developed and implemented. This is in addition to the application of the integer linear programming solver CPLEX for solving all considered problem instances. The computation results show that the proposed BRKGA is currently a state-of-the-art technique for the LRS problem. Nevertheless, the results also show that there is room for improvement, especially in the context of input strings based on large alphabet sizes.
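
As a rough illustration of the decoder described above, the sketch below assigns one random key per alphabet symbol, interprets the sorted keys as a symbol order, and computes the best run-structured subsequence consistent with that order via a small dynamic program. The encoding and fitness handling are simplified assumptions, not the paper's exact decoder.

```python
# Minimal sketch of a BRKGA-style decoder for the Longest Run Subsequence problem.
import random

def decode(keys: list[float], alphabet: list[str], s: str) -> int:
    """Interpret one random key per symbol as a priority; the induced symbol order
    fixes the sequence in which runs may appear. Returns the length of the best
    subsequence in which every symbol forms at most one run under that order."""
    order = [sym for _, sym in sorted(zip(keys, alphabet))]   # keys -> symbol permutation
    rank = {sym: j for j, sym in enumerate(order)}
    best = [0] * len(order)     # best[j]: longest valid subsequence currently ending in run j
    for c in s:
        j = rank[c]
        best[j] = max(best[: j + 1]) + 1
    return max(best, default=0)

alphabet = ["a", "b", "c"]
s = "abacbcab"
keys = [random.random() for _ in alphabet]        # one gene per alphabet symbol
print(decode(keys, alphabet, s))
```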

[236] ComputerRL: Scaling End-to-End Online Reinforcement Learning for Computer Use Agents

Hanyu Lai, Xiao Liu, Yanxiao Zhao, Han Xu, Hanchen Zhang, Bohao Jing, Yanyu Ren, Shuntian Yao, Yuxiao Dong, Jie Tang

Main category: cs.AI

TL;DR: ComputerRL is a framework for autonomous desktop intelligence that combines API calls and GUI interactions, using distributed RL training and Entropulse strategy to achieve state-of-the-art performance on desktop automation tasks.

DetailsMotivation: There's a fundamental mismatch between machine agents and human-centric desktop environments, and scaling end-to-end RL training for desktop automation remains challenging due to environmental inefficiency and training instability.

Method: Developed ComputerRL framework with API-GUI paradigm, distributed RL infrastructure for thousands of parallel virtual desktops, and Entropulse training strategy that alternates RL with supervised fine-tuning to prevent entropy collapse.

Result: AutoGLM-OS-9B based on GLM-4-9B-0414 achieved state-of-the-art 48.1% accuracy on OSWorld benchmark, demonstrating significant improvements for general agents in desktop automation.

Conclusion: ComputerRL successfully addresses desktop automation challenges through unified API-GUI interaction and scalable training infrastructure, enabling robust autonomous agents for complex digital workspaces.

Abstract: We introduce ComputerRL, a framework for autonomous desktop intelligence that enables agents to operate complex digital workspaces skillfully. ComputerRL features the API-GUI paradigm, which unifies programmatic API calls and direct GUI interaction to address the inherent mismatch between machine agents and human-centric desktop environments. Scaling end-to-end RL training is crucial for improvement and generalization across diverse desktop tasks, yet remains challenging due to environmental inefficiency and instability in extended training. To support scalable and robust training, we develop a distributed RL infrastructure capable of orchestrating thousands of parallel virtual desktop environments to accelerate large-scale online RL. Furthermore, we propose Entropulse, a training strategy that alternates reinforcement learning with supervised fine-tuning, effectively mitigating entropy collapse during extended training runs. We employ ComputerRL on open models GLM-4-9B-0414 and Qwen2.5-14B, and evaluate them on the OSWorld benchmark. The AutoGLM-OS-9B based on GLM-4-9B-0414 achieves a new state-of-the-art accuracy of 48.1%, demonstrating significant improvements for general agents in desktop automation. The algorithm and framework are adopted in building AutoGLM (Liu et al., 2024a)
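
The alternation at the heart of Entropulse can be illustrated with a deliberately tiny stand-in: a softmax policy on a three-armed bandit trained with REINFORCE, interleaved with a behavior-cloning pass over stored successful actions. This is only a toy sketch of the schedule; the actual system fine-tunes LLM agents on parallel desktop environments, and the entropy dynamics there are far more involved.

```python
# Toy, self-contained illustration of alternating an RL phase with a supervised phase.
import numpy as np

rng = np.random.default_rng(0)
reward_probs = np.array([0.2, 0.5, 0.8])      # 3-armed bandit stands in for the environment
logits = np.zeros(3)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def entropy(p):
    return -(p * np.log(p + 1e-12)).sum()

successes = []
for cycle in range(3):
    # "RL" phase: REINFORCE updates on sampled actions.
    for _ in range(200):
        p = softmax(logits)
        a = rng.choice(3, p=p)
        r = float(rng.random() < reward_probs[a])
        if r > 0:
            successes.append(a)
        grad = -p
        grad[a] += 1.0                      # grad of log pi(a) for a softmax policy
        logits += 0.1 * r * grad
    # "SFT" phase: replay stored successful actions with a small cross-entropy step.
    for a in successes[-100:]:
        p = softmax(logits)
        grad = -p
        grad[a] += 1.0
        logits += 0.01 * grad
    print(f"cycle {cycle}: policy entropy = {entropy(softmax(logits)):.3f}")
```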

[237] LEGO-GraphRAG: Modularizing Graph-based Retrieval-Augmented Generation for Design Space Exploration

Yukun Cao, Zengyi Gao, Zhiyang Li, Xike Xie, S. Kevin Zhou, Jianliang Xu

Main category: cs.AI

TL;DR: LEGO-GraphRAG is a modular framework that addresses gaps in GraphRAG by enabling workflow decomposition, technique classification, and creation of new instances for improved reasoning with knowledge graphs and LLMs.

DetailsMotivation: GraphRAG shows promise for improving reasoning accuracy by integrating knowledge graphs with LLMs, but lacks modular workflow analysis, systematic frameworks, and empirical studies.

Method: Proposes LEGO-GraphRAG framework with three key capabilities: fine-grained workflow decomposition, systematic classification of techniques, and creation of new GraphRAG instances.

Result: Enables comprehensive empirical studies on large-scale real-world graphs and diverse queries, providing insights into balancing reasoning quality, runtime efficiency, and computational costs.

Conclusion: LEGO-GraphRAG provides essential framework for building advanced GraphRAG systems by addressing current limitations and enabling systematic analysis and optimization.

Abstract: GraphRAG integrates (knowledge) graphs with large language models (LLMs) to improve reasoning accuracy and contextual relevance. Despite its promising applications and strong relevance to multiple research communities, such as databases and natural language processing, GraphRAG currently lacks modular workflow analysis, systematic solution frameworks, and insightful empirical studies. To bridge these gaps, we propose LEGO-GraphRAG, a modular framework that enables: 1) fine-grained decomposition of the GraphRAG workflow, 2) systematic classification of existing techniques and implemented GraphRAG instances, and 3) creation of new GraphRAG instances. Our framework facilitates comprehensive empirical studies of GraphRAG on large-scale real-world graphs and diverse query sets, revealing insights into balancing reasoning quality, runtime efficiency, and token or GPU cost, that are essential for building advanced GraphRAG systems.
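
A minimal sketch of the modular decomposition idea follows, with three swappable stages (subgraph retrieval, triple ranking, answer generation) composed into one "instance". The stage interfaces, toy graph, and scoring are assumptions made purely for illustration and do not reflect the framework's actual module boundaries.

```python
# Minimal sketch of composing a GraphRAG workflow from swappable stages.
from typing import Callable

graph = {
    "Marie Curie": [("won", "Nobel Prize in Physics"), ("born_in", "Warsaw")],
    "Warsaw": [("capital_of", "Poland")],
}

def extract_subgraph(query: str, graph: dict) -> list[tuple[str, str, str]]:
    # naive retrieval: keep edges whose head entity appears verbatim in the query
    return [(h, r, t) for h, edges in graph.items() if h in query for r, t in edges]

def rank_triples(triples: list[tuple[str, str, str]]) -> list[tuple[str, str, str]]:
    return sorted(triples, key=lambda t: len(t[1]))       # placeholder scoring rule

def generate_answer(query: str, triples: list[tuple[str, str, str]]) -> str:
    context = "; ".join(f"{h} {r} {t}" for h, r, t in triples)
    return f"[LLM would answer '{query}' given context: {context}]"

def build_instance(*stages: Callable):
    """Compose stages into one GraphRAG instance; swapping a stage yields a new instance."""
    def run(query: str) -> str:
        triples = stages[0](query, graph)
        triples = stages[1](triples)
        return stages[2](query, triples)
    return run

pipeline = build_instance(extract_subgraph, rank_triples, generate_answer)
print(pipeline("Where was Marie Curie born?"))
```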

[238] GoAI: Enhancing AI Students’ Learning Paths and Idea Generation via Graph of AI Ideas

Xian Gao, Zongyun Zhang, Ting Liu, Yuzhuo Fu

Main category: cs.AI

TL;DR: GoAI is a tool that builds educational knowledge graphs from AI research papers to help students identify prerequisite knowledge, trace field development through citation semantics, and plan personalized learning paths for innovation.

DetailsMotivation: AI students face an "information-to-innovation" gap where they struggle to navigate rapidly expanding literature, identify prerequisite knowledge, and understand how research methods build upon or challenge each other through citation relationships.

Method: Constructs knowledge graphs with nodes representing papers and prerequisite knowledge (concepts, skills, tools), and edges capturing semantic citation relationships. Uses beam search-based path search to trace field development from queried papers and plan learning paths. Includes Idea Studio for clarifying problems, comparing designs, and providing formative feedback.

Result: The system enables students to understand research field development trends, identify necessary learning prerequisites, and receive guidance on innovative concept development through structured feedback on novelty, clarity, feasibility, and alignment with learning objectives.

Conclusion: GoAI addresses critical gaps in AI education by leveraging semantic citation analysis and knowledge graphs to transform information overload into structured learning pathways that support both foundational knowledge acquisition and creative innovation in AI research.

Abstract: With the rapid advancement of artificial intelligence technology, AI students are confronted with a significant “information-to-innovation” gap: they must navigate through the rapidly expanding body of literature, trace the development of a specific research field, and synthesize various techniques into feasible innovative concepts. An additional critical step for students is to identify the necessary prerequisite knowledge and learning paths. Although many approaches based on large language models (LLMs) can summarize the content of papers and trace the development of a field through citations, these methods often overlook the prerequisite knowledge involved in the papers and the rich semantic information embedded in the citation relationships between papers. Such information reveals how methods are interrelated, built upon, extended, or challenged. To address these limitations, we propose GoAI, a tool for constructing educational knowledge graphs from AI research papers that leverages these graphs to plan personalized learning paths and support creative ideation. The nodes in the knowledge graph we have built include papers and the prerequisite knowledge, such as concepts, skills, and tools, that they involve; the edges record the semantic information of citations. When a student queries a specific paper, a beam search-based path search method can trace the current development trends of the field from the queried paper and plan a learning path toward cutting-edge objectives. The integrated Idea Studio guides students to clarify problem statements, compare alternative designs, and provide formative feedback on novelty, clarity, feasibility, and alignment with learning objectives.
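
The beam-search path planning mentioned above can be sketched over a toy citation/prerequisite graph, with edge weights standing in for the relevance scores a learner's query would induce. The graph, scores, and scoring rule are illustrative assumptions, not GoAI's actual knowledge graph or ranking.

```python
# Minimal sketch of a beam-search learning-path planner over a small paper graph.
graph = {
    "Attention Is All You Need": {"BERT": 0.9, "GPT": 0.8},
    "BERT": {"RoBERTa": 0.7},
    "GPT": {"GPT-2": 0.9},
    "RoBERTa": {}, "GPT-2": {},
}

def beam_search_paths(start: str, depth: int = 2, beam_width: int = 2):
    beams = [([start], 0.0)]
    for _ in range(depth):
        candidates = []
        for path, score in beams:
            successors = graph.get(path[-1], {})
            if not successors:
                candidates.append((path, score))      # keep finished paths in the beam
            for nxt, edge_score in successors.items():
                candidates.append((path + [nxt], score + edge_score))
        beams = sorted(candidates, key=lambda x: x[1], reverse=True)[:beam_width]
    return beams

for path, score in beam_search_paths("Attention Is All You Need"):
    print(" -> ".join(path), f"(score {score:.1f})")
```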

[239] PC-Sampler: Position-Aware Calibration of Decoding Bias in Masked Diffusion Models

Pengcheng Huang, Shuhao Liu, Zhenghao Liu, Yukun Yan, Shuo Wang, Zulong Chen, Tong Xiao

Main category: cs.AI

TL;DR: PC-Sampler is a novel decoding strategy for masked diffusion models that addresses limitations of existing uncertainty-based samplers by combining global trajectory planning with content-aware informativeness maximization, achieving over 10% average improvement across multiple benchmarks.

DetailsMotivation: Masked diffusion models (MDMs) show promise as non-autoregressive sequence generators, but their performance is highly sensitive to decoding strategies. Existing uncertainty-based samplers suffer from lack of global trajectory control and bias toward trivial tokens in early decoding stages, limiting MDMs' full potential.

Method: Position-Aware Confidence-Calibrated Sampling (PC-Sampler) incorporates a position-aware weighting mechanism to regulate decoding path and a calibrated confidence score to suppress premature selection of trivial tokens, unifying global trajectory planning with content-aware informativeness maximization.

Result: Extensive experiments on three advanced MDMs across seven challenging benchmarks (including logical reasoning and planning tasks) show PC-Sampler consistently outperforms existing MDM decoding strategies by more than 10% on average, significantly narrowing the performance gap with state-of-the-art autoregressive models.

Conclusion: PC-Sampler effectively addresses key limitations of current MDM decoding strategies, demonstrating substantial improvements in generation quality and bringing MDMs closer to the performance levels of autoregressive models while maintaining their non-autoregressive advantages.

Abstract: Recent advances in masked diffusion models (MDMs) have established them as powerful non-autoregressive alternatives for sequence generation. Nevertheless, our preliminary experiments reveal that the generation quality of MDMs is still highly sensitive to the choice of decoding strategy. In particular, widely adopted uncertainty-based samplers suffer from two key limitations: a lack of global trajectory control and a pronounced bias toward trivial tokens in the early stages of decoding. These shortcomings restrict the full potential of MDMs. In this work, we introduce Position-Aware Confidence-Calibrated Sampling (PC-Sampler), a novel decoding strategy that unifies global trajectory planning with content-aware informativeness maximization. PC-Sampler incorporates a position-aware weighting mechanism to regulate the decoding path and a calibrated confidence score to suppress the premature selection of trivial tokens. Extensive experiments on three advanced MDMs across seven challenging benchmarks, including logical reasoning and planning tasks, demonstrate that PC-Sampler consistently outperforms existing MDM decoding strategies by more than 10% on average, significantly narrowing the performance gap with state-of-the-art autoregressive models. All codes are available at https://github.com/NEUIR/PC-Sampler.
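
The selection rule below is a hedged sketch of the two ingredients named in the method: a position-aware weight that biases early decoding steps toward earlier positions, and a crude confidence calibration that damps near-saturated probabilities typical of trivial tokens. The functional forms, thresholds, and hyperparameters are assumptions for illustration only; they are not PC-Sampler's actual formulas.

```python
# Minimal sketch of position-aware, confidence-weighted selection of the next position
# to unmask in a masked diffusion decoder.
import numpy as np

def select_position(token_probs: np.ndarray, masked: np.ndarray, step: int,
                    decay: float = 0.2, trivial_penalty: float = 0.5) -> int:
    """token_probs: (seq_len,) max predicted probability at each position.
    masked: boolean array, True where the token is still masked."""
    positions = np.arange(token_probs.shape[0])
    # position-aware weight: prefer left-to-right early, relax as decoding proceeds
    position_weight = np.exp(-decay * positions / (step + 1))
    # crude calibration: damp near-saturated confidences that often mark trivial tokens
    calibrated = np.where(token_probs > 0.99, token_probs * trivial_penalty, token_probs)
    score = np.where(masked, position_weight * calibrated, -np.inf)
    return int(np.argmax(score))

probs = np.array([0.995, 0.6, 0.8, 0.4])
masked = np.array([True, True, True, True])
print(select_position(probs, masked, step=0))   # picks an informative, early position
```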

[240] Where to Go Next Day: Multi-scale Spatial-Temporal Decoupled Model for Mid-term Human Mobility Prediction

Zongyuan Huang, Weipeng Wang, Shaoyu Huang, Marta C. Gonzalez, Yaohui Jin, Yanyan Xu

Main category: cs.AI

TL;DR: Proposes MSTDP for mid-term mobility prediction using spatial-temporal decoupling and hierarchical modeling, achieving 62.8% MAE reduction in epidemic modeling.

DetailsMotivation: Current mobility prediction methods focus on short-term next-location prediction but lack support for broader applications like traffic management and epidemic control that require longer-term forecasts.

Method: Multi-scale Spatial-Temporal Decoupled Predictor (MSTDP) that decouples daily trajectories into location-duration chains, uses hierarchical encoder for multi-scale temporal patterns, transformer-based decoder, and spatial heterogeneous graph learner.

Result: Extensive experiments on mobile phone records from 5 cities show MSTDP significantly outperforms baselines. In Boston epidemic modeling, achieved 62.8% reduction in MAE for cumulative new cases.

Conclusion: MSTDP effectively addresses mid-term mobility prediction with superior performance, demonstrating practical value for applications like epidemic control through better spatial-temporal pattern capture.

Abstract: Predicting individual mobility patterns is crucial across various applications. While current methods mainly focus on predicting the next location for personalized services like recommendations, they often fall short in supporting broader applications such as traffic management and epidemic control, which require longer period forecasts of human mobility. This study addresses mid-term mobility prediction, aiming to capture daily travel patterns and forecast trajectories for the upcoming day or week. We propose a novel Multi-scale Spatial-Temporal Decoupled Predictor (MSTDP) designed to efficiently extract spatial and temporal information by decoupling daily trajectories into distinct location-duration chains. Our approach employs a hierarchical encoder to model multi-scale temporal patterns, including daily recurrence and weekly periodicity, and utilizes a transformer-based decoder to globally attend to predicted information in the location or duration chain. Additionally, we introduce a spatial heterogeneous graph learner to capture multi-scale spatial relationships, enhancing semantic-rich representations. Extensive experiments, including statistical physics analysis, are conducted on large-scale mobile phone records in five cities (Boston, Los Angeles, SF Bay Area, Shanghai, and Tokyo), to demonstrate MSTDP’s advantages. Applied to epidemic modeling in Boston, MSTDP significantly outperforms the best-performing baseline, achieving a remarkable 62.8% reduction in MAE for cumulative new cases.
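
A minimal sketch of the location-duration decoupling step is shown below, assuming each day's trajectory is already segmented into chronological stay records; the field names and merge rule are illustrative, not the paper's preprocessing.

```python
# Minimal sketch of decoupling a daily trajectory into a location chain and a duration chain.
from itertools import groupby

def decouple_trajectory(stays: list[tuple[str, int]]):
    """stays: chronological (location, minutes) records; merge consecutive repeats
    and return separate location and duration chains."""
    location_chain, duration_chain = [], []
    for loc, group in groupby(stays, key=lambda x: x[0]):
        location_chain.append(loc)
        duration_chain.append(sum(minutes for _, minutes in group))
    return location_chain, duration_chain

day = [("home", 420), ("home", 60), ("office", 480), ("gym", 90), ("home", 390)]
print(decouple_trajectory(day))
# (['home', 'office', 'gym', 'home'], [480, 480, 90, 390])
```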

[241] VRoPE: Rotary Position Embedding for Video Large Language Models

Zikang Liu, Longteng Guo, Yepeng Tang, Tongtian Yue, Junxian Cai, Kai Ma, Qingbin Liu, Xi Chen, Jing Liu

Main category: cs.AI

TL;DR: VRoPE is a novel positional encoding method for Video-LLMs that improves upon RoPE-3D by addressing attention bias and enabling smooth video-text transitions, achieving better performance on video understanding tasks.

DetailsMotivation: Existing RoPE adaptations for video (like RoPE-3D) suffer from positional bias in attention distribution and disruptions during video-text transitions, limiting their effectiveness for Video-LLMs.

Method: Proposes VRoPE with a more balanced encoding strategy to mitigate attention biases and ensure uniform spatial focus, plus restructures positional indices for smooth video-text token transitions.

Result: Extensive experiments show VRoPE consistently outperforms previous RoPE variants, achieving significant improvements in video understanding, temporal reasoning, and retrieval tasks.

Conclusion: VRoPE effectively addresses the limitations of existing positional encoding methods for video, providing a robust solution for Video-LLMs that handles spatiotemporal complexity and video-text integration.

Abstract: Rotary Position Embedding (RoPE) has shown strong performance in text-based Large Language Models (LLMs), but extending it to video remains a challenge due to the intricate spatiotemporal structure of video frames. Existing adaptations, such as RoPE-3D, attempt to encode spatial and temporal dimensions separately but suffer from two major limitations: positional bias in attention distribution and disruptions in video-text transitions. To overcome these issues, we propose Video Rotary Position Embedding (VRoPE), a novel positional encoding method tailored for Video-LLMs. Specifically, we introduce a more balanced encoding strategy that mitigates attention biases, ensuring a more uniform distribution of spatial focus. Additionally, our approach restructures positional indices to ensure a smooth transition between video and text tokens. Extensive experiments on different models demonstrate that VRoPE consistently outperforms previous RoPE variants, achieving significant improvements in video understanding, temporal reasoning, and retrieval tasks. Code will be available at https://github.com/johncaged/VRoPE.
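
For orientation, the sketch below applies a standard rotary position embedding to a single vector, the mechanism VRoPE builds on; VRoPE's contribution lies in how position indices are constructed for video patches and text tokens (balanced spatial encoding, smooth video-to-text transitions), which is not reproduced here.

```python
# Minimal sketch of a plain rotary position embedding applied to one query/key vector.
import numpy as np

def rope(x: np.ndarray, position: int, base: float = 10000.0) -> np.ndarray:
    """Rotate the (x1, x2) half-pairs of x (length d, d even) by position-dependent angles."""
    d = x.shape[0]
    half = d // 2
    freqs = base ** (-np.arange(half) / half)
    angles = position * freqs
    x1, x2 = x[:half], x[half:]
    return np.concatenate([x1 * np.cos(angles) - x2 * np.sin(angles),
                           x1 * np.sin(angles) + x2 * np.cos(angles)])

q = np.random.default_rng(0).normal(size=8)
print(rope(q, position=5))
```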

[242] The StudyChat Dataset: Student Dialogues With ChatGPT in an Artificial Intelligence Course

Hunter McNichols, Fareya Ikram, Andrew Lan

Main category: cs.AI

TL;DR: StudyChat dataset captures 16,851 real student interactions with an LLM tutoring chatbot in an AI course, showing that conceptual prompting correlates with better performance while circumventing learning objectives leads to worse outcomes.

DetailsMotivation: To understand how students actually use LLM-powered tutoring tools in real educational settings and analyze the relationship between interaction patterns and academic performance.

Method: Deployed a web application replicating ChatGPT’s functionality to log student interactions during programming assignments in a university AI course, collected 16,851 interactions, and annotated them using a dialogue act labeling schema.

Result: Students who prompted LLMs for conceptual understanding and coding help performed better on assignments and exams, while those who used LLMs to write reports and circumvent learning objectives had lower exam outcomes.

Conclusion: The StudyChat dataset provides valuable insights into student-LLM interactions and serves as a shared resource for further research on LLMs’ evolving role in education, highlighting both productive and problematic usage patterns.

Abstract: The widespread availability of large language models (LLMs), such as ChatGPT, has significantly impacted education, raising both opportunities and challenges. Students can frequently interact with LLM-powered, interactive learning tools, but their usage patterns need to be monitored and understood. We introduce StudyChat, a publicly available dataset capturing real-world student interactions with an LLM-powered tutoring chatbot in a semester-long, university-level artificial intelligence (AI) course. We deploy a web application that replicates ChatGPT's core functionalities, and use it to log student interactions with the LLM while working on programming assignments. We collect 16,851 interactions, which we annotate using a dialogue act labeling schema inspired by observed interaction patterns and prior research. We analyze these interactions, highlight usage trends, and analyze how specific student behavior correlates with their course outcome. We find that students who prompt LLMs for conceptual understanding and coding help tend to perform better on assignments and exams. Moreover, students who use LLMs to write reports and circumvent assignment learning objectives have lower outcomes on exams than others. StudyChat serves as a shared resource to facilitate further research on the evolving role of LLMs in education.

[243] Hawkeye: Efficient Reasoning with Model Collaboration

Jianshu She, Zhuohao Li, Zhemin Huang, Qi Li, Peiran Xu, Haonan Li, Qirong Ho

Main category: cs.AI

TL;DR: HAWKEYE is a framework that reduces Chain-of-Thought token redundancy by using large models to generate concise CoT instructions for smaller models, achieving comparable quality with 35% fewer tokens and 3.4x speedup.

DetailsMotivation: Chain-of-Thought reasoning generates excessive intermediate tokens causing semantic redundancy, high computational costs, and latency issues that scale with output token count.

Method: HAWKEYE uses reinforcement learning to quantify CoT redundancy and distill high-density information, where a large model produces concise CoT instructions to guide a smaller model’s response generation.

Result: Achieves comparable response quality using only 35% of full CoTs, improves clarity/coherence/conciseness by ~10%, accelerates reasoning by 3.4x on math tasks, and reduces inference cost by 60%.

Conclusion: HAWKEYE effectively addresses CoT efficiency problems by eliminating redundant tokens while maintaining reasoning quality, offering significant computational savings and performance improvements.

Abstract: Chain-of-Thought (CoT) reasoning has demonstrated remarkable effectiveness in enhancing the reasoning abilities of large language models (LLMs). However, its efficiency remains a challenge due to the generation of excessive intermediate reasoning tokens, which introduce semantic redundancy and overly detailed reasoning steps. Moreover, computational expense and latency are significant concerns, as the cost scales with the number of output tokens, including those intermediate steps. In this work, we observe that most CoT tokens are unnecessary, and retaining only a small portion of them is sufficient for producing high-quality responses. Inspired by this, we propose HAWKEYE, a novel post-training and inference framework where a large model produces concise CoT instructions to guide a smaller model in response generation. HAWKEYE quantifies redundancy in CoT reasoning and distills high-density information via reinforcement learning. By leveraging these concise CoTs, HAWKEYE is able to expand responses while reducing token usage and computational cost significantly. Our evaluation shows that HAWKEYE can achieve comparable response quality using only 35% of the full CoTs, while improving clarity, coherence, and conciseness by approximately 10%. Furthermore, HAWKEYE can accelerate end-to-end reasoning by up to 3.4x on complex math tasks while reducing inference cost by up to 60%. HAWKEYE will be open-sourced and the models will be available soon.
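
The collaboration pattern reads naturally as a two-call pipeline; the sketch below uses canned stand-in functions in place of real model endpoints (hypothetical names, not HAWKEYE's code or API), just to show where the concise CoT sits between the large and small models.

```python
# Minimal sketch of large-model-guides-small-model inference with a concise CoT.
def call_large_model(question: str) -> str:
    # in practice: a strong LLM prompted to emit only the essential reasoning steps
    return "1) total = 12 * 7  2) subtract 15  3) report result"

def call_small_model(question: str, concise_cot: str) -> str:
    # in practice: a cheaper LLM that expands the outline into the final response
    return f"Following the outline ({concise_cot}): 12 * 7 = 84, 84 - 15 = 69. Answer: 69"

def hawkeye_style_inference(question: str) -> str:
    concise_cot = call_large_model(question)          # short, information-dense CoT
    return call_small_model(question, concise_cot)    # cheap expansion step

print(hawkeye_style_inference("A box holds 12 rows of 7 apples; 15 are removed. How many remain?"))
```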

[244] Two Heads are Better Than One: Test-time Scaling of Multi-agent Collaborative Reasoning

Can Jin, Hongwu Peng, Qixin Zhang, Yujin Tang, Dimitris N. Metaxas, Tong Che

Main category: cs.AI

TL;DR: A novel adaptive multi-agent framework that combines model-level training with system-level coordination to enhance collaborative reasoning, achieving significant performance improvements on complex reasoning tasks.

DetailsMotivation: Multi-agent systems built on LLMs show promise for complex tasks but lack effective scaling methods for collaboration and reasoning compared to single-agent test-time scaling advancements.

Method: Created M500 dataset with 500 multi-agent reasoning traces, fine-tuned Qwen2.5-32B to produce M1-32B model, and introduced a CEO agent for dynamic discussion management and adaptive reasoning depth adjustment.

Result: Achieved 12% improvement on GPQA-Diamond, 41% on AIME2024, and 10% on MBPP-Sanitized, matching state-of-the-art models like DeepSeek-R1 on some tasks.

Conclusion: Both learned collaboration through fine-tuning and adaptive coordination via CEO agent are crucial for scaling multi-agent reasoning performance.

Abstract: Multi-agent systems (MAS) built on large language models (LLMs) offer a promising path toward solving complex, real-world tasks that single-agent systems often struggle to manage. While recent advancements in test-time scaling (TTS) have significantly improved single-agent performance on challenging reasoning tasks, how to effectively scale collaboration and reasoning in MAS remains an open question. In this work, we introduce an adaptive multi-agent framework designed to enhance collaborative reasoning through both model-level training and system-level coordination. We construct M500, a high-quality dataset containing 500 multi-agent collaborative reasoning traces, and fine-tune Qwen2.5-32B-Instruct on this dataset to produce M1-32B, a model optimized for multi-agent collaboration. To further enable adaptive reasoning, we propose a novel CEO agent that dynamically manages the discussion process, guiding agent collaboration and adjusting reasoning depth for more effective problem-solving. Evaluated in an open-source MAS across a range of tasks, including general understanding, mathematical reasoning, and coding, our system significantly outperforms strong baselines. For instance, M1-32B achieves 12% improvement on GPQA-Diamond, 41% on AIME2024, and 10% on MBPP-Sanitized, matching the performance of state-of-the-art models like DeepSeek-R1 on some tasks. These results highlight the importance of both learned collaboration and adaptive coordination in scaling multi-agent reasoning. Code is available at https://github.com/jincan333/MAS-TTS

[245] Hierarchical Reinforcement Learning in Multi-Goal Spatial Navigation with Autonomous Mobile Robots

Brendon Johnson, Alfredo Weitzenfeld

Main category: cs.AI

TL;DR: HRL outperforms traditional RL in complex robotic navigation tasks through sub-goal creation and termination functions, with experiments showing advantages in various HRL configurations.

DetailsMotivation: To evaluate and contrast hierarchical reinforcement learning (HRL) with traditional RL in complex robotic navigation tasks, leveraging inherent hierarchy where traditional RL often fails.

Method: Constructed experiments comparing RL PPO with HRL, testing different sub-goal creation methods (manual vs automatic), and examining termination frequency effects on HRL performance.

Result: HRL demonstrates advantages over traditional RL, with experiments highlighting how sub-goal creation and termination functions contribute to improved performance in complex navigation tasks.

Conclusion: HRL effectively leverages task hierarchy through sub-goal creation and termination mechanisms, providing superior performance compared to traditional RL approaches in complex robotic navigation scenarios.

Abstract: Hierarchical reinforcement learning (HRL) is hypothesized to be able to leverage the inherent hierarchy in learning tasks where traditional reinforcement learning (RL) often fails. In this research, HRL is evaluated and contrasted with traditional RL in complex robotic navigation tasks. We evaluate unique characteristics of HRL, including its ability to create sub-goals and the termination functions. We constructed a number of experiments to test: 1) the differences between RL proximal policy optimization (PPO) and HRL, 2) different ways of creating sub-goals in HRL, 3) manual vs automatic sub-goal creation in HRL, and 4) the effects of the frequency of termination on performance in HRL. These experiments highlight the advantages of HRL over RL and how it achieves these advantages.
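
The sub-goal and termination machinery being compared can be sketched as the control loop below, with hand-coded placeholder policies standing in for the learned high- and low-level policies; the grid, waypoint rule, and step budget are illustrative assumptions, not the paper's setup.

```python
# Minimal sketch of a hierarchical control loop: sub-goal selection plus termination.
def high_level_policy(state, goal):
    # stand-in for a learned option selector: propose a waypoint halfway to the goal
    waypoint = ((state[0] + goal[0]) // 2, (state[1] + goal[1]) // 2)
    return waypoint if waypoint != state else goal

def low_level_step(state, subgoal):
    # stand-in for a learned navigation policy: one greedy grid step toward the sub-goal
    dx = (subgoal[0] > state[0]) - (subgoal[0] < state[0])
    dy = (subgoal[1] > state[1]) - (subgoal[1] < state[1])
    return (state[0] + dx, state[1] + dy)

def run_episode(start=(0, 0), goal=(6, 4), max_low_steps=10):
    state, steps = start, 0
    while state != goal:
        subgoal = high_level_policy(state, goal)
        for _ in range(max_low_steps):          # termination: sub-goal reached or budget spent
            state = low_level_step(state, subgoal)
            steps += 1
            if state == subgoal or state == goal:
                break
    return steps

print(run_episode())    # number of primitive steps taken to reach the goal
```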

[246] It’s the Thought that Counts: Evaluating the Attempts of Frontier LLMs to Persuade on Harmful Topics

Matthew Kowal, Jasper Timm, Jean-Francois Godbout, Thomas Costello, Antonio A. Arechar, Gordon Pennycook, David Rand, Adam Gleave, Kellin Pelrine

Main category: cs.AI

TL;DR: APE benchmark evaluates LLMs’ willingness to attempt persuasion on harmful topics, revealing safety gaps in current models where many frequently engage in harmful persuasion attempts.

DetailsMotivation: To address the overlooked risk of LLMs blindly following orders to persuade on harmful topics (e.g., terrorism glorification) and understand when models engage in persuasive behavior for agentic AI systems.

Method: Multi-turn conversational setup between simulated persuader and persuadee agents across diverse harmful topics, with automated evaluator model to identify persuasion willingness and measure attempt frequency.

Result: Many open and closed-weight LLMs frequently attempt persuasion on harmful topics, and jailbreaking increases this willingness, highlighting safety guardrail deficiencies.

Conclusion: Evaluating willingness to persuade is crucial for LLM risk assessment, and current safety measures have significant gaps that need addressing.

Abstract: Persuasion is a powerful capability of large language models (LLMs) that both enables beneficial applications (e.g. helping people quit smoking) and raises significant risks (e.g. large-scale, targeted political manipulation). Prior work has found models possess a significant and growing persuasive capability, measured by belief changes in simulated or real users. However, these benchmarks overlook a crucial risk factor: the propensity of a model to attempt to persuade in harmful contexts. Understanding whether a model will blindly "follow orders" to persuade on harmful topics (e.g. glorifying joining a terrorist group) is key to understanding the efficacy of safety guardrails. Moreover, understanding if and when a model will engage in persuasive behavior in pursuit of some goal is essential to understanding the risks from agentic AI systems. We propose the Attempt to Persuade Eval (APE) benchmark, which shifts the focus from persuasion success to persuasion attempts, operationalized as a model's willingness to generate content aimed at shaping beliefs or behavior. Our evaluation framework probes frontier LLMs using a multi-turn conversational setup between simulated persuader and persuadee agents. APE explores a diverse spectrum of topics including conspiracies, controversial issues, and non-controversially harmful content. We introduce an automated evaluator model to identify willingness to persuade and measure the frequency and context of persuasive attempts. We find that many open and closed-weight models are frequently willing to attempt persuasion on harmful topics and that jailbreaking can increase willingness to engage in such behavior. Our results highlight gaps in current safety guardrails and underscore the importance of evaluating willingness to persuade as a key dimension of LLM risk. APE is available at github.com/AlignmentResearch/AttemptPersuadeEval

Hudson de Martim

Main category: cs.AI

TL;DR: A temporal modeling pattern for legal norm evolution using LRMoo ontology, distinguishing between semantic Temporal Versions and concrete Language Versions, enabling precise point-in-time reconstruction of legal texts.

DetailsMotivation: To address the challenge of representing temporal evolution of legal norms at component level for reliable AI applications, overcoming limitations of current frameworks and generative models.

Method: Proposes a pattern using LRMoo ontology with diachronic chains of F2 Expressions, distinguishing Temporal Versions (semantic snapshots) from Language Versions (monolingual realizations), applied recursively to legal text structure with formal amendment process tracing.

Result: Case study on Brazilian Federal Constitution demonstrates fine-grained, event-centric architecture enables precise deterministic retrieval and reconstruction of any legal text part at specific dates.

Conclusion: Provides robust foundation for verifiable knowledge graphs and advanced AI tools by enabling deterministic point-in-time reconstruction of legal texts through granular versioning pattern.

Abstract: Effectively representing the temporal evolution of legal norms at the component level is a critical challenge. While frameworks like IFLA LRMoo and standards like Akoma Ntoso provide generic toolkits, a dedicated pattern for granular versioning is needed to enable the deterministic point-in-time reconstruction of legal texts required by reliable AI applications. This paper proposes a temporal modeling pattern grounded in the LRMoo ontology that models a norm’s evolution as a diachronic chain of F2 Expressions. We introduce a key distinction between a language-agnostic Temporal Version (TV) - a semantic snapshot of the norm’s structure - and its concrete monolingual realizations, the Language Versions (LV). Both are modeled as F2 Expressions linked by the canonical R76 is derivative of property. The model applies this paradigm recursively, representing the legal text’s internal structure as a parallel hierarchy of abstract Component Works (F1 Work) and their versioned Component Expressions (F2 Expression). Furthermore, we formalize the amendment process using the F28 Expression Creation event, allowing changes to be traced from a specific provision in an amending act to its precise effect on the amended norm. A case study on the Brazilian Federal Constitution demonstrates how this fine-grained, event-centric architecture enables the precise, deterministic retrieval and reconstruction of any part of a legal text at a specific date. The model provides a robust foundation for building verifiable knowledge graphs and advanced AI tools, overcoming the limitations of current generative models.
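
A hedged sketch of the pattern as plain triples follows: an abstract norm as an F1 Work, two Temporal Versions as F2 Expressions chained by R76, and a deterministic point-in-time lookup. The identifiers, validity property, and dates are invented for illustration and are not the paper's URIs or the Constitution's actual amendment history.

```python
# Minimal sketch of the diachronic versioning pattern and point-in-time retrieval.
from datetime import date

triples = [
    ("norm:ConstitutionArt5", "rdf:type", "lrmoo:F1_Work"),
    ("norm:Art5_TV_1988", "rdf:type", "lrmoo:F2_Expression"),
    ("norm:Art5_TV_2004", "rdf:type", "lrmoo:F2_Expression"),
    ("norm:Art5_TV_2004", "lrmoo:R76_is_derivative_of", "norm:Art5_TV_1988"),
    ("norm:Art5_TV_1988", "ex:valid_from", date(1988, 10, 5)),
    ("norm:Art5_TV_2004", "ex:valid_from", date(2004, 12, 31)),
]

def temporal_version_at(query_date: date):
    """Deterministic point-in-time lookup: latest Temporal Version valid on query_date."""
    versions = [(o, s) for s, p, o in triples if p == "ex:valid_from" and o <= query_date]
    return max(versions)[1] if versions else None

print(temporal_version_at(date(2000, 1, 1)))     # -> norm:Art5_TV_1988
print(temporal_version_at(date(2010, 1, 1)))     # -> norm:Art5_TV_2004
```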

[248] Efficient Network Automatic Relevance Determination

Hongwei Zhang, Ziqi Ye, Xinyuan Wang, Xin Guo, Zenglin Xu, Yuan Cheng, Zixin Hu, Yuan Qi

Main category: cs.AI

TL;DR: NARD extends Automatic Relevance Determination to linear probabilistic models for sparse input-output relationships while capturing output correlations, with efficient computational methods that reduce complexity from O(m³+d³) to O(m³+p²) per iteration.

DetailsMotivation: To simultaneously model sparse relationships between inputs and outputs while capturing output correlations, addressing computational inefficiencies in traditional ARD methods for high-dimensional data.

Method: Uses matrix normal prior with sparsity-inducing parameter, iteratively updates precision matrix and relationships. Introduces Sequential NARD for feature evaluation and Surrogate Function Method for efficient marginal likelihood approximation.

Result: Significant computational efficiency improvements with comparable performance on synthetic and real-world datasets, reducing iteration costs to O(m³+p³), O(m³+d²), and O(m³+p²) respectively.

Conclusion: NARD provides an efficient framework for sparse linear probabilistic modeling with output correlation capture, offering substantial computational savings while maintaining performance.

Abstract: We propose Network Automatic Relevance Determination (NARD), an extension of ARD for linear probabilistic models, to simultaneously model sparse relationships between inputs $X \in \mathbb R^{d \times N}$ and outputs $Y \in \mathbb R^{m \times N}$, while capturing the correlation structure among the $Y$. NARD employs a matrix normal prior which contains a sparsity-inducing parameter to identify and discard irrelevant features, thereby promoting sparsity in the model. Algorithmically, it iteratively updates both the precision matrix and the relationship between $Y$ and the refined inputs. To mitigate the computational inefficiencies of the $\mathcal O(m^3 + d^3)$ cost per iteration, we introduce Sequential NARD, which evaluates features sequentially, and a Surrogate Function Method, leveraging an efficient approximation of the marginal likelihood and simplifying the calculation of determinant and inverse of an intermediate matrix. Combining the Sequential update with the Surrogate Function method further reduces computational costs. The computational complexity per iteration for these three methods is reduced to $\mathcal O(m^3+p^3)$, $\mathcal O(m^3 + d^2)$, $\mathcal O(m^3+p^2)$, respectively, where $p \ll d$ is the final number of features in the model. Our methods demonstrate significant improvements in computational efficiency with comparable performance on both synthetic and real-world datasets.
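
For orientation, the sketch below implements classic single-output ARD with MacKay-style updates, the mechanism NARD generalizes to matrix-variate outputs with a learned output precision matrix; the extension itself, and the Sequential/Surrogate accelerations, are not shown, and the noise precision is fixed for simplicity.

```python
# Minimal sketch of classic ARD (evidence-approximation updates) for Bayesian linear regression.
import numpy as np

def ard_linear(X, y, n_iter=50, beta=100.0):
    """X: (N, d) inputs, y: (N,) targets. Returns per-feature precisions alpha and weight means."""
    N, d = X.shape
    alpha = np.ones(d)
    m = np.zeros(d)
    for _ in range(n_iter):
        # posterior over weights given current relevance parameters
        S = np.linalg.inv(np.diag(alpha) + beta * X.T @ X)
        m = beta * S @ X.T @ y
        # MacKay-style update: gamma_i measures how well-determined each weight is
        gamma = 1.0 - alpha * np.diag(S)
        alpha = np.clip(gamma / (m ** 2 + 1e-12), 1e-6, 1e6)   # huge alpha => feature pruned
    return alpha, m

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.05 * rng.normal(size=200)   # only 2 relevant features
alpha, m = ard_linear(X, y)
print(np.round(alpha, 2), np.round(m, 2))
```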

[249] Dispositions and Roles of Generically Dependent Entities

Fabian Neuhaus

Main category: cs.AI

TL;DR: BFO 2020 lacks support for functions, dispositions, and roles of generically dependent continuants like software/datasets, limiting adequate representation of computer models and data roles. Two solutions proposed: defined classes or BFO modifications.

DetailsMotivation: BFO 2020's inability to represent realizable entities of generically dependent continuants prevents proper modeling of software functions, dataset roles in computer models, and roles/dispositions of immaterial entities like boundaries and sites.

Method: Two approaches presented: (1) using defined classes to work around limitations, and (2) proposing structural changes to BFO 2020 to directly support functions, dispositions, and roles for generically dependent continuants and immaterial entities.

Result: The paper identifies specific limitations in BFO 2020’s ontology framework and provides both temporary workarounds (defined classes) and fundamental solutions (BFO modifications) to enable proper representation of software functions, dataset roles, and immaterial entity properties.

Conclusion: BFO 2020 requires extension or modification to adequately represent the functional and dispositional properties of generically dependent continuants and immaterial entities, which is crucial for comprehensive ontological modeling of modern computational systems and their components.

Abstract: BFO 2020 does not support functions, dispositions, and roles of generically dependent continuants (like software or datasets). In this paper, we argue that this is a severe limitation, which prevents, for example, the adequate representation of the functions of computer models or the various roles of datasets during the execution of these models. We discuss the aspects of BFO 2020 that prevent the representation of realizable entities of generically dependent continuants. Two approaches to address the issue are presented: (a) the use of defined classes and (b) a proposal of changes that allow BFO to support functions, dispositions, and roles of generically dependent continuants. The latter also addresses limitations of BFO 2020 concerning the roles and dispositions of immaterial entities, particularly boundaries and sites.

[250] Towards Urban Planning AI Agent in the Age of Agentic AI

Yanjie Fu, Dongjie Wang

Main category: cs.AI

TL;DR: The paper identifies limitations in current generative AI approaches to urban planning and proposes a new direction combining agentic AI with participatory urbanism.

DetailsMotivation: To address the gaps in existing generative urban planning studies where AI structures are predefined by humans and ignore domain expert tools, creating an opportunity for more effective AI urban planners.

Method: The paper analyzes current generative AI approaches (adversarial networks, diffusion models, hierarchical structures) and identifies their limitations, then outlines a research direction for agentic urban AI planners that integrates domain expert tools and participatory approaches.

Result: The analysis reveals that current generative urban planning AI requires predefined structures by humans and ignores valuable domain-specific tools developed by urban planning practitioners.

Conclusion: A new synthesis of agentic AI and participatory urbanism is needed to create more effective AI urban planners that leverage domain expertise and overcome current limitations in generative approaches.

Abstract: Generative AI, large language models, and agentic AI have emerged separately from urban planning. However, the convergence between AI and urban planning presents an interesting opportunity towards AI urban planners. Existing studies conceptualize urban planning as a generative AI task, where AI synthesizes land-use configurations under geospatial, social, and human-centric constraints and reshapes automated urban design. We further identify critical gaps in existing generative urban planning studies: 1) the generative structure has to be predefined under strong assumptions: the adversarial generator-discriminator, forward and inverse diffusion, and hierarchical zone-POI generative structures are all predefined by humans; 2) they ignore the power of tools developed by domain experts: urban planning practitioners have built various tools, guided by urban theory, for use throughout the planning process, while existing purely neural generation approaches ignore them. To address these limitations, we outline a future research direction, the agentic urban AI planner, calling for a new synthesis of agentic AI and participatory urbanism.

[251] Data-Efficient Safe Policy Improvement Using Parametric Structure

Kasper Engelen, Guillermo A. Pérez, Marnix Suilen

Main category: cs.AI

TL;DR: A parametric safe policy improvement approach that leverages transition dynamics correlations and action pruning to dramatically improve data efficiency in offline reinforcement learning.

DetailsMotivation: Standard SPI methods in MDPs don't utilize known parametric dependencies between transition distributions, leading to inefficient data usage despite available structural information.

Method: Three techniques: (1) parametric SPI algorithm exploiting distribution correlations for better transition estimation, (2) game-based abstraction for action pruning, (3) SMT-based preprocessing for advanced action pruning.

Result: Empirical results show multiple orders of magnitude improvement in data efficiency while maintaining reliability guarantees.

Conclusion: Leveraging parametric dependencies and action pruning techniques significantly enhances SPI data efficiency without compromising safety guarantees.

Abstract: Safe policy improvement (SPI) is an offline reinforcement learning problem in which a new policy that reliably outperforms the behavior policy with high confidence needs to be computed using only a dataset and the behavior policy. Markov decision processes (MDPs) are the standard formalism for modeling environments in SPI. In many applications, additional information in the form of parametric dependencies between distributions in the transition dynamics is available. We make SPI more data-efficient by leveraging these dependencies through three contributions: (1) a parametric SPI algorithm that exploits known correlations between distributions to more accurately estimate the transition dynamics using the same amount of data; (2) a preprocessing technique that prunes redundant actions from the environment through a game-based abstraction; and (3) a more advanced preprocessing technique, based on satisfiability modulo theory (SMT) solving, that can identify more actions to prune. Empirical results and an ablation study show that our techniques increase the data efficiency of SPI by multiple orders of magnitude while maintaining the same reliability guarantees.
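
The data-efficiency gain from known parametric dependencies can be seen in the tiny sketch below: when two transition probabilities are known to be equal, pooling their samples gives a tighter estimate than estimating each independently. This is only the statistical intuition behind contribution (1), not the paper's SPI algorithm, game-based abstraction, or SMT preprocessing.

```python
# Toy illustration: pooling counts across transitions that share a parameter.
import numpy as np

rng = np.random.default_rng(1)
p_true = 0.7
counts_a = rng.binomial(1, p_true, size=20)     # sparse data for transition A
counts_b = rng.binomial(1, p_true, size=20)     # sparse data for transition B

independent_a = counts_a.mean()
independent_b = counts_b.mean()
pooled = np.concatenate([counts_a, counts_b]).mean()    # uses the known dependency p_A = p_B

print(f"independent: {independent_a:.2f}, {independent_b:.2f}; pooled: {pooled:.2f} (true {p_true})")
```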

[252] Modeling Uncertainty: Constraint-Based Belief States in Imperfect-Information Games

Achille Morenville, Éric Piette

Main category: cs.AI

TL;DR: Constraint-based belief representation performs comparably to probabilistic methods in imperfect-information games with hidden piece identities, suggesting simpler constraint-based approaches may be sufficient for effective decision-making.

DetailsMotivation: To address the challenge of decision-making with partial knowledge in imperfect-information games by exploring belief representation methods that reduce the need for game-specific inference logic.

Method: Two approaches were investigated: 1) constraint-based model using Constraint Satisfaction Problems, and 2) probabilistic extension using Belief Propagation for marginal probability estimation. Both were evaluated using general-purpose agents across two different games.

Result: Constraint-based beliefs yielded results comparable to probabilistic inference, with minimal differences in agent performance between the two approaches.

Conclusion: Constraint-based belief states alone may suffice for effective decision-making in many imperfect-information game settings, potentially simplifying agent design.

Abstract: In imperfect-information games, agents must make decisions based on partial knowledge of the game state. The Belief Stochastic Game model addresses this challenge by delegating state estimation to the game model itself. This allows agents to operate on externally provided belief states, thereby reducing the need for game-specific inference logic. This paper investigates two approaches to represent beliefs in games with hidden piece identities: a constraint-based model using Constraint Satisfaction Problems and a probabilistic extension using Belief Propagation to estimate marginal probabilities. We evaluated the impact of both representations using general-purpose agents across two different games. Our findings indicate that constraint-based beliefs yield results comparable to those of probabilistic inference, with minimal differences in agent performance. This suggests that constraint-based belief states alone may suffice for effective decision-making in many settings.
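
A toy version of the constraint-based belief state is sketched below: hidden piece identities are enumerated and filtered by the constraints implied by past observations. The game, constraints, and brute-force enumeration are illustrative; the paper formulates this as a proper CSP and adds belief propagation for the probabilistic variant.

```python
# Minimal sketch of a constraint-based belief state over hidden piece identities.
from itertools import permutations

pieces = ["p1", "p2", "p3"]
identities = ["rock", "paper", "scissors"]          # each identity assigned to exactly one piece

# observations accumulated during play, expressed as constraints on the assignment
constraints = [
    lambda a: a["p1"] != "rock",                    # p1 revealed not to be rock
    lambda a: a["p2"] != "scissors",                # p2 survived a scissors matchup
]

def belief_state():
    """All identity assignments consistent with every observed constraint."""
    consistent = []
    for perm in permutations(identities):
        assignment = dict(zip(pieces, perm))
        if all(c(assignment) for c in constraints):
            consistent.append(assignment)
    return consistent

for a in belief_state():
    print(a)
```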

[253] DualSG: A Dual-Stream Explicit Semantic-Guided Multivariate Time Series Forecasting Framework

Kuiye Ding, Fanda Fan, Yao Wang, Ruijie jian, Xiaorui Wang, Luqi Gong, Yishan Jiang, Chunjie Luo, Jianfeng Zhan

Main category: cs.AI

TL;DR: DualSG framework uses LLMs as semantic guides to refine traditional time series forecasts rather than replacing them, achieving superior performance through explicit semantic guidance and interpretable time series captions.

DetailsMotivation: Existing LLM-based time series forecasting methods either lose numerical precision by treating LLMs as end-to-end forecasters or struggle with modality alignment in latent space, requiring a better approach that leverages LLMs' semantic reasoning without compromising numerical accuracy.

Method: Proposes DualSG dual-stream framework with LLMs as Semantic Guides, introduces Time Series Caption for explicit trend pattern summarization in natural language, and designs caption-guided fusion module for inter-variable relationship modeling with reduced noise and computation.

Result: Experiments on diverse real-world datasets show DualSG consistently outperforms 15 state-of-the-art baselines, demonstrating the effectiveness of combining numerical forecasting with explicit semantic guidance.

Conclusion: Treating LLMs as semantic guidance modules rather than standalone forecasters provides better performance by explicitly combining numerical forecasting strengths with semantic reasoning capabilities through interpretable natural language context.

Abstract: Multivariate Time Series Forecasting (MTSF) plays a key role in many applications. Recent works have explored using Large Language Models for MTSF to take advantage of their reasoning abilities. However, many methods treat LLMs as end-to-end forecasters, which often leads to a loss of numerical precision and forces LLMs to handle patterns beyond their intended design. Alternatively, methods that attempt to align textual and time series modalities within latent space frequently encounter alignment difficulty. In this paper, we propose to treat LLMs not as standalone forecasters, but as semantic guidance modules within a dual-stream framework. We propose DualSG, a dual-stream framework that provides explicit semantic guidance, where LLMs act as Semantic Guides to refine rather than replace traditional predictions. As part of DualSG, we introduce Time Series Caption, an explicit prompt format that summarizes trend patterns in natural language and provides interpretable context for LLMs, rather than relying on implicit alignment between text and time series in the latent space. We also design a caption-guided fusion module that explicitly models inter-variable relationships while reducing noise and computation. Experiments on real-world datasets from diverse domains show that DualSG consistently outperforms 15 state-of-the-art baselines, demonstrating the value of explicitly combining numerical forecasting with semantic guidance.
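
The Time Series Caption idea can be sketched as a small rule-based summarizer that turns a numeric window into natural-language context for the LLM; the thresholds, wording, and features below are assumptions for illustration, not the paper's caption format or fusion module.

```python
# Minimal sketch of turning a time series window into a natural-language caption.
import numpy as np

def caption_window(values: np.ndarray, name: str = "series") -> str:
    slope = np.polyfit(np.arange(len(values)), values, 1)[0]
    direction = "rising" if slope > 0.05 else "falling" if slope < -0.05 else "flat"
    volatility = "volatile" if np.std(np.diff(values)) > 0.5 * np.std(values) else "smooth"
    return (f"{name}: {direction} trend ({slope:+.2f}/step), {volatility}, "
            f"range [{values.min():.1f}, {values.max():.1f}]")

window = np.array([10.0, 10.4, 11.1, 11.8, 12.9, 13.5])
print(caption_window(window, "sensor_3"))
```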

[254] Scaling Up without Fading Out: Goal-Aware Sparse GNN for RL-based Generalized Planning

Sangwoo Jeon, Juchul Shin, Gyeong-Tae Kim, YeonJe Cho, Seongwoo Kim

Main category: cs.AI

TL;DR: Proposes sparse goal-aware GNN representation to overcome limitations of dense graph approaches in generalized planning, enabling scaling to larger grid environments with improved generalization.

DetailsMotivation: Existing GNN-based planning approaches use fully connected graphs that cause combinatorial explosion in edges, memory issues, and diluted node information as problem scales increase, making large-scale planning infeasible.

Method: Developed a sparse, goal-aware GNN representation that selectively encodes relevant local relationships and explicitly integrates spatial features related to goals. Validated using novel drone mission scenarios in grid world environments based on PDDL.

Result: Method scales effectively to larger grid sizes previously infeasible with dense representations, substantially improves policy generalization and success rates in drone mission scenarios.

Conclusion: Provides practical foundation for addressing realistic large-scale generalized planning tasks by overcoming scalability limitations of traditional dense graph representations through sparse, goal-aware encoding.

Abstract: Generalized planning using deep reinforcement learning (RL) combined with graph neural networks (GNNs) has shown promising results in various symbolic planning domains described by PDDL. However, existing approaches typically represent planning states as fully connected graphs, leading to a combinatorial explosion in edge information and substantial sparsity as problem scales grow, especially evident in large grid-based environments. This dense representation results in diluted node-level information, exponentially increases memory requirements, and ultimately makes learning infeasible for larger-scale problems. To address these challenges, we propose a sparse, goal-aware GNN representation that selectively encodes relevant local relationships and explicitly integrates spatial features related to the goal. We validate our approach by designing novel drone mission scenarios based on PDDL within a grid world, effectively simulating realistic mission execution environments. Our experimental results demonstrate that our method scales effectively to larger grid sizes previously infeasible with dense graph representations and substantially improves policy generalization and success rates. Our findings provide a practical foundation for addressing realistic, large-scale generalized planning tasks.
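
A minimal sketch of the graph-construction contrast follows: instead of a fully connected graph, only entities within a small Manhattan distance are linked, and every node's features carry its offset to the goal. The entity layout, distance threshold, and feature layout are illustrative assumptions, not the paper's PDDL-grounded encoding.

```python
# Minimal sketch of building a sparse, goal-aware graph for a grid planning state.
import numpy as np

def build_sparse_goal_graph(entities: dict[str, tuple[int, int]], goal: tuple[int, int],
                            max_dist: int = 2):
    names = list(entities)
    feats = np.array([[x, y, goal[0] - x, goal[1] - y]          # position + goal offset
                      for x, y in entities.values()], dtype=float)
    edges = [(i, j) for i in range(len(names)) for j in range(len(names))
             if i != j and abs(entities[names[i]][0] - entities[names[j]][0])
                        + abs(entities[names[i]][1] - entities[names[j]][1]) <= max_dist]
    return names, feats, edges

entities = {"drone": (0, 0), "pkg_a": (1, 1), "pkg_b": (7, 8), "depot": (8, 8)}
names, feats, edges = build_sparse_goal_graph(entities, goal=(8, 8))
print(edges)        # only nearby entity pairs are connected
```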

[255] FutureX: An Advanced Live Benchmark for LLM Agents in Future Prediction

Zhiyuan Zeng, Jiashuo Liu, Siyuan Chen, Tianci He, Yali Liao, Jinpeng Wang, Zaiyuan Wang, Yang Yang, Lingyue Yin, Mingren Yin, Zhenwei Zhu, Tianle Cai, Zehui Chen, Jiecao Chen, Yantao Du, Xiang Gao, Jiacheng Guo, Liang Hu, Jianpeng Jiao, Xiangsheng Li, Jingkai Liu, Shuang Ni, Zhoufutu Wen, Ge Zhang, Kaiyuan Zhang, Xin Zhou, Jose Blanchet, Xipeng Qiu, Mengdi Wang, Wenhao Huang

Main category: cs.AI

TL;DR: FutureX is a dynamic live benchmark for evaluating LLM agents on future prediction tasks, featuring real-time updates and automated pipelines to prevent data contamination, with comprehensive evaluation of 25 models.

DetailsMotivation: No large-scale benchmark exists for evaluating LLM agents on future prediction due to challenges with real-time updates and timely information retrieval, despite the importance of this complex analytical task.

Method: Developed FutureX benchmark with automated pipeline for question gathering and answer collection, supporting real-time daily updates. Evaluated 25 LLM/agent models including reasoning, search capabilities, and external tool integration.

Result: Comprehensive evaluation assessed agents’ adaptive reasoning in dynamic environments, identified failure modes including vulnerability to fake web pages and temporal validity issues.

Conclusion: FutureX establishes a dynamic, contamination-free evaluation standard to drive development of LLM agents capable of professional-level predictive thinking and complex reasoning.

Abstract: Future prediction is a complex task for LLM agents, requiring a high level of analytical thinking, information gathering, contextual understanding, and decision-making under uncertainty. Agents must not only gather and interpret vast amounts of dynamic information but also integrate diverse data sources, weigh uncertainties, and adapt predictions based on emerging trends, just as human experts do in fields like politics, economics, and finance. Despite its importance, no large-scale benchmark exists for evaluating agents on future prediction, largely due to challenges in handling real-time updates and retrieving timely, accurate answers. To address this, we introduce $\textbf{FutureX}$, a dynamic and live evaluation benchmark specifically designed for LLM agents performing future prediction tasks. FutureX is the largest and most diverse live benchmark for future prediction, supporting real-time daily updates and eliminating data contamination through an automated pipeline for question gathering and answer collection. We evaluate 25 LLM/agent models, including those with reasoning, search capabilities, and integration of external tools such as the open-source Deep Research Agent and closed-source Deep Research models. This comprehensive evaluation assesses agents’ adaptive reasoning and performance in dynamic environments. Additionally, we provide in-depth analyses of agents’ failure modes and performance pitfalls in future-oriented tasks, including the vulnerability to fake web pages and the temporal validity. Our goal is to establish a dynamic, contamination-free evaluation standard that drives the development of LLM agents capable of performing at the level of professional human analysts in complex reasoning and predictive thinking.

[256] Modeling Relational Logic Circuits for And-Inverter Graph Convolutional Network

Weihao Sun

Main category: cs.AI

TL;DR: AIGer is a novel framework that combines node logic feature initialization and heterogeneous graph convolutional networks to jointly model functional and structural characteristics of And-Inverter Graphs, achieving significant performance improvements in circuit analysis tasks.

DetailsMotivation: Existing methods struggle with accurate modeling of complex, large-scale And-Inverter Graphs due to their inability to jointly model functional and structural characteristics and insufficient dynamic information propagation capabilities.

Method: AIGer consists of two components: 1) Node logic feature initialization embedding that projects logic nodes into semantic spaces, and 2) AIGs feature learning network using heterogeneous graph convolutional networks with dynamic relationship weight matrices and differentiated information aggregation.

Result: AIGer outperforms state-of-the-art models, improving MAE by 18.95% and MSE by 44.44% in Signal Probability Prediction, and achieving 33.57% MAE and 14.79% MSE improvements in Truth Table Distance Prediction.

Conclusion: The proposed AIGer framework effectively addresses the challenges of joint functional-structural modeling in complex AIGs and demonstrates superior performance in key EDA tasks, representing a significant advancement in automated logic circuit design.

Abstract: The automation of logic circuit design enhances chip performance, energy efficiency, and reliability, and is widely applied in the field of Electronic Design Automation (EDA). And-Inverter Graphs (AIGs) efficiently represent, optimize, and verify the functional characteristics of digital circuits, enhancing the efficiency of EDA development. Due to the complex structure and large scale of nodes in real-world AIGs, accurate modeling is challenging, leading to existing work lacking the ability to jointly model functional and structural characteristics, as well as insufficient dynamic information propagation capability. To address the aforementioned challenges, we propose AIGer. Specifically, AIGer consists of two components: 1) Node logic feature initialization embedding component and 2) AIGs feature learning network component. The node logic feature initialization embedding component projects logic nodes, such as AND and NOT, into independent semantic spaces, to enable effective node embedding for subsequent processing. Building upon this, the AIGs feature learning network component employs a heterogeneous graph convolutional network, designing dynamic relationship weight matrices and differentiated information aggregation approaches to better represent the original structure and information of AIGs. The combination of these two components enhances AIGer's ability to jointly model functional and structural characteristics and improves its message passing capability. Experimental results indicate that AIGer outperforms the current best models in the Signal Probability Prediction (SSP) task, improving MAE and MSE by 18.95% and 44.44%, respectively. In the Truth Table Distance Prediction (TTDP) task, AIGer achieves improvements of 33.57% and 14.79% in MAE and MSE, respectively, compared to the best-performing models.
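
The two components named above can be caricatured in a few lines: type-specific initial embeddings for AIG nodes and one heterogeneous message-passing step with a separate weight matrix per edge relation. The dimensions, toy circuit, and update rule are illustrative assumptions, not AIGer's architecture.

```python
# Minimal sketch of logic-aware node initialization plus one heterogeneous GCN-style step.
import numpy as np

rng = np.random.default_rng(0)
d = 4
type_embedding = {"PI": rng.normal(size=d), "AND": rng.normal(size=d), "NOT": rng.normal(size=d)}
relation_weight = {"and_in": rng.normal(size=(d, d)), "not_in": rng.normal(size=(d, d))}

# tiny AIG: out = NOT(AND(a, b))
nodes = {"a": "PI", "b": "PI", "g1": "AND", "out": "NOT"}
edges = [("a", "g1", "and_in"), ("b", "g1", "and_in"), ("g1", "out", "not_in")]

h = {n: type_embedding[t].copy() for n, t in nodes.items()}     # logic-aware initialization

def hetero_conv_step(h):
    new_h = {n: v.copy() for n, v in h.items()}
    for src, dst, rel in edges:
        new_h[dst] += relation_weight[rel] @ h[src]              # per-relation transformation
    return {n: np.tanh(v) for n, v in new_h.items()}

h = hetero_conv_step(h)
print(np.round(h["out"], 3))
```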

cs.SD

[257] Is Transfer Learning Necessary for Violin Transcription?

Yueh-Po Peng, Ting-Kang Wang, Li Su, Vincent K. M. Cheung

Main category: cs.SD

TL;DR: Violin transcription models trained from scratch on 30 hours of violin data perform competitively with piano-pretrained fine-tuned models, showing instrument-specific training can be effective without piano transfer learning.

DetailsMotivation: Violin automatic music transcription (AMT) lags behind piano AMT due to limited annotated data. The effectiveness of transferring piano-pretrained models to violin remains unclear given timbral and articulatory differences between instruments.

Method: Used a piano transcription architecture without modification, trained from scratch on the MOSA dataset containing ~30 hours of aligned violin recordings. Compared against fine-tuned piano-pretrained models on URMP and Bach10 datasets.

Result: Models trained from scratch achieved competitive or superior performance compared to fine-tuned piano-pretrained counterparts on violin transcription tasks.

Conclusion: Strong violin AMT is possible without relying on pretrained piano representations, emphasizing the importance of instrument-specific data collection and augmentation strategies rather than transfer learning from piano models.

Abstract: Automatic music transcription (AMT) has achieved remarkable progress for instruments such as the piano, largely due to the availability of large-scale, high-quality datasets. In contrast, violin AMT remains underexplored due to limited annotated data. A common approach is to fine-tune pretrained models for other downstream tasks, but the effectiveness of such transfer remains unclear in the presence of timbral and articulatory differences. In this work, we investigate whether training from scratch on a medium-scale violin dataset can match the performance of fine-tuned piano-pretrained models. We adopt a piano transcription architecture without modification and train it on the MOSA dataset, which contains about 30 hours of aligned violin recordings. Our experiments on URMP and Bach10 show that models trained from scratch achieved competitive or even superior performance compared to fine-tuned counterparts. These findings suggest that strong violin AMT is possible without relying on pretrained piano representations, highlighting the importance of instrument-specific data collection and augmentation strategies.

[258] Leveraging Mamba with Full-Face Vision for Audio-Visual Speech Enhancement

Rong Chao, Wenze Ren, You-Jin Li, Kuo-Hsuan Hung, Sung-Feng Huang, Szu-Wei Fu, Wen-Huang Cheng, Yu Tsao

Main category: cs.SD

TL;DR: AVSEMamba is an audio-visual speech enhancement model that combines Mamba-based temporal modeling with visual cues to solve the cocktail party problem, achieving state-of-the-art performance on benchmark datasets.

DetailsMotivation: Existing Mamba-based speech enhancement models like SEMamba are limited to single-speaker scenarios and struggle with complex multi-speaker environments such as the cocktail party problem.

Method: Integrates full-face visual cues with a Mamba-based temporal backbone to leverage spatiotemporal visual information for more accurate target speech extraction in challenging conditions.

Result: Outperforms other monaural baselines on AVSEC-4 Challenge development and blind test sets in speech intelligibility (STOI), perceptual quality (PESQ), and non-intrusive quality (UTMOS), achieving 1st place on the monaural leaderboard.

Conclusion: Audio-visual integration with Mamba-based modeling effectively addresses multi-speaker speech enhancement challenges, demonstrating superior performance in complex acoustic environments.

Abstract: Recent Mamba-based models have shown promise in speech enhancement by efficiently modeling long-range temporal dependencies. However, models like Speech Enhancement Mamba (SEMamba) remain limited to single-speaker scenarios and struggle in complex multi-speaker environments such as the cocktail party problem. To overcome this, we introduce AVSEMamba, an audio-visual speech enhancement model that integrates full-face visual cues with a Mamba-based temporal backbone. By leveraging spatiotemporal visual information, AVSEMamba enables more accurate extraction of target speech in challenging conditions. Evaluated on the AVSEC-4 Challenge development and blind test sets, AVSEMamba outperforms other monaural baselines in speech intelligibility (STOI), perceptual quality (PESQ), and non-intrusive quality (UTMOS), and achieves 1st place on the monaural leaderboard.

[259] DegDiT: Controllable Audio Generation with Dynamic Event Graph Guided Diffusion Transformer

Yisu Liu, Chenxing Li, Wanqian Zhang, Wenfu Wang, Meng Yu, Ruibo Fu, Zheng Lin, Weiping Wang, Dong Yu

Main category: cs.SD

TL;DR: DegDiT is a dynamic event graph-guided diffusion transformer framework for open-vocabulary controllable audio generation that uses structured dynamic graphs to represent events and achieves state-of-the-art performance.

DetailsMotivation: Existing text-to-audio generation methods face trade-offs between accurate temporal localization, open-vocabulary scalability, and practical efficiency, requiring a more effective solution for precise audio control.

Method: Encodes events as structured dynamic graphs with nodes representing semantic features, temporal attributes, and inter-event connections; uses graph transformer for contextualized embeddings; employs quality-balanced data selection and consensus preference optimization.

Result: Achieves state-of-the-art performances on AudioCondition, DESED, and AudioTime datasets across various objective and subjective evaluation metrics.

Conclusion: DegDiT effectively addresses the challenges of controllable audio generation by leveraging dynamic event graphs and consensus optimization, demonstrating superior performance in both content and temporal control.

Abstract: Controllable text-to-audio generation aims to synthesize audio from textual descriptions while satisfying user-specified constraints, including event types, temporal sequences, and onset and offset timestamps. This enables precise control over both the content and temporal structure of the generated audio. Despite recent progress, existing methods still face inherent trade-offs among accurate temporal localization, open-vocabulary scalability, and practical efficiency. To address these challenges, we propose DegDiT, a novel dynamic event graph-guided diffusion transformer framework for open-vocabulary controllable audio generation. DegDiT encodes the events in the description as structured dynamic graphs. The nodes in each graph are designed to represent three aspects: semantic features, temporal attributes, and inter-event connections. A graph transformer is employed to integrate these nodes and produce contextualized event embeddings that serve as guidance for the diffusion model. To ensure high-quality and diverse training data, we introduce a quality-balanced data selection pipeline that combines hierarchical event annotation with multi-criteria quality scoring, resulting in a curated dataset with semantic diversity. Furthermore, we present consensus preference optimization, facilitating audio generation through consensus among multiple reward signals. Extensive experiments on AudioCondition, DESED, and AudioTime datasets demonstrate that DegDiT achieves state-of-the-art performances across a variety of objective and subjective evaluation metrics.
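As a rough illustration of the dynamic event graph described above, here is a small sketch, not the DegDiT model itself: each event node carries a semantic embedding plus onset and offset attributes, and a small transformer mixes the nodes into contextualized embeddings that could serve as guidance. All dimensions and the node feature layout are assumptions.

```python
# Minimal sketch (illustrative data structure, not the DegDiT model): event
# nodes carrying semantics plus onset/offset, mixed into guidance embeddings.
import torch
import torch.nn as nn

class EventGraphEncoder(nn.Module):
    def __init__(self, sem_dim=32, dim=64):
        super().__init__()
        self.proj = nn.Linear(sem_dim + 2, dim)        # semantics + (onset, offset)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.mixer = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, semantics, onsets, offsets):
        nodes = torch.cat([semantics, onsets[..., None], offsets[..., None]], -1)
        return self.mixer(self.proj(nodes))            # contextualized event embeddings

semantics = torch.randn(1, 3, 32)                      # e.g. "dog bark", "car horn", "speech"
onsets = torch.tensor([[0.0, 2.5, 4.0]])
offsets = torch.tensor([[1.5, 3.0, 9.0]])
print(EventGraphEncoder()(semantics, onsets, offsets).shape)  # guidance for the diffusion model
```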

[260] Evaluating Identity Leakage in Speaker De-Identification Systems

Seungmin Seo, Oleg Aulov, Afzal Godil, Kevin Mangold

Main category: cs.SD

TL;DR: Current speaker de-identification systems all leak identity information, with the best performing system only slightly better than random guessing and the worst achieving 45% hit rate in top 50 candidates.

DetailsMotivation: To benchmark and quantify residual identity leakage in speaker de-identification systems to assess privacy risks.

Method: Introduced a benchmark with three complementary error rates: equal error rate, cumulative match characteristic hit rate, and embedding-space similarity via canonical correlation analysis and Procrustes analysis.

Result: All state-of-the-art speaker de-identification systems leak identity information, with performance ranging from slightly better than random guessing to 45% hit rate within top 50 candidates.

Conclusion: Current speaker de-identification technologies have persistent privacy risks due to significant identity leakage.

Abstract: Speaker de-identification aims to conceal a speaker’s identity while preserving intelligibility of the underlying speech. We introduce a benchmark that quantifies residual identity leakage with three complementary error rates: equal error rate, cumulative match characteristic hit rate, and embedding-space similarity measured via canonical correlation analysis and Procrustes analysis. Evaluation results reveal that all state-of-the-art speaker de-identification systems leak identity information. The highest performing system in our evaluation performs only slightly better than random guessing, while the lowest performing system achieves a 45% hit rate within the top 50 candidates based on CMC. These findings highlight persistent privacy risks in current speaker de-identification technologies.
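The cumulative match characteristic hit rate mentioned above can be illustrated with a short sketch; the embedding model, similarity measure, and data here are assumptions, not the benchmark's actual setup.

```python
# Minimal sketch (assumed setup): rank-k CMC hit rate between de-identified
# probe embeddings and enrolled speaker embeddings, using cosine similarity.
import numpy as np

def cmc_hit_rate(probes, gallery, labels, k=50):
    # probes: (n, d) de-identified embeddings, gallery: (m, d) enrolled speakers,
    # labels[i] is the index of probe i's true speaker in the gallery.
    probes = probes / np.linalg.norm(probes, axis=1, keepdims=True)
    gallery = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sims = probes @ gallery.T                       # cosine similarity matrix
    topk = np.argsort(-sims, axis=1)[:, :k]         # best matches first
    hits = (topk == np.asarray(labels)[:, None]).any(axis=1)
    return hits.mean()                              # fraction found in top-k

rng = np.random.default_rng(0)
gallery = rng.normal(size=(100, 192))
labels = rng.integers(0, 100, size=50)
probes = gallery[labels] + rng.normal(scale=2.0, size=(50, 192))  # leaky de-identification
print(f"top-50 hit rate: {cmc_hit_rate(probes, gallery, labels):.2f}")
```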

[261] Adaptation and Optimization of Automatic Speech Recognition (ASR) for the Maritime Domain in the Field of VHF Communication

Emin Cagatay Nakilcioglu, Maximilian Reimann, Ole John

Main category: cs.SD

TL;DR: Multilingual ASR system for maritime VHF radio communication using deep learning to convert radio signals to text

DetailsMotivation: Address challenges in maritime radio communication by automating speech-to-text conversion for VHF signals

Method: Deep learning architecture (marFM) combining audio processing techniques and machine learning algorithms

Result: Evaluated transcription performance on various maritime radio data

Conclusion: Proposed ASR model shows promise for automating maritime radio communication transcription

Abstract: This paper introduces a multilingual automatic speech recognizer (ASR) for maritime radio communication that automatically converts received VHF radio signals into text. The challenges of maritime radio communication are described first, and the deep learning architecture of marFM, consisting of audio processing techniques and machine learning algorithms, is presented. Subsequently, maritime radio data of interest is analyzed and then used to evaluate the transcription performance of our ASR model for various maritime radio data.

[262] AxLSTMs: learning self-supervised audio representations with xLSTMs

Sarthak Yadav, Sergios Theodoridis, Zheng-Hua Tan

Main category: cs.SD

TL;DR: Audio xLSTM (AxLSTM) applies the extended LSTM architecture to self-supervised audio representation learning, outperforming transformer baselines with fewer parameters.

DetailsMotivation: While xLSTMs have shown competitive performance to transformers in other domains, their effectiveness for self-supervised audio representation learning hasn't been evaluated, despite the transformer's limitations.

Method: Proposes Audio xLSTM (AxLSTM) that learns audio representations from masked spectrogram patches in a self-supervised setting, pretrained on AudioSet dataset.

Result: AxLSTM models outperform comparable self-supervised audio spectrogram transformer (SSAST) baselines by up to 25% relative performance across ten diverse downstream tasks while having up to 45% fewer parameters.

Conclusion: xLSTM architecture is viable and effective for self-supervised general-purpose audio representation learning, offering superior performance with reduced parameter count compared to transformer approaches.

Abstract: While the transformer has emerged as the eminent neural architecture, several independent lines of research have emerged to address its limitations. Recurrent neural approaches have observed a lot of renewed interest, including the extended long short-term memory (xLSTM) architecture, which reinvigorates the original LSTM. However, while xLSTMs have shown competitive performance compared to the transformer, their viability for learning self-supervised general-purpose audio representations has not been evaluated. This work proposes Audio xLSTM (AxLSTM), an approach for learning audio representations from masked spectrogram patches in a self-supervised setting. Pretrained on the AudioSet dataset, the proposed AxLSTM models outperform comparable self-supervised audio spectrogram transformer (SSAST) baselines by up to 25% in relative performance across a set of ten diverse downstream tasks while having up to 45% fewer parameters.
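A minimal sketch of the masked-spectrogram-patch setup the summary describes follows; the patch size, mask ratio, and spectrogram shape are assumptions rather than the AxLSTM configuration.

```python
# Minimal sketch (illustrative): masking a fraction of spectrogram patches so
# that an encoder sees only the visible patches and reconstructs the rest.
import torch

def mask_spectrogram_patches(spec, patch=(16, 16), mask_ratio=0.75):
    # spec: (freq_bins, time_frames), assumed divisible by the patch size
    f, t = spec.shape
    pf, pt = patch
    patches = spec.reshape(f // pf, pf, t // pt, pt).permute(0, 2, 1, 3)
    patches = patches.reshape(-1, pf * pt)          # (num_patches, patch_dim)
    num_mask = int(mask_ratio * patches.shape[0])
    perm = torch.randperm(patches.shape[0])
    masked_idx, visible_idx = perm[:num_mask], perm[num_mask:]
    return patches[visible_idx], patches[masked_idx], visible_idx, masked_idx

spec = torch.randn(128, 256)                        # toy log-mel spectrogram
visible, target, vis_idx, msk_idx = mask_spectrogram_patches(spec)
print(visible.shape, target.shape)                  # encoder input vs. reconstruction targets
```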

[263] VoiceCloak: A Multi-Dimensional Defense Framework against Unauthorized Diffusion-based Voice Cloning

Qianyue Hu, Junyan Wu, Wei Lu, Xiangyang Luo

Main category: cs.SD

TL;DR: VoiceCloak is a proactive defense framework that protects against unauthorized voice cloning by diffusion models through adversarial perturbations that obfuscate speaker identity and degrade output quality.

DetailsMotivation: Diffusion models enable highly realistic voice cloning but create security risks for malicious misuse. Existing defenses are incompatible with diffusion models' complex generative mechanisms.

Method: VoiceCloak introduces adversarial perturbations to reference audio by: 1) distorting speaker identity embeddings using auditory perception principles, 2) disrupting conditional guidance processes like attention context, 3) amplifying score magnitude to steer generation away from high-quality speech, and 4) employing noise-guided semantic corruption.

Result: Extensive experiments show VoiceCloak achieves outstanding defense success rates against unauthorized diffusion-based voice cloning attacks.

Conclusion: VoiceCloak effectively bridges the gap in proactive defense for diffusion-based voice cloning by targeting specific vulnerabilities in diffusion models through multi-dimensional adversarial perturbations.

Abstract: Diffusion Models (DMs) have achieved remarkable success in realistic voice cloning (VC), while they also increase the risk of malicious misuse. Existing proactive defenses designed for traditional VC models aim to disrupt the forgery process, but they have been proven incompatible with DMs due to the intricate generative mechanisms of diffusion. To bridge this gap, we introduce VoiceCloak, a multi-dimensional proactive defense framework with the goal of obfuscating speaker identity and degrading perceptual quality in potential unauthorized VC. To achieve these goals, we conduct a focused analysis to identify specific vulnerabilities within DMs, allowing VoiceCloak to disrupt the cloning process by introducing adversarial perturbations into the reference audio. Specifically, to obfuscate speaker identity, VoiceCloak first targets speaker identity by distorting representation learning embeddings to maximize identity variation, which is guided by auditory perception principles. Additionally, VoiceCloak disrupts crucial conditional guidance processes, particularly attention context, thereby preventing the alignment of vocal characteristics that are essential for achieving convincing cloning. Then, to address the second objective, VoiceCloak introduces score magnitude amplification to actively steer the reverse trajectory away from the generation of high-quality speech. Noise-guided semantic corruption is further employed to disrupt structural speech semantics captured by DMs, degrading output quality. Extensive experiments highlight VoiceCloak’s outstanding defense success rate against unauthorized diffusion-based voice cloning. Audio samples of VoiceCloak are available at https://voice-cloak.github.io/VoiceCloak/.

[264] Multi-Sampling-Frequency Naturalness MOS Prediction Using Self-Supervised Learning Model with Sampling-Frequency-Independent Layer

Go Nishikawa, Wataru Nakata, Yuki Saito, Kanami Imamura, Hiroshi Saruwatari, Tomohiko Nakamura

Main category: cs.SD

TL;DR: A MOS prediction model for speech with multiple sampling frequencies using SF-independent convolutional layers and SSL, achieving top rankings in AMC 2025 Track 3.

DetailsMotivation: To address the challenge of mean opinion score prediction for speech with varying sampling frequencies in the AudioMOS Challenge.

Method: Integrates SF-independent convolutional layers into SSL model, uses knowledge distillation from pretrained non-SFI-SSL model, and pretrains with large-scale MOS dataset.

Result: Ranked first in one evaluation metric and fourth in final ranking of AMC 2025 Track 3.

Conclusion: The proposed SFI approach combined with knowledge distillation and large-scale pretraining effectively handles MOS prediction across multiple sampling frequencies.

Abstract: We introduce our submission to the AudioMOS Challenge (AMC) 2025 Track 3: mean opinion score (MOS) prediction for speech with multiple sampling frequencies (SFs). Our submitted model integrates an SF-independent (SFI) convolutional layer into a self-supervised learning (SSL) model to achieve SFI speech feature extraction for MOS prediction. We present some strategies to improve the MOS prediction performance of our model: distilling knowledge from a pretrained non-SFI-SSL model and pretraining with a large-scale MOS dataset. Our submission to the AMC 2025 Track 3 ranked first in one evaluation metric and fourth in the final ranking. We also report the results of our ablation study to investigate essential factors of our model.

[265] What Matters for Bioacoustic Encoding

Marius Miron, David Robinson, Milad Alizadeh, Ellen Gilsenan-McMahon, Gagan Narula, Emmanuel Chemla, Maddie Cusimano, Felix Effenberger, Masato Hagiwara, Benjamin Hoffman, Sara Keen, Diane Kim, Jane Lawton, Jen-Yu Liu, Aza Raskin, Olivier Pietquin, Matthieu Geist

Main category: cs.SD

TL;DR: Large-scale study on bioacoustic encoders showing that self-supervised pre-training followed by supervised training on diverse audio data yields state-of-the-art performance across 26 bioacoustic tasks.

DetailsMotivation: Bioacoustic tasks suffer from limited annotated data and existing encoders are limited in scope (focusing mainly on birds), architecture diversity, and evaluation breadth.

Method: Self-supervised pre-training followed by supervised post-training on a mixed bioacoustics + general-audio corpus, evaluated across 26 datasets covering species classification, detection, individual ID, and vocal repertoire discovery.

Result: Achieved state-of-the-art performance on existing and proposed benchmarks, demonstrating the importance of data diversity in both training stages for strong in- and out-of-distribution performance.

Conclusion: Identified key factors for training effective bioacoustic encoders and will release model checkpoints to support ongoing research and applications in bioacoustics.

Abstract: Bioacoustics, the study of sounds produced by living organisms, plays a vital role in conservation, biodiversity monitoring, and behavioral studies. Many tasks in this field, such as species, individual, and behavior classification and detection, are well-suited to machine learning. However, they often suffer from limited annotated data, highlighting the need for a general-purpose bioacoustic encoder capable of extracting useful representations for diverse downstream tasks. Such encoders have been proposed before, but are often limited in scope due to a focus on a narrow range of species (typically birds), and a reliance on a single model architecture or training paradigm. Moreover, they are usually evaluated on a small set of tasks and datasets. In this work, we present a large-scale empirical study that covers aspects of bioacoustics that are relevant to research but have previously been scarcely considered: training data diversity and scale, model architectures and training recipes, and the breadth of evaluation tasks and datasets. We obtain encoders that are state-of-the-art on the existing and proposed benchmarks. We also identify what matters for training these encoders, such that this work can be extended when more data are available or better architectures are proposed. Specifically, across 26 datasets with tasks including species classification, detection, individual ID, and vocal repertoire discovery, we find self-supervised pre-training followed by supervised post-training on a mixed bioacoustics + general-audio corpus yields the strongest in- and out-of-distribution performance. We show the importance of data diversity in both stages. To support ongoing research and application, we will release the model checkpoints.

cs.LG

[266] BERT-VQA: Visual Question Answering on Plots

Tai Vu, Robert Yang

Main category: cs.LG

TL;DR: BERT-VQA model for visual question answering on plots underperformed baseline, disproving hypothesis about VisualBERT’s cross-modality effectiveness.

DetailsMotivation: To tackle visual question answering on plots, requiring information exchange between vision and language domains.

Method: Developed BERT-VQA using VisualBERT architecture with pretrained ResNet 101 image encoder and optional joint fusion, compared against LSTM+CNN+classifier baseline.

Result: Final outcome disproved the hypothesis that VisualBERT’s cross-modality module is essential for aligning plot components with question phrases.

Conclusion: Provided insights into the difficulty of plot question answering and appropriateness of different model architectures for this problem.

Abstract: Visual question answering has been an exciting challenge in the field of natural language understanding, as it requires deep learning models to exchange information from both vision and language domains. In this project, we aim to tackle a subtask of this problem, namely visual question answering on plots. To achieve this, we developed BERT-VQA, a VisualBERT-based model architecture with a pretrained ResNet 101 image encoder, along with a potential addition of joint fusion. We trained and evaluated this model against a baseline that consisted of an LSTM, a CNN, and a shallow classifier. The final outcome disproved our core hypothesis that the cross-modality module in VisualBERT is essential in aligning plot components with question phrases. Therefore, our work provided valuable insights into the difficulty of the plot question answering challenge as well as the appropriateness of different model architectures in solving this problem.

[267] Contextual Attention-Based Multimodal Fusion of LLM and CNN for Sentiment Analysis

Meriem Zerkouk, Miloud Mihoubi, Belkacem Chikhaoui

Main category: cs.LG

TL;DR: Novel multimodal sentiment analysis approach combining CNN image processing with LLM text analysis using GPT and prompt engineering, achieving 2.43% accuracy and 5.18% F1-score improvement on CrisisMMD dataset for disaster management.

DetailsMotivation: To improve crisis management by better understanding public sentiment during natural disasters through enhanced multimodal analysis of social media data, addressing limitations of conventional separate modality processing.

Method: Integrates CNN-based image analysis with LLM-based text processing using GPT and prompt engineering, introduces contextual attention mechanism for intermodal relationship modeling, and uses deep neural network architecture for feature fusion.

Result: Achieves 2.43% increase in accuracy and 5.18% improvement in F1-score compared to existing baselines, demonstrating superior performance in classifying social media data into informative/noninformative categories across various natural disasters.

Conclusion: The approach provides deeper sentiment insights during crises and presents a promising direction for AI-driven crisis management solutions, with practical implications for real-time disaster response optimization.

Abstract: This paper introduces a novel approach for multimodal sentiment analysis on social media, particularly in the context of natural disasters, where understanding public sentiment is crucial for effective crisis management. Unlike conventional methods that process text and image modalities separately, our approach seamlessly integrates Convolutional Neural Network (CNN) based image analysis with Large Language Model (LLM) based text processing, leveraging Generative Pre-trained Transformer (GPT) and prompt engineering to extract sentiment-relevant features from the CrisisMMD dataset. To effectively model intermodal relationships, we introduce a contextual attention mechanism within the fusion process. Leveraging contextual-attention layers, this mechanism effectively captures intermodality interactions, enhancing the model’s comprehension of complex relationships between textual and visual data. The deep neural network architecture of our model learns from these fused features, leading to improved accuracy compared to existing baselines. Experimental results demonstrate significant advancements in classifying social media data into informative and noninformative categories across various natural disasters. Our model achieves a notable 2.43% increase in accuracy and 5.18% in F1-score, highlighting its efficacy in processing complex multimodal data. Beyond quantitative metrics, our approach provides deeper insight into the sentiments expressed during crises. The practical implications extend to real-time disaster management, where enhanced sentiment analysis can optimize the accuracy of emergency interventions. By bridging the gap between multimodal analysis, LLM-powered text understanding, and disaster response, our work presents a promising direction for Artificial Intelligence (AI) driven crisis management solutions.
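The contextual attention fusion described above can be sketched roughly as cross-attention from text features over image features; this is an assumed form for illustration, not the paper's architecture, and all dimensions are hypothetical.

```python
# Minimal sketch (assumed fusion form): text tokens attend over CNN image
# features, and the fused representation feeds a small classifier head.
import torch
import torch.nn as nn

class ContextualAttentionFusion(nn.Module):
    def __init__(self, dim=256, num_classes=2):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.head = nn.Sequential(nn.Linear(2 * dim, 128), nn.ReLU(),
                                  nn.Linear(128, num_classes))

    def forward(self, text_feats, image_feats):
        # text_feats: (batch, text_tokens, dim) from an LLM-based encoder
        # image_feats: (batch, regions, dim) from a CNN backbone
        attended, _ = self.cross_attn(text_feats, image_feats, image_feats)
        fused = torch.cat([text_feats.mean(1), attended.mean(1)], dim=-1)
        return self.head(fused)                       # informative vs. noninformative

model = ContextualAttentionFusion()
logits = model(torch.randn(4, 32, 256), torch.randn(4, 49, 256))
print(logits.shape)                                   # (4, 2)
```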

[268] Strategies for training point distributions in physics-informed neural networks

Santosh Humagain, Toni Schneidereit

Main category: cs.LG

TL;DR: Systematic evaluation of training point distribution strategies for physics-informed neural networks (PINNs) shows that point distribution significantly impacts solution accuracy and is connected to differential equation characteristics.

DetailsMotivation: Physics-informed neural networks are emerging as promising alternatives for solving differential equations, but their performance depends on various factors including training point distribution, which hasn't been systematically studied.

Method: Tested two ordinary and two partial differential equations with five training point generation strategies using shallow network architectures (1-2 hidden layers), including novel sine-based training points inspired by Chebyshev nodes, with controlled weight initialization for reproducibility.

Result: Training point distributions significantly impact solution accuracy, with evidence showing the impact is connected to the characteristics of the differential equation being solved.

Conclusion: The choice of training point distribution is a critical factor in PINN performance and should be carefully considered based on the specific differential equation characteristics for optimal accuracy.

Abstract: Physics-informed neural networks approach the approximation of differential equations by directly incorporating their structure and given conditions in a loss function. This enables conditions like, e.g., invariants to be easily added during the modelling phase. In addition, the approach can be considered as mesh free and can be utilised to compute solutions on arbitrary grids after the training phase. Therefore, physics-informed neural networks are emerging as a promising alternative to solving differential equations with methods from numerical mathematics. However, their performance highly depends on a large variety of factors. In this paper, we systematically investigate and evaluate a core component of the approach, namely the training point distribution. We test two ordinary and two partial differential equations with five strategies for training data generation and shallow network architectures, with one and two hidden layers. In addition to common distributions, we introduce sine-based training points, which are motivated by the construction of Chebyshev nodes. The results are challenged by using certain parameter combinations like, e.g., random and fixed-seed weight initialisation for reproducibility. The results show the impact of the training point distributions on the solution accuracy and we find evidence that they are connected to the characteristics of the differential equation.
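As a concrete illustration of the training point strategies compared in the paper, here is a small sketch of a uniform grid and a sine-based point set in the spirit of Chebyshev nodes; the paper's exact construction may differ.

```python
# Minimal sketch (assumed form): sine-based collocation points for a 1-D
# domain [a, b], denser near the boundaries, next to a uniform grid.
import numpy as np

def uniform_points(a, b, n):
    return np.linspace(a, b, n)

def sine_points(a, b, n):
    # Chebyshev-like clustering towards the endpoints via a sine mapping
    k = np.arange(n)
    x = np.sin(np.pi * (k / (n - 1) - 0.5))         # in [-1, 1], denser near the ends
    return a + (b - a) * (x + 1) / 2

print(uniform_points(0.0, 1.0, 5))                  # [0.  0.25 0.5 0.75 1. ]
print(np.round(sine_points(0.0, 1.0, 5), 3))        # [0.  0.146 0.5 0.854 1. ]
```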

[269] Deep Graph Neural Point Process For Learning Temporal Interactive Networks

Su Chen, Xiaohua Qi, Xixun Lin, Yanmin Shang, Xiaolin Xu, Yangxi Li

Main category: cs.LG

TL;DR: DGNPP is a novel Deep Graph Neural Point Process model that combines static topological structure learning with dynamic temporal modeling for temporal interaction networks, outperforming previous approaches.

DetailsMotivation: Previous methods treated temporal interaction networks as coarse-grained multi-sequence prediction problems, ignoring the important influence of network topology structure.

Method: DGNPP uses two key modules: Node Aggregation Layer for capturing topological structures to generate static representations, and Self Attentive Layer for dynamically updating embeddings over time. Both embeddings are incorporated into the event intensity function and optimized via maximum likelihood estimation.

Result: Experimental evaluations on three public datasets show DGNPP achieves superior performance in both event prediction and time prediction tasks with high efficiency, significantly outperforming baseline models.

Conclusion: DGNPP effectively addresses the limitations of prior approaches by incorporating both network topology structure and temporal dynamics, demonstrating strong predictive capabilities for temporal interaction networks.

Abstract: Learning temporal interaction networks (TIN) has previously been regarded as a coarse-grained multi-sequence prediction problem, ignoring the influence of the network topology structure. This paper addresses this limitation by proposing a Deep Graph Neural Point Process (DGNPP) model for TIN. DGNPP consists of two key modules: the Node Aggregation Layer and the Self Attentive Layer. The Node Aggregation Layer captures topological structures to generate static representations for users and items, while the Self Attentive Layer dynamically updates embeddings over time. By incorporating both dynamic and static embeddings into the event intensity function and optimizing the model via maximum likelihood estimation, DGNPP predicts events and occurrence time effectively. Experimental evaluations on three public datasets demonstrate that DGNPP achieves superior performance in event prediction and time prediction tasks with high efficiency, significantly outperforming baseline models and effectively mitigating the limitations of prior approaches.
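A minimal sketch of how static and dynamic embeddings might enter an event intensity function follows; the parameterization is an assumption for illustration, not the paper's exact formulation.

```python
# Minimal sketch (assumed intensity form): a non-negative intensity for a
# user-item pair combining static (topology-derived) and dynamic embeddings.
import torch
import torch.nn as nn

class PairIntensity(nn.Module):
    def __init__(self, dim=32):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(4 * dim, 64), nn.ReLU(),
                                   nn.Linear(64, 1), nn.Softplus())

    def forward(self, static_u, dyn_u, static_i, dyn_i):
        z = torch.cat([static_u, dyn_u, static_i, dyn_i], dim=-1)
        return self.score(z)                          # one intensity value per pair

dim = 32
lam = PairIntensity(dim)(torch.randn(8, dim), torch.randn(8, dim),
                         torch.randn(8, dim), torch.randn(8, dim))
print(lam.shape)                                      # (8, 1), usable in a log-likelihood
```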

[270] A Recurrent Neural Network based Clustering Method for Binary Data Sets in Education

Mizuki Ohira, Toshimichi Saito

Main category: cs.LG

TL;DR: Recurrent neural network applied to cluster large S-P educational charts using network dynamics with multiple fixed points and basins of attraction.

DetailsMotivation: S-P charts become difficult to handle as student numbers increase, requiring a method to classify large charts into smaller manageable ones.

Method: Simple clustering method based on recurrent neural network dynamics with multiple fixed points, where basins of attraction form clusters corresponding to smaller S-P charts.

Result: Effectiveness confirmed through fundamental experiments using average caution index to evaluate clustering performance and characterize student answer pattern singularity.

Conclusion: The proposed recurrent neural network clustering method effectively handles large S-P educational charts by creating smaller clusters through network dynamics.

Abstract: This paper studies an application of a recurrent neural network to a clustering method for the S-P chart: a binary data set used widely in education. As the number of students increases, the S-P chart becomes hard to handle. In order to classify the large chart into smaller charts, we present a simple clustering method based on the network dynamics. In the method, the network has multiple fixed points, and basins of attraction give clusters corresponding to small S-P charts. In order to evaluate the clustering performance, we present an important feature quantity: the average caution index, which characterizes the singularity of students’ answer patterns. Through fundamental experiments, the effectiveness of the method is confirmed.
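The fixed-point clustering idea can be illustrated with a classic Hopfield-style analogue, which is not the paper's specific network: stored prototypes act as fixed points, and each binary answer row settles into one basin of attraction, giving its cluster.

```python
# Minimal sketch (Hopfield-style analogue, not the paper's network): fixed
# points serve as cluster centres for +/-1 answer patterns.
import numpy as np

patterns = np.array([[1, 1, 1, -1, -1, -1],
                     [1, -1, 1, -1, 1, -1]])          # two stored prototypes
W = patterns.T @ patterns / patterns.shape[1]          # Hebbian weights
np.fill_diagonal(W, 0)

def settle(x, steps=10):
    x = x.copy()
    for _ in range(steps):                             # synchronous updates
        x = np.sign(W @ x)
        x[x == 0] = 1
    return x

row = np.array([1, 1, 1, -1, -1, 1])                   # one student's +/-1 answer row
fixed_point = settle(row)
cluster = int(np.argmax(patterns @ fixed_point))       # nearest stored prototype
print(fixed_point, "-> cluster", cluster)
```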

[271] RISE: Enhancing VLM Image Annotation with Self-Supervised Reasoning

Suhang Hu, Wei Hu, Yuhang Su, Fan Zhang

Main category: cs.LG

TL;DR: RISE is a two-stage framework that generates high-quality reasoning chains through reinforcement learning and uses them to improve VLM performance on complex visual tasks without manual annotation.

DetailsMotivation: VLMs struggle with complex reasoning tasks like emotion classification and context-driven object detection. Standard SFT ignores reasoning rationales, while Visual-RFT produces inconsistent reasoning chains due to lack of verified CoTs during pre-training.

Method: Two-stage framework: 1) RISE-CoT uses reinforcement learning to generate visually grounded, logically consistent Chains of Thought through an “annotation-reasoning-annotation” closed-loop that verifies reconstruction ability. 2) RISE-R1 filters high-quality CoTs for supervised fine-tuning followed by reinforcement fine-tuning to achieve expertise.

Result: RISE-trained Qwen2-VL-2B outperforms SFT and Visual-RFT on both complex and simple image annotation tasks, achieving robust performance and enhanced explainability.

Conclusion: RISE provides a self-supervised solution for advancing VLM reasoning capabilities without requiring manually annotated reasoning chains, enabling better performance on complex visual tasks.

Abstract: Vision-Language Models (VLMs) struggle with complex image annotation tasks, such as emotion classification and context-driven object detection, which demand sophisticated reasoning. Standard Supervised Fine-Tuning (SFT) focuses solely on annotation outcomes, ignoring underlying rationales, while Visual Reinforcement Fine-Tuning (Visual-RFT) produces inconsistent Chains of Thought (CoTs) due to the absence of high-quality, verified CoTs during pre-training. We introduce RISE (Reason-Inspire-Strengthen-Expertise), a two-stage framework to overcome these limitations. In the Reason stage (RISE-CoT), a reinforcement learning-driven “annotation-reasoning-annotation” closed-loop generates visually grounded, logically consistent CoTs by verifying their ability to reconstruct original annotations without direct leakage. The Inspire and Strengthen stage (RISE-R1) leverages a high-quality CoT subset, filtered by RISE-CoT rewards, for supervised fine-tuning, followed by reinforcement fine-tuning to produce interpretable reasoning and accurate annotations, achieving Expertise in complex visual tasks. Evaluated on complex and simple image annotation tasks, RISE-trained Qwen2-VL-2B outperforms SFT and Visual-RFT, achieving robust performance and enhanced explainability. RISE offers a self-supervised solution for advancing VLM reasoning without requiring manually annotated CoTs.

[272] Can Masked Autoencoders Also Listen to Birds?

Lukas Rauch, René Heinrich, Ilyass Moummad, Alexis Joly, Bernhard Sick, Christoph Scholz

Main category: cs.LG

TL;DR: Bird-MAE adapts Masked Autoencoders for bird sound classification, achieving state-of-the-art results through full-pipeline adaptation and introducing parameter-efficient prototypical probing that significantly outperforms linear probing.

DetailsMotivation: General-purpose audio MAEs fail to generalize to fine-grained domains like bird sound classification due to subtle inter-species differences and high intra-species acoustic variability, requiring domain-specific adaptation beyond just pretraining data.

Method: Systematically adapted pretraining recipe, fine-tuning methods, and frozen feature utilization using BirdSet (large-scale bioacoustic dataset). Introduced parameter-efficient prototypical probing to enhance frozen MAE representations.

Result: Bird-MAE achieves new SOTA in BirdSet’s multi-label classification benchmark. Prototypical probing outperforms linear probing by up to 37 percentage points in mAP and narrows the gap to fine-tuning. Demonstrates robust few-shot capabilities.

Conclusion: Tailored self-supervised learning pipelines with full-pipeline adaptation are crucial for fine-grained audio domains, and prototypical probing offers parameter-efficient alternative to fine-tuning in low-resource settings.

Abstract: Masked Autoencoders (MAEs) learn rich semantic representations in audio classification through an efficient self-supervised reconstruction task. However, general-purpose models fail to generalize well when applied directly to fine-grained audio domains. Specifically, bird-sound classification requires distinguishing subtle inter-species differences and managing high intra-species acoustic variability, revealing the performance limitations of general-domain Audio-MAEs. This work demonstrates that bridging this domain gap requires full-pipeline adaptation, not just domain-specific pretraining data. We systematically revisit and adapt the pretraining recipe, fine-tuning methods, and frozen feature utilization to bird sounds using BirdSet, a large-scale bioacoustic dataset comparable to AudioSet. Our resulting Bird-MAE achieves new state-of-the-art results in BirdSet’s multi-label classification benchmark. Additionally, we introduce parameter-efficient prototypical probing, enhancing the utility of frozen MAE representations and closely approaching fine-tuning performance in low-resource settings. Bird-MAE’s prototypical probes outperform linear probing by up to 37 percentage points in mean average precision and narrow the gap to fine-tuning across BirdSet downstream tasks. Bird-MAE also demonstrates robust few-shot capabilities with prototypical probing in our newly established few-shot benchmark on BirdSet, highlighting the potential of tailored self-supervised learning pipelines for fine-grained audio domains.
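One common form of a prototypical probe, which may differ from the paper's head, scores frozen encoder features against learnable per-class prototypes; the sketch below assumes cosine-similarity logits and a multi-label objective.

```python
# Minimal sketch (assumed probe design): learnable per-class prototypes scored
# against frozen encoder features, trained while the backbone stays fixed.
import torch
import torch.nn as nn

class PrototypicalProbe(nn.Module):
    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_classes, dim))

    def forward(self, feats):
        # feats: (batch, dim) frozen MAE embeddings; logits are cosine
        # similarities to each class prototype.
        feats = nn.functional.normalize(feats, dim=-1)
        protos = nn.functional.normalize(self.prototypes, dim=-1)
        return feats @ protos.T

frozen_feats = torch.randn(4, 768)                    # from a frozen encoder (hypothetical dim)
probe = PrototypicalProbe(768, 21)                    # e.g. 21 bird species
logits = probe(frozen_feats)
loss = nn.functional.binary_cross_entropy_with_logits(
    logits, torch.zeros_like(logits))                 # multi-label objective
print(logits.shape, float(loss))
```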

[273] Data driven feedback linearization of nonlinear control systems via Lie derivatives and stacked regression approach

Lakshmi Priya P. K., Andreas Schwung

Main category: cs.LG

TL;DR: Novel method combines sparse regression and Lie derivatives to discover physical system equations and design feedback controllers that guarantee no internal dynamics.

DetailsMotivation: Discovering governing equations and designing effective feedback controllers for physical systems is challenging, especially with nonlinear dynamics. Existing approaches need better integration of system identification and control design.

Method: Uses sparse regression algorithm for system identification, then applies Lie derivatives to output function dictionary to design feedback controller with augmented constraint that prevents internal dynamics. Combines stacked regression with relative degree conditions.

Result: Proposes a methodology that can discover true governing equations and feedback linearize physical systems based on prior dynamic behavior knowledge.

Conclusion: The approach successfully integrates system identification and control design, providing a novel way to handle nonlinear physical systems through combined regression and mathematical transformation techniques.

Abstract: Discovering the governing equations of a physical system and designing an effective feedback controller remains one of the most challenging and intensive areas of ongoing research. This task demands a deep understanding of the system behavior, including the nonlinear factors that influence its dynamics. In this article, we propose a novel methodology for identifying a feedback linearized physical system based on known prior dynamic behavior. Initially, the system is identified using a sparse regression algorithm; subsequently, a feedback controller is designed for the discovered system by applying Lie derivatives to the dictionary of output functions to derive an augmented constraint which guarantees that no internal dynamics are observed. Unlike prior related works, this article combines a stacked regression approach with relative degree conditions to discover and feedback linearize the true governing equations of a physical model.
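The identification step can be illustrated with a SINDy-style sequentially thresholded least-squares sketch over a function dictionary; this covers only system identification under assumed dynamics, not the paper's Lie-derivative-based controller design.

```python
# Minimal sketch (SINDy-style identification only, assumed toy dynamics):
# sequentially thresholded least squares over a dictionary of candidate terms.
import numpy as np

def sparse_regression(Theta, dXdt, threshold=0.1, iters=10):
    Xi = np.linalg.lstsq(Theta, dXdt, rcond=None)[0]
    for _ in range(iters):
        Xi[np.abs(Xi) < threshold] = 0.0              # prune small coefficients
        for j in range(dXdt.shape[1]):
            big = np.abs(Xi[:, j]) >= threshold
            if big.any():
                Xi[big, j] = np.linalg.lstsq(Theta[:, big], dXdt[:, j],
                                             rcond=None)[0]
    return Xi

t = np.linspace(0, 10, 2000)
x = np.exp(-0.5 * t)                                   # data from dx/dt = -0.5 x
dxdt = np.gradient(x, t).reshape(-1, 1)
Theta = np.column_stack([np.ones_like(x), x, x**2])    # dictionary [1, x, x^2]
print(np.round(sparse_regression(Theta, dxdt), 3))     # approximately [0, -0.5, 0]
```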

[274] Physically Plausible Data Augmentations for Wearable IMU-based Human Activity Recognition Using Physics Simulation

Nobuyuki Oishi, Philip Birch, Daniel Roggen, Paula Lago

Main category: cs.LG

TL;DR: PPDA uses physics simulation to create realistic sensor data augmentations that preserve activity meaning, outperforming traditional signal transformation methods by 3.7pp average F1 score improvement and reducing training data needs by up to 60%.

DetailsMotivation: Traditional signal transformation data augmentation methods in HAR often create physically implausible data that doesn't preserve activity meaning, limiting model generalization in real-world scenarios.

Method: Physically Plausible Data Augmentation (PPDA) leverages human body movement data from motion capture/video, incorporating realistic variabilities through physics simulation including body movement modifications, sensor placement changes, and hardware effects.

Result: PPDA improved macro F1 scores by average 3.7pp (up to 13pp) and achieved competitive performance with up to 60% fewer training subjects compared to traditional STDAs across three public datasets of daily activities and fitness workouts.

Conclusion: Physics simulation enables cost-effective, scalable generation of synthetic IMU data that addresses annotation scarcity in HAR while preserving physical plausibility, making PPDA a superior approach to traditional augmentation methods.

Abstract: The scarcity of high-quality labeled data in sensor-based Human Activity Recognition (HAR) hinders model performance and limits generalization across real-world scenarios. Data augmentation is a key strategy to mitigate this issue by enhancing the diversity of training datasets. Signal Transformation-based Data Augmentation (STDA) techniques have been widely used in HAR. However, these methods are often physically implausible, potentially resulting in augmented data that fails to preserve the original meaning of the activity labels. In this study, we introduce and systematically characterize Physically Plausible Data Augmentation (PPDA) enabled by physics simulation. PPDA leverages human body movement data from motion capture or video-based pose estimation and incorporates various realistic variabilities through physics simulation, including modifying body movements, sensor placements, and hardware-related effects. We compare the performance of PPDAs with traditional STDAs on three public datasets of daily activities and fitness workouts. First, we evaluate each augmentation method individually, directly comparing PPDAs to their STDA counterparts. Next, we assess how combining multiple PPDAs can reduce the need for initial data collection by varying the number of subjects used for training. Experiments show consistent benefits of PPDAs, improving macro F1 scores by an average of 3.7 pp (up to 13 pp) and achieving competitive performance with up to 60% fewer training subjects than STDAs. As the first systematic study of PPDA in sensor-based HAR, these results highlight the advantages of pursuing physical plausibility in data augmentation and the potential of physics simulation for generating synthetic Inertial Measurement Unit data for training deep learning HAR models. This cost-effective and scalable approach therefore helps address the annotation scarcity challenge in HAR.

[275] MACTAS: Self-Attention-Based Module for Inter-Agent Communication in Multi-Agent Reinforcement Learning

Maciej Wojtala, Bogusz Stefańczyk, Dominik Bogucki, Łukasz Lepak, Jakub Strykowski, Paweł Wawrzyński

Main category: cs.LG

TL;DR: A self-attention-based communication module for multi-agent reinforcement learning that is fully differentiable and achieves state-of-the-art performance on SMAC benchmark.

DetailsMotivation: Existing communication protocols in MARL are often complex and non-differentiable, while communication is essential for collective execution of complex tasks by human agents.

Method: Introduces a self-attention-based communication module that exchanges information between agents in MARL, fully differentiable and seamlessly integrable with any action-value function decomposition method.

Result: Experimental results on SMAC benchmark demonstrate effectiveness, achieving state-of-the-art performance on several maps.

Conclusion: The proposed approach provides a differentiable communication mechanism that enables agents to learn message generation in a reward-driven manner with fixed parameter count independent of agent numbers.

Abstract: Communication is essential for the collective execution of complex tasks by human agents, motivating interest in communication mechanisms for multi-agent reinforcement learning (MARL). However, existing communication protocols in MARL are often complex and non-differentiable. In this work, we introduce a self-attention-based communication module that exchanges information between the agents in MARL. Our proposed approach is fully differentiable, allowing agents to learn to generate messages in a reward-driven manner. The module can be seamlessly integrated with any action-value function decomposition method and can be viewed as an extension of such decompositions. Notably, it includes a fixed number of trainable parameters, independent of the number of agents. Experimental results on the SMAC benchmark demonstrate the effectiveness of our approach, which achieves state-of-the-art performance on several maps.
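A minimal sketch of a shared self-attention communication block follows; it is illustrative rather than the MACTAS implementation, but it shows the key property that the parameter count is independent of the number of agents.

```python
# Minimal sketch (illustrative): a shared self-attention block that mixes
# per-agent hidden states into differentiable messages.
import torch
import torch.nn as nn

class CommModule(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, agent_states):
        # agent_states: (batch, num_agents, dim); each agent attends to all
        # others and receives an aggregated, differentiable "message".
        messages, _ = self.attn(agent_states, agent_states, agent_states)
        return agent_states + messages                # residual message passing

comm = CommModule(dim=64)
states_5_agents = torch.randn(2, 5, 64)
states_8_agents = torch.randn(2, 8, 64)
print(comm(states_5_agents).shape, comm(states_8_agents).shape)  # same weights, any agent count
```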

[276] Towards Human-AI Complementarity in Matching Tasks

Adrian Arnaiz-Rodriguez, Nina Corvelo Benz, Suhas Thejaswi, Nuria Oliver, Manuel Gomez-Rodriguez

Main category: cs.LG

TL;DR: A collaborative matching system (comatch) that combines human and AI decision making by having the algorithm only make decisions it’s highly confident in, deferring others to humans, resulting in better performance than either alone.

DetailsMotivation: Existing algorithmic matching systems don't achieve human-AI complementarity - decisions made by humans using AI systems aren't necessarily better than those made by humans or algorithms alone.

Method: Proposed collaborative matching (comatch) system that selects only decisions it’s most confident in and defers the rest to human decision makers, optimizing the allocation between human and AI decisions to maximize performance.

Result: Large-scale human study with 800 participants showed comatch outperforms both human participants and algorithmic matching alone in matching outcomes.

Conclusion: The collaborative approach of comatch successfully achieves human-AI complementarity in matching decisions, demonstrating superior performance over individual human or algorithmic decision making.

Abstract: Data-driven algorithmic matching systems promise to help human decision makers make better matching decisions in a wide variety of high-stakes application domains, such as healthcare and social service provision. However, existing systems are not designed to achieve human-AI complementarity: decisions made by a human using an algorithmic matching system are not necessarily better than those made by the human or by the algorithm alone. Our work aims to address this gap. To this end, we propose collaborative matching (comatch), a data-driven algorithmic matching system that takes a collaborative approach: rather than making all the matching decisions for a matching task like existing systems, it selects only the decisions that it is the most confident in, deferring the rest to the human decision maker. In the process, comatch optimizes how many decisions it makes and how many it defers to the human decision maker to provably maximize performance. We conduct a large-scale human subject study with 800 participants to validate the proposed approach. The results demonstrate that the matching outcomes produced by comatch outperform those generated by either human participants or by algorithmic matching on their own. The data gathered in our human subject study and an implementation of our system are available as open source at https://github.com/Networks-Learning/human-AI-complementarity-matching.
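The confidence-based deferral at the heart of comatch can be sketched as a simple split of decisions by model confidence; the scoring and fixed budget here are assumptions, whereas the actual system optimizes how many decisions to keep so as to provably maximize performance.

```python
# Minimal sketch (assumed decision rule, not the comatch algorithm): keep only
# the most confident algorithmic matches and defer the rest to a human.
import numpy as np

def split_decisions(scores, budget):
    # scores[i] = model confidence for its proposed match of item i;
    # budget = number of decisions the algorithm is allowed to keep.
    order = np.argsort(-scores)
    keep = order[:budget]                              # most confident matches
    defer = order[budget:]                             # left to the human decision maker
    return keep, defer

scores = np.array([0.95, 0.42, 0.88, 0.51, 0.73])
keep, defer = split_decisions(scores, budget=2)
print("algorithm decides items:", keep, "| human decides items:", defer)
```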

[277] Hierarchical Conformal Classification

Floris den Hengst, Inès Blin, Majid Mohammadi, Syed Ihtesham Hussain Shah, Taraneh Younesian

Main category: cs.LG

TL;DR: Hierarchical conformal classification (HCC) extends conformal prediction to incorporate class hierarchies while maintaining coverage guarantees, producing more structured and semantically meaningful prediction sets.

DetailsMotivation: Standard conformal prediction treats classes as flat and unstructured, ignoring valuable domain knowledge such as semantic relationships and hierarchical structures among class labels.

Method: Formulates HCC as a constrained optimization problem that yields prediction sets composed of nodes at different hierarchy levels. Shows that a smaller, well-structured subset of candidate solutions suffices to ensure coverage while maintaining optimality.

Result: Empirical evaluation on three benchmarks (audio, image, text data) demonstrates advantages of the approach. User study shows annotators significantly prefer hierarchical over flat prediction sets.

Conclusion: HCC successfully incorporates class hierarchies into conformal prediction, providing more meaningful prediction sets while preserving statistical coverage guarantees.

Abstract: Conformal prediction (CP) is a powerful framework for quantifying uncertainty in machine learning models, offering reliable predictions with finite-sample coverage guarantees. When applied to classification, CP produces a prediction set of possible labels that is guaranteed to contain the true label with high probability, regardless of the underlying classifier. However, standard CP treats classes as flat and unstructured, ignoring domain knowledge such as semantic relationships or hierarchical structure among class labels. This paper presents hierarchical conformal classification (HCC), an extension of CP that incorporates class hierarchies into both the structure and semantics of prediction sets. We formulate HCC as a constrained optimization problem whose solutions yield prediction sets composed of nodes at different levels of the hierarchy, while maintaining coverage guarantees. To address the combinatorial nature of the problem, we formally show that a much smaller, well-structured subset of candidate solutions suffices to ensure coverage while upholding optimality. An empirical evaluation on three new benchmarks consisting of audio, image, and text data highlights the advantages of our approach, and a user study shows that annotators significantly prefer hierarchical over flat prediction sets.
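As a simplified illustration, not the paper's constrained optimization, the sketch below builds a split-conformal prediction set over leaf classes and then naively rolls complete sibling groups up to their parent node; the toy hierarchy and miscoverage level are assumptions.

```python
# Minimal sketch (simplified): split-conformal sets over leaf classes, then a
# naive roll-up of complete sibling groups to their parent in a toy hierarchy.
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    # Nonconformity = 1 - probability assigned to the true class.
    scores = 1.0 - cal_probs[np.arange(len(cal_labels)), cal_labels]
    level = np.ceil((len(scores) + 1) * (1 - alpha)) / len(scores)
    return np.quantile(scores, level)

def prediction_set(probs, q):
    return {c for c, p in enumerate(probs) if 1.0 - p <= q}

hierarchy = {"mammal": {0, 1}, "bird": {2, 3}}         # toy parent -> leaf classes

def roll_up(leaf_set):
    out = set(leaf_set)
    for parent, children in hierarchy.items():
        if children <= out:                            # all siblings present
            out -= children
            out.add(parent)
    return out

rng = np.random.default_rng(0)
cal_probs = rng.dirichlet(np.ones(4), size=200)        # synthetic calibration data
cal_labels = rng.integers(0, 4, size=200)
q = conformal_threshold(cal_probs, cal_labels)
print(roll_up(prediction_set(np.array([0.40, 0.35, 0.15, 0.10]), q)))
```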

[278] Efficient Constraint-Aware Flow Matching via Randomized Exploration

Zhengyan Huan, Jacob Boerma, Li-Ping Liu, Shuchin Aeron

Main category: cs.LG

TL;DR: Constrained Flow Matching for generating samples that satisfy given constraints, with two approaches: differentiable distance penalty for known constraints and randomization for oracle-based constraints.

DetailsMotivation: Existing Flow Matching methods lack mechanisms to ensure generated samples satisfy specific constraints, limiting their applicability in real-world scenarios where constraints must be met.

Method: Two approaches: (1) For differentiable constraints, add penalty term to FM objective; (2) For oracle-based constraints, use randomization to learn mean flow with high constraint satisfaction likelihood. Also proposes two-stage approach for efficiency.

Result: Numerical experiments show significant gains in constraint satisfaction while maintaining target distribution matching. Successfully applied to adversarial example generation using hard-label black-box classifier queries.

Conclusion: Proposed methods effectively handle constrained generation in Flow Matching, with practical applications in adversarial training. Two-stage approach improves computational efficiency for oracle-based constraints.

Abstract: We consider the problem of generating samples via Flow Matching (FM) with an additional requirement that the generated samples must satisfy given constraints. We consider two scenarios, viz.: (a) when a differentiable distance function to the constraint set is given, and (b) when the constraint set is only available via queries to a membership oracle. For case (a), we propose a simple adaptation of the FM objective with an additional term that penalizes the distance between the constraint set and the generated samples. For case (b), we propose to employ randomization and learn a mean flow that is numerically shown to have a high likelihood of satisfying the constraints. This approach deviates significantly from existing works that require simple convex constraints, knowledge of a barrier function, or a reflection mechanism to constrain the probability flow. Furthermore, in the proposed setting we show that a two-stage approach, where both stages approximate the same original flow but with only the second stage probing the constraints via randomization, is more computationally efficient. Through several synthetic cases of constrained generation, we numerically show that the proposed approaches achieve significant gains in terms of constraint satisfaction while matching the target distributions. As a showcase for a practical oracle-based constraint, we show how our approach can be used for training an adversarial example generator, using queries to a hard-label black-box classifier. We conclude with several future research directions. Our code is available at https://github.com/ZhengyanHuan/FM-RE.
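For the differentiable-constraint case (a), the penalized objective can be sketched as a standard conditional flow matching loss plus a distance penalty on an endpoint estimate; the network, constraint set, and endpoint estimate below are illustrative assumptions.

```python
# Minimal sketch (assumed penalized objective): flow matching on linear paths
# plus a penalty on the distance of a one-step endpoint estimate to the
# constraint set (here, the unit ball, with a differentiable distance).
import torch
import torch.nn as nn

velocity = nn.Sequential(nn.Linear(3, 64), nn.Tanh(), nn.Linear(64, 2))

def constraint_distance(x):
    return torch.clamp(x.norm(dim=-1) - 1.0, min=0.0)  # distance to the unit ball

def penalized_fm_loss(x0, x1, lam=1.0):
    t = torch.rand(x0.shape[0], 1)
    xt = (1 - t) * x0 + t * x1                          # linear interpolation path
    v = velocity(torch.cat([xt, t], dim=-1))
    fm = ((v - (x1 - x0)) ** 2).mean()                  # standard FM regression target
    x1_hat = xt + (1 - t) * v                           # crude endpoint estimate
    return fm + lam * constraint_distance(x1_hat).mean()

x0 = torch.randn(128, 2)                                # noise samples
x1 = torch.randn(128, 2) * 0.5                          # target data samples
print(float(penalized_fm_loss(x0, x1)))
```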

[279] Decoding Communications with Partial Information

Dylan Cope, Peter McBurney

Main category: cs.LG

TL;DR: Exploring language acquisition under partial observability where learners must infer hidden information from environment, actions, and messages rather than having full access to all relevant context.

DetailsMotivation: Traditional language acquisition models assume full observability, but real-world learning often involves inferring hidden information from limited context, making partial observability a more realistic and challenging setting.

Method: The paper presents a learning-based algorithm that decodes private information by analyzing environmental knowledge, actions taken, and messages sent, demonstrated through toy examples and formal exploration of challenges.

Result: The approach shows that language acquisition can be successfully performed even under partial observability conditions by inferring hidden information from available contextual cues.

Conclusion: Partial observability presents a more realistic and challenging setting for language acquisition, and learning-based methods can effectively decode private information to facilitate language learning in such environments.

Abstract: Machine language acquisition is often presented as a problem of imitation learning: there exists a community of language users from which a learner observes speech acts and attempts to decode the mappings between utterances and situations. However, an interesting consideration that is typically unaddressed is partial observability, i.e. the learner is assumed to see all relevant information. This paper explores relaxing this assumption, thereby posing a more challenging setting where such information needs to be inferred from knowledge of the environment, the actions taken, and messages sent. We see several motivating examples of this problem, demonstrate how they can be solved in a toy setting, and formally explore challenges that arise in more general settings. A learning-based algorithm is then presented to perform the decoding of private information to facilitate language acquisition.

[280] A Dual-Attention Graph Network for fMRI Data Classification

Amirali Arbab, Zeinab Davarani, Mehran Safayani

Main category: cs.LG

TL;DR: A novel fMRI classification framework using dynamic graph creation and spatio-temporal attention mechanisms for Autism Spectrum Disorder diagnosis, achieving superior performance over static approaches.

DetailsMotivation: Current fMRI classification methods rely on static functional connectivity and fail to comprehensively capture spatio-temporal relationships in neural activity dynamics, which is crucial for neuroscience advancement.

Method: Dynamic inference of functional brain connectivity using transformer-based attention mechanisms, constructing time-varying graphs processed with Graph Convolutional Networks (GCNs) and transformers to capture both localized interactions and global temporal dependencies.

Result: Achieved 63.2% accuracy and 60.0% AUC on a subset of the ABIDE dataset, outperforming static graph-based approaches (e.g., GCN: 51.8% accuracy).

Conclusion: The framework validates the efficacy of joint modeling of dynamic connectivity and spatio-temporal context for fMRI classification, with core novelty in attention-driven dynamic graph creation and hierarchical spatio-temporal feature fusion.

Abstract: Understanding complex neural activity dynamics is crucial for the development of the field of neuroscience. Because current functional MRI classification approaches tend to be based on static functional connectivity or cannot capture spatio-temporal relationships comprehensively, we present a new framework that leverages dynamic graph creation and spatio-temporal attention mechanisms for Autism Spectrum Disorder (ASD) diagnosis. The approach used in this research dynamically infers functional brain connectivity in each time interval using transformer-based attention mechanisms, enabling the model to selectively focus on crucial brain regions and time segments. By constructing time-varying graphs that are then processed with Graph Convolutional Networks (GCNs) and transformers, our method successfully captures both localized interactions and global temporal dependencies. Evaluated on a subset of the ABIDE dataset, our model achieves 63.2% accuracy and 60.0% AUC, outperforming static graph-based approaches (e.g., GCN: 51.8%). This validates the efficacy of jointly modeling dynamic connectivity and spatio-temporal context for fMRI classification. The core novelty arises from (1) attention-driven dynamic graph creation that learns temporal brain region interactions and (2) hierarchical spatio-temporal feature fusion through GCN-Transformer fusion.
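A rough sketch of the attention-driven dynamic-graph idea (dimensions and layer choices are illustrative guesses, not the authors' architecture): self-attention weights over ROI features within a time window are reused as a per-window adjacency matrix, followed by one GCN-style propagation step.

```python
import torch
import torch.nn as nn

class DynamicGraphBlock(nn.Module):
    """Build a time-varying brain graph from attention weights, then propagate."""

    def __init__(self, d_model, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.lin = nn.Linear(d_model, d_model)

    def forward(self, x):
        # x: (batch, n_regions, d_model) ROI features for one time window
        _, attn_weights = self.attn(x, x, x, need_weights=True,
                                    average_attn_weights=True)
        adj = attn_weights                     # (batch, n_regions, n_regions), rows sum to 1
        return torch.relu(adj @ self.lin(x))   # one GCN-style propagation step
```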

[281] X-MoE: Enabling Scalable Training for Emerging Mixture-of-Experts Architectures on HPC Platforms

Yueming Yuan, Ahan Gupta, Jianping Li, Sajal Dash, Feiyi Wang, Minjia Zhang

Main category: cs.LG

TL;DR: X-MoE is a novel training system that enables scalable training of large Mixture-of-Experts models up to 545B parameters on non-NVIDIA platforms, achieving 10x larger models than existing methods with high throughput.

DetailsMotivation: Current MoE training systems have substantial activation memory overhead, costly communication, and are optimized primarily for NVIDIA GPUs, leaving significant computational potential untapped on other platforms.

Method: X-MoE uses efficient padding-free MoE training with cross-platform kernels, redundancy-bypassing dispatch, and hybrid parallelism with sequence-sharded MoE blocks.

Result: The system scales DeepSeek-style MoEs up to 545 billion parameters across 1024 AMD MI250X GPUs on the Frontier supercomputer, achieving 10x larger trainable models than existing methods under the same hardware budget.

Conclusion: X-MoE successfully addresses the scalability limitations of current MoE training systems and enables efficient training of next-generation MoE architectures on diverse hardware platforms.

Abstract: Emerging expert-specialized Mixture-of-Experts (MoE) architectures, such as DeepSeek-MoE, deliver strong model quality through fine-grained expert segmentation and large top-k routing. However, their scalability is limited by substantial activation memory overhead and costly all-to-all communication. Furthermore, current MoE training systems - primarily optimized for NVIDIA GPUs - perform suboptimally on non-NVIDIA platforms, leaving significant computational potential untapped. In this work, we present X-MoE, a novel MoE training system designed to deliver scalable training performance for next-generation MoE architectures. X-MoE achieves this via several novel techniques, including efficient padding-free MoE training with cross-platform kernels, redundancy-bypassing dispatch, and hybrid parallelism with sequence-sharded MoE blocks. Our evaluation on the Frontier supercomputer, powered by AMD MI250X GPUs, shows that X-MoE scales DeepSeek-style MoEs up to 545 billion parameters across 1024 GPUs - 10x larger than the largest trainable model with existing methods under the same hardware budget, while maintaining high training throughput. The source code of X-MoE is available at https://github.com/Supercomputing-System-AI-Lab/X-MoE.
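For intuition only, a toy illustration of what "padding-free" token dispatch means for MoE layers (top-1 routing assumed; X-MoE's cross-platform kernels and redundancy-bypassing dispatch are far more involved): tokens are grouped by their routed expert without padding each expert's buffer to a fixed capacity.

```python
import torch

def padding_free_dispatch(tokens, expert_ids, num_experts):
    """Group tokens by expert without capacity padding.

    tokens     : (n_tokens, d_model) token activations
    expert_ids : (n_tokens,) top-1 expert index per token (illustrative routing)
    Returns a list of per-expert token blocks and the permutation used,
    so expert outputs can later be scattered back to the original order.
    """
    order = torch.argsort(expert_ids)                      # group tokens by expert
    counts = torch.bincount(expert_ids, minlength=num_experts)
    per_expert = torch.split(tokens[order], counts.tolist())
    return per_expert, order
```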

[282] Dimension lower bounds for linear approaches to function approximation

Daniel Hsu

Main category: cs.LG

TL;DR: Linear algebraic approach for dimension lower bounds in L^2 function approximation, extending Kolmogorov n-width bounds to kernel methods.

DetailsMotivation: To establish fundamental limitations on linear methods for function approximation problems by providing dimension lower bounds that apply to kernel methods.

Method: Uses a linear algebraic argument previously applied to Kolmogorov n-widths (Barron, 1993) and extends it to derive sample size lower bounds for kernel methods in L^2 function approximation.

Result: The approach yields dimension lower bounds that constrain the performance of linear approximation methods, including kernel-based approaches.

Conclusion: The linear algebraic framework provides a unified method for establishing fundamental lower bounds on the complexity required by linear methods for function approximation tasks.

Abstract: This short note presents a linear algebraic approach to proving dimension lower bounds for linear methods that solve $L^2$ function approximation problems. The basic argument has appeared in the literature before (e.g., Barron, 1993) for establishing lower bounds on Kolmogorov $n$-widths. The argument is applied to give sample size lower bounds for kernel methods.
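To illustrate the flavor of such linear-algebraic arguments (a standard version, not necessarily the exact statement in the note): if $f_1,\dots,f_m$ are orthonormal in $L^2$ and $V$ is any subspace of dimension $n < m$ with orthonormal basis $e_1,\dots,e_n$, then by Bessel's inequality

$$\sum_{i=1}^{m}\|P_V f_i\|^2 = \sum_{j=1}^{n}\sum_{i=1}^{m}|\langle f_i, e_j\rangle|^2 \le \sum_{j=1}^{n}\|e_j\|^2 = n,$$

so some $f_i$ satisfies $\|f_i - P_V f_i\|^2 = 1 - \|P_V f_i\|^2 \ge 1 - n/m$. In words, no linear method whose outputs live in an $n$-dimensional space can approximate every member of a large orthonormal family well unless $n$ is comparable to $m$.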

[283] Counterfactual Probabilistic Diffusion with Expert Models

Wenhao Mu, Zhi Cao, Mehmed Uludag, Alexander Rodríguez

Main category: cs.LG

TL;DR: ODE-Diff is a time series diffusion framework that combines expert mechanistic models with data-driven approaches for reliable counterfactual distribution prediction in complex dynamical systems.

DetailsMotivation: Existing methods for predicting counterfactual distributions often rely on point estimates or purely data-driven models, which struggle with data scarcity and lack reliability in scientific domains like public health and medicine.

Method: A time series diffusion-based framework that incorporates guidance from imperfect expert models by extracting high-level signals to serve as structured priors for generative modeling, bridging mechanistic and data-driven approaches.

Result: ODE-Diff consistently outperforms strong baselines in both point prediction and distributional accuracy across semi-synthetic COVID-19 simulations, synthetic pharmacological dynamics, and real-world case studies.

Conclusion: The proposed method enables more reliable and interpretable causal inference by effectively combining expert knowledge with data-driven modeling, addressing limitations of existing approaches in data-scarce scenarios.

Abstract: Predicting counterfactual distributions in complex dynamical systems is essential for scientific modeling and decision-making in domains such as public health and medicine. However, existing methods often rely on point estimates or purely data-driven models, which tend to falter under data scarcity. We propose a time series diffusion-based framework that incorporates guidance from imperfect expert models by extracting high-level signals to serve as structured priors for generative modeling. Our method, ODE-Diff, bridges mechanistic and data-driven approaches, enabling more reliable and interpretable causal inference. We evaluate ODE-Diff across semi-synthetic COVID-19 simulations, synthetic pharmacological dynamics, and real-world case studies, demonstrating that it consistently outperforms strong baselines in both point prediction and distributional accuracy.

[284] Adaptive Conformal Prediction Intervals Over Trajectory Ensembles

Ruipu Li, Daniel Menacho, Alexander Rodríguez

Main category: cs.LG

TL;DR: A conformal prediction framework that transforms uncalibrated trajectory samples into calibrated prediction intervals with theoretical coverage guarantees.

DetailsMotivation: Trajectory predictions from probabilistic models or multiple predictors are commonly uncalibrated despite reflecting inherent uncertainty, limiting their reliability in critical applications like autonomous driving and forecasting.

Method: Proposes a unified framework using conformal prediction with novel online update and optimization steps that capture inter-step dependencies to produce calibrated prediction intervals around trajectories.

Result: The method generates discontinuous prediction intervals that naturally capture temporal dependencies and yields sharper, more adaptive uncertainty estimates compared to uncalibrated trajectories.

Conclusion: The framework provides theoretically guaranteed coverage for trajectory predictions, making uncertainty estimates more reliable and adaptive for real-world applications requiring calibrated trajectory forecasts.

Abstract: Future trajectories play an important role across domains such as autonomous driving, hurricane forecasting, and epidemic modeling, where practitioners commonly generate ensemble paths by sampling probabilistic models or leveraging multiple autoregressive predictors. While these trajectories reflect inherent uncertainty, they are typically uncalibrated. We propose a unified framework based on conformal prediction that transforms sampled trajectories into calibrated prediction intervals with theoretical coverage guarantees. By introducing a novel online update step and an optimization step that captures inter-step dependencies, our method can produce discontinuous prediction intervals around each trajectory, naturally capture temporal dependencies, and yield sharper, more adaptive uncertainty estimates.
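A simplified stand-in (not the paper's exact online update or its inter-step optimization) showing how sampled trajectories can be turned into adaptively calibrated per-step intervals, using an adaptive-conformal-style update of the working miscoverage level:

```python
import numpy as np

def online_conformal_intervals(samples, y_obs, alpha=0.1, gamma=0.05):
    """Per-step intervals from trajectory samples with an online coverage update.

    samples : (T, K) array of K sampled trajectories over T steps (illustrative)
    y_obs   : (T,) realized values, revealed one step at a time
    alpha   : target miscoverage; gamma : online step size
    """
    T, _ = samples.shape
    alpha_t = alpha
    lower, upper = np.empty(T), np.empty(T)
    for t in range(T):
        cov = 1.0 - np.clip(alpha_t, 0.0, 1.0)
        lower[t] = np.quantile(samples[t], (1 - cov) / 2)
        upper[t] = np.quantile(samples[t], (1 + cov) / 2)
        miss = float(not (lower[t] <= y_obs[t] <= upper[t]))
        alpha_t += gamma * (alpha - miss)    # adaptive-conformal-style update
    return lower, upper
```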

[285] Batching-Aware Joint Model Onloading and Offloading for Hierarchical Multi-Task Inference

Seohyeon Cha, Kevin Chan, Gustavo de Veciana, Haris Vikalo

Main category: cs.LG

TL;DR: J3O is a framework for joint optimization of model onloading (deployment) and query offloading (routing) in multi-task edge inference systems, achieving near-optimal accuracy with significantly reduced runtime.

DetailsMotivation: Real-world applications like autonomous driving and AR require concurrent execution of multiple tasks, but existing edge inference frameworks only handle single-task scenarios, creating a need for unified multi-task optimization.

Method: Formulated as mixed-integer program, J3O uses alternating algorithm with greedy model selection via Lagrangian-relaxed submodular optimization and optimal offloading via constrained linear programming, extended for edge batching.

Result: J3O achieves over 97% of optimal accuracy while requiring less than 15% of the runtime compared to optimal solver across multi-task benchmarks.

Conclusion: The proposed J3O framework effectively addresses multi-task edge inference optimization, providing near-optimal performance with practical computational efficiency for real-world deployment.

Abstract: The growing demand for intelligent services on resource-constrained edge devices has spurred the development of collaborative inference systems that distribute workloads across end devices, edge servers, and the cloud. While most existing frameworks focus on single-task, single-model scenarios, many real-world applications (e.g., autonomous driving and augmented reality) require concurrent execution of diverse tasks including detection, segmentation, and depth estimation. In this work, we propose a unified framework to jointly decide which multi-task models to deploy (onload) at clients and edge servers, and how to route queries across the hierarchy (offload) to maximize overall inference accuracy under memory, compute, and communication constraints. We formulate this as a mixed-integer program and introduce J3O (Joint Optimization of Onloading and Offloading), an alternating algorithm that (i) greedily selects models to onload via Lagrangian-relaxed submodular optimization and (ii) determines optimal offloading via constrained linear programming. We further extend J3O to account for batching at the edge, maintaining scalability under heterogeneous task loads. Experiments show J3O consistently achieves over $97\%$ of the optimal accuracy while incurring less than $15\%$ of the runtime required by the optimal solver across multi-task benchmarks.
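As a sketch of the model-onloading half of the alternation (the offloading LP and the Lagrangian relaxation are omitted; `gain_fn` is an assumed oracle for the marginal accuracy gain of adding a model):

```python
def greedy_onload(candidates, gain_fn, memory_of, budget):
    """Greedy model selection under a memory budget (illustrative skeleton).

    candidates : iterable of candidate models
    gain_fn    : gain_fn(selected, m) -> marginal accuracy gain of adding m
    memory_of  : memory_of(m) -> memory footprint of model m
    budget     : total memory budget
    """
    selected, used = [], 0.0
    remaining = list(candidates)
    while remaining:
        best, best_ratio = None, 0.0
        for m in remaining:
            mem = memory_of(m)
            if mem <= 0 or used + mem > budget:
                continue
            ratio = gain_fn(selected, m) / mem   # gain per unit of memory
            if ratio > best_ratio:
                best, best_ratio = m, ratio
        if best is None:
            break
        selected.append(best)
        used += memory_of(best)
        remaining.remove(best)
    return selected
```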

Nooshin Bahador, Milad Lankarany

Main category: cs.LG

TL;DR: Chirp-based outlier detection with weighted spatial metrics effectively localizes seizure onset zones, showing high concordance in successful surgical cases.

DetailsMotivation: To develop a quantitative framework for evaluating spatial concordance between clinically defined seizure onset zones and statistically anomalous channels identified through time-frequency analysis of chirp events.

Method: Two-step methodology: (1) Unsupervised Outlier Detection using Local Outlier Factor analysis with adaptive neighborhood selection on spectro-temporal features; (2) Spatial Correlation Analysis computing exact co-occurrence metrics and weighted index similarity incorporating hemispheric congruence and electrode proximity.

Result: LOF-based approach effectively detects outliers, with weighted index matching outperforming exact matching. Highest performance in seizure-free patients (Index Precision mean: 0.903) and successful surgical outcomes (Index Precision mean: 0.865), while failure cases showed lower concordance (Index Precision mean: 0.460).

Conclusion: Chirp-based outlier detection combined with weighted spatial metrics provides a complementary method for SOZ localization, particularly effective in patients with successful surgical outcomes.

Abstract: This study presents a quantitative framework for evaluating the spatial concordance between clinically defined seizure onset zones (SOZs) and statistically anomalous channels identified through time-frequency analysis of chirp events. The proposed pipeline employs a two-step methodology: (1) Unsupervised Outlier Detection, where Local Outlier Factor (LOF) analysis with adaptive neighborhood selection identifies anomalous channels based on spectro-temporal features of chirps (onset frequency, offset frequency, and temporal duration); and (2) Spatial Correlation Analysis, which computes both exact co-occurrence metrics and weighted index similarity, incorporating hemispheric congruence and electrode proximity. Key findings demonstrate that the LOF-based approach (n_neighbors = 20, contamination = 0.2) effectively detects outliers, with index matching (weighted by channel proximity) outperforming exact matching in SOZ localization. Performance metrics (precision, recall, F1) were highest for seizure-free patients (Index Precision mean: 0.903) and those with successful surgical outcomes (Index Precision mean: 0.865), whereas failure cases exhibited lower concordance (Index Precision mean: 0.460). The key takeaway is that chirp-based outlier detection, combined with weighted spatial metrics, provides a complementary method for SOZ localization, particularly in patients with successful surgical outcomes.
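The outlier-detection step maps directly onto a standard library call; a minimal sketch assuming a per-channel feature matrix (the adaptive neighborhood selection described in the abstract is not reproduced here):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def chirp_outlier_channels(features, n_neighbors=20, contamination=0.2):
    """Flag anomalous channels from chirp spectro-temporal features.

    features : (n_channels, n_features) array of onset frequency,
               offset frequency, and duration per channel (assumed layout)
    Returns indices of channels labelled as outliers by LOF.
    """
    lof = LocalOutlierFactor(n_neighbors=n_neighbors, contamination=contamination)
    labels = lof.fit_predict(features)   # -1 marks outliers, 1 marks inliers
    return np.where(labels == -1)[0]
```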

[287] NovoMolGen: Rethinking Molecular Language Model Pretraining

Kamran Chitsaz, Roshan Balaji, Quentin Fournier, Nirav Pravinbhai Bhatt, Sarath Chandar

Main category: cs.LG

TL;DR: NovoMolGen is a transformer-based foundation model pretrained on 1.5B molecules that establishes new SOTA results for molecular generation, outperforming prior Mol-LLMs and specialized models in both unconstrained and goal-directed tasks.

DetailsMotivation: Efficient exploration of vast chemical space (10^23 to 10^60 molecules) requires scalable approaches. While Molecular Large Language Models (Mol-LLMs) have emerged, there's limited understanding of how standard NLP practices (text representations, tokenization, model size, dataset scale) impact molecular generation performance.

Method: Introduced NovoMolGen family of transformer-based foundation models pretrained on 1.5 billion molecules. Systematically investigated critical aspects: textual representations, tokenization strategies, model size, and dataset scale impact on molecular generation.

Result: Identified weak correlation between pretraining metrics and downstream performance, revealing important distinctions between molecular and general NLP training dynamics. Substantially outperformed prior Mol-LLMs and specialized generative models in both unconstrained and goal-directed molecular generation tasks.

Conclusion: NovoMolGen provides a robust foundation for advancing efficient and effective molecular modeling strategies, establishing new state-of-the-art results for de-novo molecule generation with desired property profiles.

Abstract: Designing de-novo molecules with desired property profiles requires efficient exploration of the vast chemical space ranging from $10^{23}$ to $10^{60}$ possible synthesizable candidates. While various deep generative models have been developed to design small molecules using diverse input representations, Molecular Large Language Models (Mol-LLMs) based on string representations have emerged as a scalable approach capable of exploring billions of molecules. However, there remains limited understanding regarding how standard language modeling practices such as textual representations, tokenization strategies, model size, and dataset scale impact molecular generation performance. In this work, we systematically investigate these critical aspects by introducing NovoMolGen, a family of transformer-based foundation models pretrained on 1.5 billion molecules for de-novo molecule generation. Through extensive empirical analyses, we identify a weak correlation between performance metrics measured during pretraining and actual downstream performance, revealing important distinctions between molecular and general NLP training dynamics. NovoMolGen establishes new state-of-the-art results, substantially outperforming prior Mol-LLMs and specialized generative models in both unconstrained and goal-directed molecular generation tasks, thus providing a robust foundation for advancing efficient and effective molecular modeling strategies.

[288] Decentralized Contextual Bandits with Network Adaptivity

Chuyun Deng, Huiwen Jia

Main category: cs.LG

TL;DR: Network-aware contextual bandit algorithms that enable adaptive information sharing across networked agents, reducing learning complexity from O(N) to sublinear O(√N) while maintaining lighter communication costs than fully centralized approaches.

DetailsMotivation: Address the gap in contextual bandits for networked environments where information is partially shared, as classical approaches assume either fully centralized data or entirely isolated learners without considering structural similarities and local differences across multiple locations.

Method: Developed two network-aware UCB algorithms: NetLinUCB and Net-SGD-UCB, which use dynamically updated network weights to guide adaptive information sharing. The approach decomposes learning into global and local components, allowing agents to share computed summaries about homogeneous features without full synchronization.

Result: Established regret bounds showing reduced learning complexity from O(N) to sublinear O(√N) where N is network size. NetLinUCB excels in low-noise regimes with fine-grained heterogeneity, while Net-SGD-UCB is robust to high-dimensional, high-variance contexts. Demonstrated effectiveness in simulated pricing environments compared to standard benchmarks.

Conclusion: The proposed network-aware algorithms successfully bridge the gap between fully centralized and isolated learning approaches, enabling efficient information sharing across networks while maintaining practical communication costs and providing complementary strengths for different environmental conditions.

Abstract: We consider contextual linear bandits over networks, a class of sequential decision-making problems where learning occurs simultaneously across multiple locations and the reward distributions share structural similarities while also exhibiting local differences. While classical contextual bandits assume either fully centralized data or entirely isolated learners, much remains unexplored in networked environments when information is partially shared. In this paper, we address this gap by developing two network-aware Upper Confidence Bound (UCB) algorithms, NetLinUCB and Net-SGD-UCB, which enable adaptive information sharing guided by dynamically updated network weights. Our approach decomposes learning into global and local components and, as a result, allows agents to benefit from shared structure without full synchronization. Both algorithms incur lighter communication costs compared to a fully centralized setting as agents only share computed summaries regarding the homogeneous features. We establish regret bounds showing that our methods reduce the learning complexity associated with the shared structure from $O(N)$ to sublinear $O(\sqrt{N})$, where $N$ is the size of the network. The two algorithms reveal complementary strengths: NetLinUCB excels in low-noise regimes with fine-grained heterogeneity, while Net-SGD-UCB is robust to high-dimensional, high-variance contexts. We further demonstrate the effectiveness of our methods across simulated pricing environments compared to standard benchmarks.
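A stripped-down illustration of the ingredients (per-agent LinUCB updates plus network-weighted sharing of sufficient statistics); the actual NetLinUCB/Net-SGD-UCB decomposition into global and local components and the dynamic weight updates are richer than this sketch:

```python
import numpy as np

class LinUCBAgent:
    """Standard LinUCB sufficient statistics for a single location."""

    def __init__(self, d, alpha=1.0, lam=1.0):
        self.A = lam * np.eye(d)   # regularized design matrix
        self.b = np.zeros(d)
        self.alpha = alpha

    def choose(self, arm_contexts):
        theta = np.linalg.solve(self.A, self.b)
        A_inv = np.linalg.inv(self.A)
        scores = [x @ theta + self.alpha * np.sqrt(x @ A_inv @ x)
                  for x in arm_contexts]
        return int(np.argmax(scores))

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x

def share_summaries(agents, weights):
    """Network-weighted averaging of the agents' summaries. The paper shares
    summaries only for the homogeneous (shared) features; here the full
    statistics are averaged for brevity."""
    new_A = [sum(w * a.A for a, w in zip(agents, weights[i]))
             for i in range(len(agents))]
    new_b = [sum(w * a.b for a, w in zip(agents, weights[i]))
             for i in range(len(agents))]
    for agent, A, b in zip(agents, new_A, new_b):
        agent.A, agent.b = A, b
```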

[289] MAVIS: Multi-Objective Alignment via Value-Guided Inference-Time Search

Jeremy Carleton, Debajoy Mukherjee, Srinivas Shakkottai, Dileep Kalathil

Main category: cs.LG

TL;DR: MAVIS enables dynamic multi-objective alignment of LLMs at inference time using small value models and tilting functions, eliminating the need for expensive per-objective fine-tuning.

DetailsMotivation: Traditional fine-tuning for each objective combination is computationally expensive and inflexible for handling diverse user preferences in multi-objective LLM applications.

Method: Trains small value models for each objective, combines them with user-specified weights at inference time to create tilting functions that adjust the base model’s output distribution using KL-regularized policy.

Result: Outperforms baseline methods that fine-tune per-objective models and combine them post hoc, approaching performance of models fine-tuned for exact user preferences.

Conclusion: MAVIS provides a lightweight, flexible inference-time framework for dynamic multi-objective alignment without modifying base model weights, offering computational efficiency and adaptability to diverse user preferences.

Abstract: Large Language Models (LLMs) are increasingly deployed across diverse applications that demand balancing multiple, often conflicting, objectives – such as helpfulness, harmlessness, or humor. Aligning outputs to user-specific preferences in such multi-objective settings typically requires fine-tuning models for each objective or preference configuration, which is computationally expensive and inflexible. We introduce MAVIS – Multi-Objective Alignment via Value-Guided Inference-Time Search – a lightweight inference-time alignment framework that enables dynamic control over LLM behavior without modifying the base model’s weights. MAVIS trains a set of small value models, each corresponding to a distinct objective. At inference time, these value models are combined using user-specified weights to produce a tilting function that adjusts the base model’s output distribution toward desired trade-offs. The value models are trained using a simple iterative algorithm that ensures monotonic improvement of the KL-regularized policy. We show empirically that MAVIS outperforms baselines that fine-tune per-objective models and combine them post hoc, and even approaches the performance of the idealized setting where models are fine-tuned for a user’s exact preferences.
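A hedged sketch of the inference-time tilting step (shapes and the exact tilting form are assumptions): each objective's value model scores every candidate next token, the scores are combined with the user's weights, and the result is added to the frozen base model's logits.

```python
import torch

def mavis_adjusted_logits(base_logits, value_scores, weights, beta=1.0):
    """Combine per-objective value scores into a tilt on the base logits.

    base_logits  : (vocab,) next-token logits from the frozen base LLM
    value_scores : (n_objectives, vocab) per-token scores from small value models
    weights      : (n_objectives,) user-specified preference weights
    beta         : strength of the KL-regularized tilt (illustrative)
    """
    tilt = torch.einsum("o,ov->v", weights, value_scores)
    return base_logits + beta * tilt
```

Sampling then proceeds from the softmax of the adjusted logits, so preference weights can be changed per request without touching the base model's parameters.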

[290] EventTSF: Event-Aware Non-Stationary Time Series Forecasting

Yunfeng Ge, Ming Jin, Yiji Zhao, Hongyan Li, Bo Du, Chang Xu, Shirui Pan

Main category: cs.LG

TL;DR: EventTSF is a novel autoregressive diffusion framework that integrates textual events with time series data to improve non-stationary forecasting by addressing multimodal synchronization challenges and event-induced uncertainty.

DetailsMotivation: Current time series forecasting approaches are limited by single-modality methods that fail to incorporate natural language-based external events, resulting in limited contextual knowledge and poor performance in non-stationary domains like energy and transportation.

Method: EventTSF uses autoregressive diffusion with flow matching at each step, adaptively controlling flow matching timesteps based on event semantic signals. It employs a multimodal U-shaped diffusion transformer to fuse temporal and textual modalities across different resolutions.

Result: Extensive experiments on 8 synthetic and real-world datasets show EventTSF outperforms 12 baselines, achieving 10.7% higher forecasting accuracy and 1.13x faster training efficiency.

Conclusion: The proposed EventTSF framework successfully addresses the challenges of multimodal time series forecasting by effectively integrating textual events with temporal data through adaptive diffusion and transformer-based fusion, demonstrating significant improvements in both accuracy and efficiency.

Abstract: Time series forecasting plays a vital role in critical domains like energy and transportation, where non-stationary dynamics are deeply intertwined with events in other modalities such as texts. However, incorporating natural language-based external events to improve non-stationary forecasting remains largely unexplored, as most approaches still rely on a single modality, resulting in limited contextual knowledge and model underperformance. Enabling fine-grained multimodal interactions between temporal and textual data is challenged by three fundamental issues: (1) the difficulty of fine-grained synchronization between time-varying discrete textual events and continuous time series; (2) the inherent temporal uncertainty introduced by textual semantics; and (3) the misalignment between textual event embeddings and multi-resolution temporal patterns. In this work, we address these challenges by introducing event-aware non-stationary time series forecasting (EventTSF), an autoregressive generation framework that integrates historical time series with textual events to make subsequent forecasts. Specifically, EventTSF uses autoregressive diffusion with flow matching at each step to capture nuanced temporal-event interactions. To handle event-induced uncertainty, flow matching timesteps are adaptively controlled according to event semantic signals. The underlying denoiser employs a multimodal U-shaped diffusion transformer that efficiently fuses temporal and textual modalities across different resolutions. Extensive experiments on 8 synthetic and real-world datasets show that EventTSF outperforms 12 baselines across diverse event-aware non-stationary time series forecasting scenarios, achieving substantial improvements of 10.7% higher forecasting accuracy and $1.13\times$ faster training efficiency.

[291] SVDformer: Direction-Aware Spectral Graph Embedding Learning via SVD and Transformer

Jiayu Fang, Zhiqi Shao, S T Boris Choy, Junbin Gao

Main category: cs.LG

TL;DR: SVDformer combines SVD and Transformer for direction-aware graph learning, outperforming state-of-the-art methods on directed graph node classification.

DetailsMotivation: Existing directed graph neural networks struggle to capture both directional semantics and global structural patterns due to isotropic aggregation and localized filtering mechanisms.

Method: SVDformer refines singular value embeddings through multi-head self-attention to enhance critical spectral components and suppress noise, using singular vectors as directional projection bases and singular values as scaling factors to model multi-scale interactions between incoming/outgoing edge patterns.

Result: Extensive experiments on six directed graph benchmarks show SVDformer consistently outperforms state-of-the-art GNNs and direction-aware baselines on node classification tasks.

Conclusion: SVDformer establishes a new paradigm for learning representations on directed graphs by synergizing SVD and Transformer architecture for explicit direction preservation during feature propagation.

Abstract: Directed graphs are widely used to model asymmetric relationships in real-world systems. However, existing directed graph neural networks often struggle to jointly capture directional semantics and global structural patterns due to their isotropic aggregation mechanisms and localized filtering mechanisms. To address this limitation, this paper proposes SVDformer, a novel framework that synergizes SVD and Transformer architecture for direction-aware graph representation learning. SVDformer first refines singular value embeddings through multi-head self-attention, adaptively enhancing critical spectral components while suppressing high-frequency noise. This enables learnable low-pass/high-pass graph filtering without requiring spectral kernels. Furthermore, by treating singular vectors as directional projection bases and singular values as scaling factors, SVDformer uses the Transformer to model multi-scale interactions between incoming/outgoing edge patterns through attention weights, thereby explicitly preserving edge directionality during feature propagation. Extensive experiments on six directed graph benchmarks demonstrate that SVDformer consistently outperforms state-of-the-art GNNs and direction-aware baselines on node classification tasks, establishing a new paradigm for learning representations on directed graphs.
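A loose sketch of the SVD-plus-attention idea (layer sizes and the gating form are illustrative guesses, not the paper's configuration): decompose the directed adjacency matrix, refine embeddings of the leading singular values with multi-head self-attention, and use the result to rescale the spectral components before propagation.

```python
import torch
import torch.nn as nn

class SVDSpectralRefiner(nn.Module):
    """Refine singular-value embeddings with attention and rescale the spectrum."""

    def __init__(self, k, d_model=64, n_heads=4):
        super().__init__()
        self.k = k                                # number of spectral components kept
        self.embed = nn.Linear(1, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.scale = nn.Linear(d_model, 1)

    def forward(self, adj):
        U, S, Vh = torch.linalg.svd(adj)          # direction-aware decomposition
        U, S, Vh = U[:, :self.k], S[:self.k], Vh[:self.k, :]
        s_emb = self.embed(S.unsqueeze(-1)).unsqueeze(0)      # (1, k, d_model)
        refined, _ = self.attn(s_emb, s_emb, s_emb)
        gates = torch.sigmoid(self.scale(refined)).squeeze(0).squeeze(-1)  # (k,)
        return U @ torch.diag(S * gates) @ Vh     # filtered propagation matrix
```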

[292] Dynamic Design of Machine Learning Pipelines via Metalearning

Edesio Alcobaça, André C. P. L. F. de Carvalho

Main category: cs.LG

TL;DR: Metalearning method that dynamically designs search spaces for AutoML using historical knowledge to accelerate optimization and reduce computational costs.

DetailsMotivation: Traditional AutoML methods have high computational costs and large search spaces that can lead to overfitting, requiring more efficient optimization strategies.

Method: Proposes a metalearning approach that uses historical metaknowledge to select promising regions of the search space, reducing the exploration area for AutoML systems.

Result: Reduced runtime by 89% in Random Search and significantly reduced search space size (1.8/13 preprocessors and 4.3/16 classifiers) without compromising predictive performance. Also showed competitive performance when adapted to Auto-Sklearn.

Conclusion: The metalearning method effectively accelerates AutoML optimization by intelligently reducing search spaces while maintaining performance, with additional insights on meta-feature selection and explainability.

Abstract: Automated machine learning (AutoML) has democratized the design of machine learning based systems, by automating model selection, hyperparameter tuning and feature engineering. However, the high computational cost associated with traditional search and optimization strategies, such as Random Search, Particle Swarm Optimization and Bayesian Optimization, remains a significant challenge. Moreover, AutoML systems typically explore a large search space, which can lead to overfitting. This paper introduces a metalearning method for dynamically designing search spaces for AutoML systems. The proposed method uses historical metaknowledge to select promising regions of the search space, accelerating the optimization process. According to experiments conducted for this study, the proposed method can reduce runtime by 89% in Random Search and shrink the search space (to 1.8/13 preprocessors and 4.3/16 classifiers) without significantly compromising predictive performance. Moreover, the proposed method showed competitive performance when adapted to Auto-Sklearn, reducing its search space. Furthermore, this study encompasses insights into meta-feature selection, meta-model explainability, and the trade-offs inherent in search space reduction strategies.

[293] ASAP: Unsupervised Post-training with Label Distribution Shift Adaptive Learning Rate

Heewon Park, Mugon Joe, Miru Kim, Minhae Kwon

Main category: cs.LG

TL;DR: ASAP is a lightweight unsupervised method that dynamically adjusts learning rates using cosine distance between current and previous model outputs to effectively adapt to online label shifts without requiring labels or ensembles.

DetailsMotivation: Machine learning models in real-world applications face online label shift where label distributions change over time, requiring careful learning rate selection - too low slows adaptation and too high causes instability.

Method: Proposes ASAP (Adaptive Shift Aware Post-training) which dynamically adjusts learning rate by computing cosine distance between current and previous unlabeled outputs and mapping it within a bounded range. Uses only previous softmax output, no labels, ensembles, or past inputs.

Result: Experiments across multiple datasets and shift scenarios show ASAP consistently improves accuracy and efficiency compared to baseline methods.

Conclusion: ASAP provides practical, fast, and lightweight unsupervised model adaptation for online label shift scenarios, making it suitable for real-world deployment.

Abstract: In real-world applications, machine learning models face online label shift, where label distributions change over time. Effective adaptation requires careful learning rate selection: too low slows adaptation and too high causes instability. We propose ASAP (Adaptive Shift Aware Post-training), which dynamically adjusts the learning rate by computing the cosine distance between current and previous unlabeled outputs and mapping it within a bounded range. ASAP requires no labels, model ensembles, or past inputs, using only the previous softmax output for fast, lightweight adaptation. Experiments across multiple datasets and shift scenarios show ASAP consistently improves accuracy and efficiency, making it practical for unsupervised model adaptation.
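The rule is simple enough to write down directly; a minimal sketch assuming a linear map from cosine distance to a bounded learning-rate range (the paper's exact mapping is not specified here):

```python
import numpy as np

def asap_learning_rate(prev_probs, curr_probs, lr_min=1e-5, lr_max=1e-3):
    """Map the cosine distance between consecutive softmax outputs to a LR.

    prev_probs, curr_probs : averaged softmax output vectors from the previous
                             and current unlabeled batches
    """
    cos_sim = float(np.dot(prev_probs, curr_probs) /
                    (np.linalg.norm(prev_probs) * np.linalg.norm(curr_probs) + 1e-12))
    shift = np.clip((1.0 - cos_sim) / 2.0, 0.0, 1.0)   # normalized shift signal
    return lr_min + shift * (lr_max - lr_min)          # larger shift -> larger LR
```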

[294] Hierarchy-Consistent Learning and Adaptive Loss Balancing for Hierarchical Multi-Label Classification

Ruobing Jiang, Mengzhe Liu, Haobing Liu, Yanwei Yu

Main category: cs.LG

TL;DR: HCAL classifier for hierarchical multi-label classification using prototype contrastive learning and adaptive task-weighting to maintain structural consistency and balance optimization.

DetailsMotivation: Address challenges in hierarchical multi-label classification including maintaining structural consistency and balancing loss weighting in multi-task learning, which suffers from "one-strong-many-weak" optimization bias.

Method: Proposes HCAL classifier with prototype contrastive learning, adaptive task-weighting mechanisms that dynamically allocate optimization resources, and prototype perturbation with controlled noise injection. Introduces Hierarchical Violation Rate (HVR) metric for evaluation.

Result: Extensive experiments across three datasets demonstrate higher classification accuracy and reduced hierarchical violation rate compared to baseline models.

Conclusion: The proposed HCAL classifier effectively addresses structural consistency and optimization bias issues in hierarchical multi-label classification through semantic consistency mechanisms and adaptive weighting.

Abstract: Hierarchical Multi-Label Classification (HMC) faces critical challenges in maintaining structural consistency and balancing loss weighting in Multi-Task Learning (MTL). To address these issues, we propose a classifier called HCAL based on MTL integrated with prototype contrastive learning and adaptive task-weighting mechanisms. The most significant advantage of our classifier is semantic consistency, including both prototypes that explicitly model labels and feature aggregation from child classes to parent classes. The other important advantage is an adaptive loss-weighting mechanism that dynamically allocates optimization resources by monitoring task-specific convergence rates. It effectively resolves the “one-strong-many-weak” optimization bias inherent in traditional MTL approaches. To further enhance robustness, a prototype perturbation mechanism is formulated by injecting controlled noise into prototypes to expand decision boundaries. Additionally, we formalize a quantitative metric called the Hierarchical Violation Rate (HVR) to evaluate hierarchical consistency and generalization. Extensive experiments across three datasets demonstrate both the higher classification accuracy and the reduced hierarchical violation rate of the proposed classifier over baseline models.
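One way to read the adaptive weighting idea (an assumption for illustration, not the paper's exact monitoring rule): weight each task inversely to its recent convergence rate, so slowly improving tasks receive more of the optimization budget.

```python
def adaptive_task_weights(loss_history, eps=1e-8):
    """Convergence-rate-based loss weights for multi-task training.

    loss_history : dict mapping task name -> list of recent loss values
    Returns normalized weights; tasks improving slowly get larger weights.
    """
    rates = {}
    for task, losses in loss_history.items():
        if len(losses) < 2:
            rates[task] = 1.0                      # no history yet: neutral rate
        else:
            improvement = max(losses[-2] - losses[-1], 0.0)
            rates[task] = improvement / (losses[-2] + eps)
    inverse = {t: 1.0 / (r + eps) for t, r in rates.items()}
    total = sum(inverse.values())
    return {t: v / total for t, v in inverse.items()}
```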

[295] Classifying Clinical Outcome of Epilepsy Patients with Ictal Chirp Embeddings

Nooshin Bahador, Milad Lankarany

Main category: cs.LG

TL;DR: A pipeline using t-SNE for visualizing chirp features with interpretable embeddings and SHAP explanations for clinical outcome classification tasks.

DetailsMotivation: To develop an interpretable visualization framework for chirp-based features that can help with clinical stratification and decision support across diverse outcome scenarios.

Method: Used t-SNE for dimensionality reduction to preserve local neighborhood relationships, then applied four classifiers (Random Forests, SVM, Logistic Regression, k-NN) on 2D embeddings for three classification tasks, and generated SHAP-based feature influence sensitivity maps.

Result: Random Forest and k-NN classifiers achieved up to 88.8% accuracy in optimal case detection, with SHAP maps revealing spatially localized feature importance and how specific chirp attributes drive clustering patterns.

Conclusion: The integrated framework demonstrates the potential of interpretable embeddings and local feature attribution for clinical decision support and stratification using chirp-based features.

Abstract: This study presents a pipeline leveraging t-Distributed Stochastic Neighbor Embedding (t-SNE) for interpretable visualizations of chirp features across diverse outcome scenarios. The dataset comprises chirp-based temporal, spectral, and frequency metrics. Using t-SNE, local neighborhood relationships were preserved while addressing the crowding problem through Student t-distribution-based similarity optimization. Three classification tasks were formulated on the 2D t-SNE embeddings: (1) distinguishing clinical success from failure/no-resection, (2) separating high-difficulty from low-difficulty cases, and (3) identifying optimal cases, defined as successful outcomes with minimal clinical difficulty. Four classifiers, namely Random Forests, Support Vector Machines, Logistic Regression, and k-Nearest Neighbors, were trained and evaluated using stratified 5-fold cross-validation. Across tasks, the Random Forest and k-NN classifiers demonstrated superior performance, achieving up to 88.8% accuracy in optimal case detection (successful outcomes with minimal clinical difficulty). Additionally, feature influence sensitivity maps were generated using SHAP explanations applied to a model predicting the t-SNE coordinates, revealing spatially localized feature importance within the embedding space. These maps highlighted how specific chirp attributes drive regional clustering and class separation, offering insights into the latent structure of the data. The integrated framework showcases the potential of interpretable embeddings and local feature attribution for clinical stratification and decision support.
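A compact sketch of the embedding-plus-classification pipeline (library defaults, not the study's exact settings; the SHAP sensitivity maps are omitted):

```python
from sklearn.manifold import TSNE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

def embed_and_classify(chirp_features, labels, random_state=0):
    """2D t-SNE embedding of chirp features, then stratified 5-fold CV accuracy."""
    embedding = TSNE(n_components=2, random_state=random_state).fit_transform(chirp_features)
    clf = RandomForestClassifier(random_state=random_state)
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=random_state)
    scores = cross_val_score(clf, embedding, labels, cv=cv, scoring="accuracy")
    return embedding, scores.mean()
```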

[296] DyMixOp: Guiding Neural Operator Design for PDEs from a Complex Dynamics Perspective with Local-Global-Mixing

Pengyu Lai, Yixiao Chen, Hui Xu

Main category: cs.LG

TL;DR: DyMixOp is a novel neural operator framework that transforms infinite-dimensional nonlinear PDE dynamics into finite-dimensional latent space using inertial manifold theory and Local-Global-Mixing transformation, achieving state-of-the-art performance with up to 86.7% error reduction.

DetailsMotivation: Addressing the challenge of approximating nonlinear dynamical systems governed by PDEs with neural networks, particularly dealing with non-linearizable dynamics and infinite-dimensional spaces for linearization.

Method: Uses inertial manifold theory to transform PDE dynamics into finite-dimensional latent space, incorporates Local-Global-Mixing transformation inspired by convection dynamics, and employs dynamics-informed architecture with multiple LGM layers to capture both linear and nonlinear dynamics.

Result: Achieves state-of-the-art performance across diverse PDE benchmarks, significantly reducing prediction errors (up to 86.7% in convection-dominated scenarios) while maintaining computational efficiency and scalability.

Conclusion: DyMixOp provides an effective framework for PDE approximation that maintains essential nonlinear interactions, enhances physical interpretability, and mitigates spectral bias in neural operators.

Abstract: A primary challenge in using neural networks to approximate nonlinear dynamical systems governed by partial differential equations (PDEs) is transforming these systems into a suitable format, especially when dealing with non-linearizable dynamics or the need for infinite-dimensional spaces for linearization. This paper introduces DyMixOp, a novel neural operator framework for PDEs that integrates insights from complex dynamical systems to address this challenge. Grounded in inertial manifold theory, DyMixOp transforms infinite-dimensional nonlinear PDE dynamics into a finite-dimensional latent space, establishing a structured foundation that maintains essential nonlinear interactions and enhances physical interpretability. A key innovation is the Local-Global-Mixing (LGM) transformation, inspired by convection dynamics in turbulence. This transformation effectively captures both fine-scale details and nonlinear interactions, while mitigating spectral bias commonly found in existing neural operators. The framework is further strengthened by a dynamics-informed architecture that connects multiple LGM layers to approximate linear and nonlinear dynamics, reflecting the temporal evolution of dynamical systems. Experimental results across diverse PDE benchmarks demonstrate that DyMixOp achieves state-of-the-art performance, significantly reducing prediction errors (by up to 86.7% in convection-dominated scenarios) while maintaining computational efficiency and scalability.

[297] Uncertainty Tube Visualization of Particle Trajectories

Jixian Li, Timbwaoga Aime Judicael Ouermi, Mengjiao Han, Chris R. Johnson

Main category: cs.LG

TL;DR: This paper introduces the uncertainty tube, a novel visualization method for representing uncertainty in neural network-predicted particle trajectories, addressing the challenge of effectively quantifying and visualizing prediction uncertainty.

DetailsMotivation: Effectively quantifying and visualizing inherent uncertainty in neural network predictions remains challenging, which compromises the reliability of NN models in applications where trustworthiness is paramount, particularly in particle trajectory prediction.

Method: The paper proposes a superelliptical tube visualization method that integrates established uncertainty quantification techniques including Deep Ensembles, Monte Carlo Dropout (MC Dropout), and Stochastic Weight Averaging-Gaussian (SWAG) to capture and convey nonsymmetric uncertainty in particle paths.

Result: The uncertainty tube is demonstrated to be computationally efficient and practically useful, showcasing its application on both synthetic and simulation datasets for accurately representing uncertainty in NN-derived particle trajectories.

Conclusion: The uncertainty tube provides an effective solution for visualizing prediction uncertainty in neural network models, enhancing the reliability and trustworthiness of these models in scientific and engineering applications where uncertainty quantification is critical.

Abstract: Predicting particle trajectories with neural networks (NNs) has substantially enhanced many scientific and engineering domains. However, effectively quantifying and visualizing the inherent uncertainty in predictions remains challenging. Without an understanding of the uncertainty, the reliability of NN models in applications where trustworthiness is paramount is significantly compromised. This paper introduces the uncertainty tube, a novel, computationally efficient visualization method designed to represent this uncertainty in NN-derived particle paths. Our key innovation is the design and implementation of a superelliptical tube that accurately captures and intuitively conveys nonsymmetric uncertainty. By integrating well-established uncertainty quantification techniques, such as Deep Ensembles, Monte Carlo Dropout (MC Dropout), and Stochastic Weight Averaging-Gaussian (SWAG), we demonstrate the practical utility of the uncertainty tube, showcasing its application on both synthetic and simulation datasets.

[298] Explainability of Algorithms

Andrés Páez

Main category: cs.LG

TL;DR: The paper examines two types of AI opacity - technical complexity and intentional concealment - and explores explainable AI methods to address technical opaqueness, while noting ongoing challenges.

DetailsMotivation: To understand the different forms of algorithmic opacity in complex machine learning systems and their ethical implications for AI development.

Method: Theoretical analysis examining two types of opacity: technical complexity (inherent black box nature) and intentional concealment (proprietary reasons). Also explores existing explainable AI (XAI) methods developed to address technical opaqueness.

Result: Identifies distinct ethical implications from each type of opacity. Finds that while XAI methods exist to overcome technical opaqueness, they still face numerous challenges and limitations.

Conclusion: Algorithmic opacity manifests in different forms with distinct ethical consequences. Explainable AI approaches provide partial solutions but significant challenges remain in making complex AI systems truly transparent and accountable.

Abstract: The opaqueness of many complex machine learning algorithms is often mentioned as one of the main obstacles to the ethical development of artificial intelligence (AI). But what does it mean for an algorithm to be opaque? Highly complex algorithms such as artificial neural networks process enormous volumes of data in parallel along multiple hidden layers of interconnected nodes, rendering their inner workings epistemically inaccessible to any human being, including their designers and developers; they are “black boxes” for all their stakeholders. But opaqueness is not always the inevitable result of technical complexity. Sometimes, the way an algorithm works is intentionally hidden from view for proprietary reasons, especially in commercial automated decision systems, creating an entirely different type of opaqueness. In the first part of the chapter, we will examine these two ways of understanding opacity and the ethical implications that stem from each of them. In the second part, we explore the different explanatory methods that have been developed in computer science to overcome an AI system’s technical opaqueness. As the analysis shows, explainable AI (XAI) still faces numerous challenges.

[299] MuFlex: A Scalable, Physics-based Platform for Multi-Building Flexibility Analysis and Coordination

Ziyan Wu, Ivan Korolija, Rui Tang

Main category: cs.LG

TL;DR: MuFlex is an open-source platform for benchmarking multi-building control strategies that enables synchronous EnergyPlus simulations with standardized RL interfaces, demonstrating effective demand flexibility coordination.

DetailsMotivation: Existing building testbeds are limited to single buildings or use simplified models that cannot capture physical intricacies and intermediate variables needed for proper control performance interpretation. Multi-building platforms are scarce and often impose rigid formats.

Method: Developed MuFlex - a scalable open-source platform that enables synchronous information exchange across EnergyPlus building models and adheres to OpenAI Gym interface for modular RL implementation. Tested using Soft Actor-Critic algorithm with fine-tuned hyperparameters on four office buildings.

Result: Aggregating four buildings’ flexibility reduced total peak demand below specified threshold while maintaining indoor environmental quality. The platform successfully demonstrated coordinated demand flexibility.

Conclusion: MuFlex addresses critical gaps in multi-building control benchmarking by providing a scalable, physically-detailed platform with standardized interfaces, enabling effective coordination of building flexibility for grid balancing with renewable integration.

Abstract: With the increasing penetration of renewable generation on the power grid, maintaining system balance requires coordinated demand flexibility from aggregations of buildings. Reinforcement learning (RL) has been widely explored for building controls because of its model-free nature. Open-source simulation testbeds are essential not only for training RL agents but also for fairly benchmarking control strategies. However, most building-sector testbeds target single buildings; multi-building platforms are relatively limited and typically rely on simplified models (e.g., Resistance-Capacitance) or data-driven approaches, which lack the ability to fully capture the physical intricacies and intermediate variables necessary for interpreting control performance. Moreover, these platforms often impose fixed inputs, outputs, and model formats, restricting their applicability as benchmarking tools across diverse control scenarios. To address these gaps, MuFlex, a scalable, open-source platform for benchmarking and testing control strategies for multi-building flexibility coordination, was developed in this study. MuFlex enables synchronous information exchange across EnergyPlus building models and adheres to the latest OpenAI Gym interface, providing a modular, standardized RL implementation. The platform capabilities were demonstrated in a case study coordinating demand flexibility across four office buildings using the Soft Actor-Critic algorithm with carefully fine-tuned hyperparameters. The results show that aggregating the four buildings' flexibility reduced total peak demand below a specified threshold while maintaining indoor environmental quality.

[300] CALYPSO: Forecasting and Analyzing MRSA Infection Patterns with Community and Healthcare Transmission Dynamics

Rituparna Datta, Jiaming Cui, Gregory R. Madden, Anil Vullikanti

Main category: cs.LG

TL;DR: CALYPSO is a hybrid framework combining neural networks with mechanistic epidemic models to forecast MRSA spread across healthcare settings, achieving 4.5% better performance than ML baselines while enabling interpretable predictions and policy analysis.

DetailsMotivation: MRSA is a critical public health threat, but existing forecasting models lack epidemiological interpretability and have limited performance. Mechanistic models are difficult to calibrate and limited in incorporating diverse datasets.

Method: Hybrid framework integrating neural networks with mechanistic metapopulation models, leveraging patient-level insurance claims, commuting data, and healthcare transfer patterns to learn region- and time-specific parameters for MRSA spread.

Result: CALYPSO improves statewide forecasting performance by over 4.5% compared to machine learning baselines, enables accurate forecasts at multiple spatial resolutions, and identifies high-risk regions and cost-effective infection prevention strategies.

Conclusion: The hybrid approach provides both accurate forecasting and epidemiological interpretability, supporting counterfactual analyses of infection control policies and outbreak risks for better public health decision-making.

Abstract: Methicillin-resistant Staphylococcus aureus (MRSA) is a critical public health threat within hospitals as well as long-term care facilities. Better understanding of MRSA risks, evaluation of interventions and forecasting MRSA rates are important public health problems. Existing forecasting models rely on statistical or neural network approaches, which lack epidemiological interpretability, and have limited performance. Mechanistic epidemic models are difficult to calibrate and limited in incorporating diverse datasets. We present CALYPSO, a hybrid framework that integrates neural networks with mechanistic metapopulation models to capture the spread dynamics of infectious diseases (i.e., MRSA) across healthcare and community settings. Our model leverages patient-level insurance claims, commuting data, and healthcare transfer patterns to learn region- and time-specific parameters governing MRSA spread. This enables accurate, interpretable forecasts at multiple spatial resolutions (county, healthcare facility, region, state) and supports counterfactual analyses of infection control policies and outbreak risks. We also show that CALYPSO improves statewide forecasting performance by over 4.5% compared to machine learning baselines, while also identifying high-risk regions and cost-effective strategies for allocating infection prevention resources.

[301] Collapsing ROC approach for risk prediction research on both common and rare variants

Changshuai Wei, Qing Lu

Main category: cs.LG

TL;DR: Proposes CROC method for genetic risk prediction combining common and rare variants, showing improved accuracy over common-variant-only approaches

DetailsMotivation: Current genetic risk prediction models using only common variants lack sufficient accuracy for clinical use, while rare variants remain understudied

Method: Developed collapsing receiver operating characteristic (CROC) approach as extension of forward ROC (FROC) method, with procedures for handling rare variants. Evaluated using 533 SNPs from 37 genes in Genetic Analysis Workshop 17 data

Result: Prediction model using all SNPs (common + rare) achieved AUC=0.605 vs common variants alone AUC=0.585. CROC outperformed FROC especially when common variants decreased, with AUC=0.603 vs 0.524 in rare-variant-only scenario

Conclusion: Comprehensive risk prediction strategy incorporating both common and rare variants improves accuracy, with CROC method particularly effective for rare variant analysis

Abstract: Risk prediction that capitalizes on emerging genetic findings holds great promise for improving public health and clinical care. However, recent risk prediction research has shown that predictive tests formed on existing common genetic loci, including those from genome-wide association studies, have lacked sufficient accuracy for clinical use. Because most rare variants on the genome have not yet been studied for their role in risk prediction, future disease prediction discoveries should shift toward a more comprehensive risk prediction strategy that takes into account both common and rare variants. We propose a collapsing receiver operating characteristic (CROC) approach for risk prediction research on both common and rare variants. The new approach is an extension of a previously developed forward ROC (FROC) approach, with additional procedures for handling rare variants. The approach was evaluated through the use of 533 single-nucleotide polymorphisms (SNPs) in 37 candidate genes from the Genetic Analysis Workshop 17 mini-exome data set. We found that a prediction model built on all SNPs gained more accuracy (AUC = 0.605) than one built on common variants alone (AUC = 0.585). We further evaluated the performance of the two approaches by gradually reducing the number of common variants in the analysis. We found that the CROC method attained more accuracy than the FROC method when the number of common variants in the data decreased. In an extreme scenario, when there are only rare variants in the data, the CROC reached an AUC value of 0.603, whereas the FROC had an AUC value of 0.524.
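For context, a hedged sketch of the rare-variant collapsing idea that approaches like CROC build on (the exact collapsing and forward-selection procedure in the paper is not reproduced): rare SNPs are combined into a single carrier indicator before ROC-based model construction.

```python
import numpy as np

def collapse_rare_variants(genotypes, maf_threshold=0.01):
    """Collapse rare SNPs into one carrier indicator; keep common SNPs as-is.

    genotypes : (n_subjects, n_snps) minor-allele counts in {0, 1, 2}
    Returns a matrix with common SNP columns plus one collapsed rare-variant column.
    """
    maf = genotypes.mean(axis=0) / 2.0
    maf = np.minimum(maf, 1.0 - maf)                  # fold to minor-allele frequency
    rare = maf < maf_threshold
    common_part = genotypes[:, ~rare]
    carrier = (genotypes[:, rare].sum(axis=1) > 0).astype(float)[:, None]
    return np.hstack([common_part, carrier])
```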

[302] Prediction of Hospital Associated Infections During Continuous Hospital Stays

Rituparna Datta, Methun Kamruzzaman, Eili Y. Klein, Gregory R Madden, Xinwei Deng, Anil Vullikanti, Parantapa Bhattacharya

Main category: cs.LG

TL;DR: A novel generative probabilistic model called GenHAI is presented for modeling MRSA test result sequences during hospitalizations to help mitigate MRSA infection risks.

DetailsMotivation: MRSA is a serious antimicrobial resistance threat with high risks for hospitalized patients due to co-morbid conditions, immunosuppression, antibiotic use, and contact with contaminated hospital environments.

Method: Developed GenHAI, a generative probabilistic model based on probabilistic programming paradigm that can model sequences of MRSA test results and answer predictive, causal, and counterfactual questions.

Result: The model demonstrated efficacy when compared against discriminative and generative machine learning models using two real-world datasets.

Conclusion: GenHAI provides hospital administrators with a powerful tool for understanding and mitigating MRSA infection risks through various analytical capabilities including predictive, causal, and counterfactual analysis.

Abstract: The US Centers for Disease Control and Prevention (CDC), in 2019, designated Methicillin-resistant Staphylococcus aureus (MRSA) as a serious antimicrobial resistance threat. The risk of acquiring MRSA and suffering life-threatening consequences due to it remains especially high for hospitalized patients due to a unique combination of factors, including: co-morbid conditions, immunosuppression, antibiotic use, and risk of contact with contaminated hospital workers and equipment. In this paper, we present a novel generative probabilistic model, GenHAI, for modeling sequences of MRSA test results for patients during a single hospitalization. This model can be used to answer many important questions from the perspectives of hospital administrators for mitigating the risk of MRSA infections. Our model is based on the probabilistic programming paradigm, and can be used to approximately answer a variety of predictive, causal, and counterfactual questions. We demonstrate the efficacy of our model by comparing it against discriminative and generative machine learning models using two real-world datasets.

[303] Input Time Scaling

Rapheal Huang, Weilong Guo

Main category: cs.LG

TL;DR: Introduces Input Time Scaling paradigm that refines queries using meta-knowledge from LLMs, challenging the “garbage in, garbage out” principle by showing that seemingly low-quality data can achieve high performance when properly processed during both training and testing phases.

DetailsMotivation: To complement existing scaling methods (data scaling and inference scaling) by focusing on query refinement at input time, and to challenge conventional wisdom about data quality requirements for LLM performance.

Method: Proposes Input Time Scaling that combines meta-knowledge from LLMs to refine inputs with different strategies during both training and testing phases, requiring co-design approach.

Result: Achieves SOTA performance on AIME24 (76.7%) and AIME25 (76.7%) with 32B models, and further improves to 80% on AIME25 with majority voting. Contradicts the “garbage in, garbage out” principle: adding irrelevant information and using minimally filtered datasets can perform best.

Conclusion: Input Time Scaling is an effective paradigm that challenges traditional data quality assumptions. Training-testing co-design is crucial, and less data with proper query strategies can outperform larger datasets. Findings align with “Less is More” phenomenon and suggest rethinking dataset curation practices.

Abstract: Current Large Language Models (LLMs) are usually post-trained on large-scale carefully curated datasets (data & training scaling) and perform reasoning at test time (inference time scaling). In this work, we present a new scaling paradigm, Input Time Scaling, which complements previous scaling methods by putting resources on queries (input time). During training and testing, we combine meta-knowledge from LLMs to refine inputs with different strategies. We also identify a new phenomenon, training-testing co-design: query strategies must be applied during both training and testing, and applying them only during training or only during testing seriously degrades performance. We are also surprised to find that seemingly low-quality datasets can yield high performance. Adding irrelevant information to the queries and randomly selecting examples from a minimally filtered dataset can even perform best. These findings contradict the widely held inductive bias, “garbage in, garbage out”. Curating datasets with seemingly high-quality data can even potentially limit the performance ceiling. In addition, models trained on more data of similar quality (15k vs. 1k) perform worse, so simple dataset-size scaling should also be inspected carefully. The good news is that our findings are compatible with the Less is More phenomenon: a small set of examples is enough to evoke high-level reasoning ability. With experiments on models trained on Qwen2.5-32B-Instruct, we reach SOTA performance among 32B models on AIME24 (76.7%) and AIME25 (76.7%) pass@1. We can further achieve AIME24 (76.7%) and AIME25 (80%) with a majority vote of three models. Starting from DeepSeek-R1-Distill-Qwen-32B, the best result would be 86.7% on AIME24 and 76.7% on AIME25. To facilitate reproducibility and further research, we are working on open-sourcing our datasets, data pipelines, evaluation results, and checkpoints.

[304] A Generalized Learning Framework for Self-Supervised Contrastive Learning

Lingyu Si, Jingyao Wang, Wenwen Qiang

Main category: cs.LG

TL;DR: The paper proposes a Generalized Learning Framework (GLF) that unifies self-supervised contrastive learning methods and introduces Adaptive Distribution Calibration (ADC) to improve intra-class compactness and inter-class separability without labels.

DetailsMotivation: To generalize and unify existing self-supervised contrastive learning methods (BYOL, Barlow Twins, SwAV) under a common framework and address the challenge of designing effective constraints without labeled data.

Method: Proposes GLF with aligning and constraining parts, analyzes existing methods under this framework, and introduces ADC - a plug-and-play method that captures dynamic relationships between samples to ensure proper feature space geometry.

Result: Theoretical analysis and empirical evaluation demonstrate ADC’s superiority in achieving better intra-class compactness and inter-class separability compared to existing methods.

Conclusion: The GLF framework successfully unifies existing SSCL methods, and ADC provides an effective solution for designing constraining parts that preserve class information in self-supervised learning without labels.

Abstract: Self-supervised contrastive learning (SSCL) has recently demonstrated superiority in multiple downstream tasks. In this paper, we generalize the standard SSCL methods to a Generalized Learning Framework (GLF) consisting of two parts: the aligning part and the constraining part. We analyze three existing SSCL methods: BYOL, Barlow Twins, and SwAV, and show that they can be unified under GLF with different choices of the constraining part. We further propose empirical and theoretical analyses providing two insights into designing the constraining part of GLF: intra-class compactness and inter-class separability, which measure how well the feature space preserves the class information of the inputs. However, since SSCL can not use labels, it is challenging to design a constraining part that satisfies these properties. To address this issue, we consider inducing intra-class compactness and inter-class separability by iteratively capturing the dynamic relationship between anchor and other samples and propose a plug-and-play method called Adaptive Distribution Calibration (ADC) to ensure that samples that are near or far from the anchor point in the original input space are closer or further away from the anchor point in the feature space. Both the theoretical analysis and the empirical evaluation demonstrate the superiority of ADC.
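
A toy sketch of the aligning-plus-constraining decomposition described above (not the proposed ADC method): the aligning term pulls two views of each sample together, and a Barlow-Twins-style decorrelation term serves as one possible constraining part. Batch size, dimensionality, and the trade-off weight are arbitrary assumptions.

```python
# Toy sketch of a Generalized-Learning-Framework-style loss:
# total = aligning part + lambda * constraining part.
# This illustrates the decomposition only; it is not the ADC method.
import numpy as np

rng = np.random.default_rng(0)
B, D = 256, 64                            # assumed batch size and embedding dim
z1 = rng.normal(size=(B, D))              # embeddings of view 1 (stand-in encoder)
z2 = z1 + 0.1 * rng.normal(size=(B, D))   # embeddings of view 2

# Aligning part: bring the two views of each sample close together.
align = np.mean(np.sum((z1 - z2) ** 2, axis=1))

# One possible constraining part (Barlow-Twins-like): push the cross-correlation
# matrix of the normalized embeddings toward the identity to avoid collapse.
z1n = (z1 - z1.mean(0)) / z1.std(0)
z2n = (z2 - z2.mean(0)) / z2.std(0)
C = (z1n.T @ z2n) / B
constrain = np.sum((np.diag(C) - 1) ** 2) + 0.005 * np.sum((C - np.diag(np.diag(C))) ** 2)

lam = 1.0                                 # assumed trade-off weight
loss = align + lam * constrain
print(f"aligning={align:.3f}  constraining={constrain:.3f}  total={loss:.3f}")
```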

[305] Approximate Bayesian Inference via Bitstring Representations

Aleksanteri Sladek, Martin Trapp, Arno Solin

Main category: cs.LG

TL;DR: Probabilistic inference in quantized parameter spaces using discrete approximations for scalable and interpretable machine learning.

DetailsMotivation: The machine learning community needs scalable solutions for large models through quantized/low-precision arithmetics, while maintaining probabilistic inference capabilities and model interpretability.

Method: Proposes performing probabilistic inference in quantized discrete parameter spaces using probabilistic circuits for tractable learning, applied to both 2D densities and quantized neural networks.

Result: Validated with various models showing inference efficiency without accuracy loss, providing clear insights into model behavior through discrete approximations.

Conclusion: Advances scalable and interpretable machine learning by enabling continuous distribution learning using discrete parameters through probabilistic computations in quantized spaces.

Abstract: The machine learning community has recently put effort into quantized or low-precision arithmetics to scale large models. This paper proposes performing probabilistic inference in the quantized, discrete parameter space created by these representations, effectively enabling us to learn a continuous distribution using discrete parameters. We consider both 2D densities and quantized neural networks, where we introduce a tractable learning approach using probabilistic circuits. This method offers a scalable solution to manage complex distributions and provides clear insights into model behavior. We validate our approach with various models, demonstrating inference efficiency without sacrificing accuracy. This work advances scalable, interpretable machine learning by utilizing discrete approximations for probabilistic computations.
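
A minimal sketch of inference over a quantized parameter space (it does not use the paper's probabilistic circuits): the slope of a 1D linear model is restricted to the values representable by a small signed bitstring, and the posterior is computed exactly by enumerating that grid. The bit width, prior, and noise level are assumptions.

```python
# Sketch: exact Bayesian inference over a quantized (bitstring) parameter grid.
# Not the paper's probabilistic-circuit approach; it only illustrates the idea
# of treating the discrete set of representable values as the parameter space.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = 0.7 * x + 0.3 * rng.normal(size=50)       # data from a true slope of 0.7

bits = 4                                       # assumed quantization: 4-bit signed grid
levels = np.arange(-2 ** (bits - 1), 2 ** (bits - 1)) / (2 ** (bits - 2))  # -2.0 .. 1.75

log_prior = np.zeros_like(levels)              # uniform prior over the grid
log_lik = np.array([
    -0.5 * np.sum((y - w * x) ** 2) / 0.3 ** 2 for w in levels
])
log_post = log_prior + log_lik
post = np.exp(log_post - log_post.max())
post /= post.sum()

for w, p in zip(levels, post):
    if p > 1e-3:
        print(f"w = {w:+.2f}   posterior mass = {p:.3f}")
```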

[306] Bounding Causal Effects and Counterfactuals

Tobias Maringgele

Main category: cs.LG

TL;DR: Systematic comparison of partial identification methods for causal inference, including new extensions and practical guidance for practitioners.

DetailsMotivation: Partial identification provides principled causal bounds without strong assumptions, but remains underutilized due to fragmented methods and lack of practical guidance.

Method: Implemented, extended, and unified state-of-the-art bounding algorithms (symbolic, optimization-based, information-theoretic) within common framework. Extended entropy-bounded method for counterfactual queries. Conducted thousands of randomized simulations with discrete/continuous data.

Result: Comprehensive evaluation of methods on bound tightness, computational efficiency, and robustness. Developed practical decision tree for algorithm selection and ML model to predict best method based on data characteristics.

Conclusion: Provides unified framework and open-source package (CausalBoundingEngine) to make partial identification methods more accessible and practical for applied causal inference work.

Abstract: Causal inference often hinges on strong assumptions - such as no unmeasured confounding or perfect compliance - that are rarely satisfied in practice. Partial identification offers a principled alternative: instead of relying on unverifiable assumptions to estimate causal effects precisely, it derives bounds that reflect the uncertainty inherent in the data. Despite its theoretical appeal, partial identification remains underutilized in applied work, in part due to the fragmented nature of existing methods and the lack of practical guidance. This thesis addresses these challenges by systematically comparing a diverse set of bounding algorithms across multiple causal scenarios. We implement, extend, and unify state-of-the-art methods - including symbolic, optimization-based, and information-theoretic approaches - within a common evaluation framework. In particular, we propose an extension of a recently introduced entropy-bounded method, making it applicable to counterfactual queries such as the Probability of Necessity and Sufficiency (PNS). Our empirical study spans thousands of randomized simulations involving both discrete and continuous data-generating processes. We assess each method in terms of bound tightness, computational efficiency, and robustness to assumption violations. To support practitioners, we distill our findings into a practical decision tree for algorithm selection and train a machine learning model to predict the best-performing method based on observable data characteristics. All implementations are released as part of an open-source Python package, CausalBoundingEngine, which enables users to apply and compare bounding methods through a unified interface.
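
As a concrete, classical instance of partial identification (the simplest case, not the thesis's symbolic, optimization-based, or entropy-bounded algorithms), the sketch below computes the no-assumptions Manski bounds on the average treatment effect for a binary outcome from purely observational data.

```python
# Sketch: no-assumptions (Manski) bounds on the ATE for a binary outcome.
# This is the textbook special case of partial identification, not the
# CausalBoundingEngine algorithms themselves.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
t = rng.binomial(1, 0.4, n)                    # observed treatment
y = rng.binomial(1, 0.3 + 0.2 * t)             # observed binary outcome

p_t1 = t.mean()
p_t0 = 1 - p_t1
ey_t1 = y[t == 1].mean()                       # E[Y | T=1]
ey_t0 = y[t == 0].mean()                       # E[Y | T=0]

# E[Y(1)] is bounded by filling the unobserved arm with 0 or 1; same for E[Y(0)].
e_y1_lo, e_y1_hi = ey_t1 * p_t1 + 0 * p_t0, ey_t1 * p_t1 + 1 * p_t0
e_y0_lo, e_y0_hi = ey_t0 * p_t0 + 0 * p_t1, ey_t0 * p_t0 + 1 * p_t1

ate_lo, ate_hi = e_y1_lo - e_y0_hi, e_y1_hi - e_y0_lo
print(f"ATE bounds without assumptions: [{ate_lo:.3f}, {ate_hi:.3f}]  (width always 1)")
```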

[307] Towards a Larger Model via One-Shot Federated Learning on Heterogeneous Client Models

Wenxuan Ye, Xueli An, Onur Ayan, Junfan Wang, Xueqiang Yan, Georg Carle

Main category: cs.LG

TL;DR: FedOL enables one-shot federated learning through knowledge distillation using public dataset predictions instead of model parameter sharing, reducing communication overhead and supporting heterogeneous client models while addressing data bias through iterative pseudo-label refinement.

DetailsMotivation: Traditional federated learning requires uniform model architectures and multiple communication rounds, which neglects resource heterogeneity, imposes heavy computational demands on clients, and increases communication overhead - particularly problematic for mobile networks with limited client resources.

Method: FedOL uses knowledge distillation where clients exchange model prediction outputs on an unlabeled public dataset instead of sharing model parameters. It employs a specialized objective function that iteratively refines pseudo-labels and the server model, along with a tailored pseudo-label generation and knowledge distillation strategy to integrate diverse knowledge effectively.

Result: Simulation results show that FedOL significantly outperforms existing baselines, offering a cost-effective solution for mobile networks where clients have valuable private data but limited computational resources.

Conclusion: FedOL provides an efficient one-shot federated learning approach that reduces communication overhead, supports model heterogeneity, and effectively handles biased client predictions through innovative pseudo-label refinement and knowledge distillation techniques.

Abstract: Large models, renowned for superior performance, outperform smaller ones even without billion-parameter scales. While mobile network servers have ample computational resources to support larger models than client devices, privacy constraints prevent clients from directly sharing their raw data. Federated Learning (FL) enables decentralized clients to collaboratively train a shared model by exchanging model parameters instead of transmitting raw data. Yet, it requires a uniform model architecture and multiple communication rounds, which neglect resource heterogeneity, impose heavy computational demands on clients, and increase communication overhead. To address these challenges, we propose FedOL, to construct a larger and more comprehensive server model in one-shot settings (i.e., in a single communication round). Instead of model parameter sharing, FedOL employs knowledge distillation, where clients only exchange model prediction outputs on an unlabeled public dataset. This reduces communication overhead by transmitting compact predictions instead of full model weights and enables model customization by allowing heterogeneous model architectures. A key challenge in this setting is that client predictions may be biased due to skewed local data distributions, and the lack of ground-truth labels in the public dataset further complicates reliable learning. To mitigate these issues, FedOL introduces a specialized objective function that iteratively refines pseudo-labels and the server model, improving learning reliability. To complement this, FedOL incorporates a tailored pseudo-label generation and knowledge distillation strategy that effectively integrates diverse knowledge. Simulation results show that FedOL significantly outperforms existing baselines, offering a cost-effective solution for mobile networks where clients possess valuable private data but limited computational resources.
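
The one-shot, prediction-sharing pattern can be sketched with off-the-shelf pieces: heterogeneous clients send soft predictions on an unlabeled public set, and the server averages them into pseudo-labels and distills them into its own (larger) model. This omits FedOL's iterative pseudo-label refinement, so the dataset, client models, and plain-averaging aggregation below are illustrative assumptions.

```python
# Sketch of one-shot federated distillation via shared predictions on a public
# unlabeled set. Simplified: plain averaging of client soft labels; FedOL's
# iterative pseudo-label refinement is not reproduced here.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=3000, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)
X_pub, X_priv, y_priv = X[:1000], X[1000:], y[1000:]       # public set is unlabeled

# Heterogeneous client models trained on disjoint private shards.
clients = [LogisticRegression(max_iter=500), DecisionTreeClassifier(max_depth=6),
           MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)]
shards = np.array_split(np.arange(len(X_priv)), len(clients))
for model, idx in zip(clients, shards):
    model.fit(X_priv[idx], y_priv[idx])

# Single communication round: clients upload predictions on the public set.
soft = np.mean([m.predict_proba(X_pub) for m in clients], axis=0)
pseudo = soft.argmax(axis=1)

# Server distills the aggregated knowledge into a (potentially larger) model.
server = MLPClassifier(hidden_layer_sizes=(128, 64), max_iter=500, random_state=0)
server.fit(X_pub, pseudo)
print("server accuracy on private data:", round(server.score(X_priv, y_priv), 3))
```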

[308] Text2Weight: Bridging Natural Language and Neural Network Weight Spaces

Bowen Tian, Wenshuo Chen, Zexi Li, Songning Lai, Jiemin Wu, Yutao Yue

Main category: cs.LG

TL;DR: T2W is a diffusion transformer framework that generates neural network weights from natural language descriptions, enabling task-specific weight generation with strong generalization to unseen tasks.

DetailsMotivation: Current neural network weight generation approaches struggle with generalization to unseen tasks and lack practical application exploration, limiting their real-world utility.

Method: Hierarchically processes network parameters into uniform blocks, integrates CLIP text embeddings via prior attention mechanism, and uses adversarial training with weight-space augmentation for enhanced generalization.

Result: Outperforms optimization-based initialization on Cifar100, Caltech256, and TinyImageNet, demonstrating ability to produce high-quality weights for unseen tasks and enabling novel applications like weight enhancement and text-guided model fusion.

Conclusion: Bridges textual semantics with weight-space dynamics, advances practicality of generative models in neural network parameter synthesis, and provides open-source dataset of text-weight pairs.

Abstract: How far are we really from automatically generating neural networks? While neural network weight generation shows promise, current approaches struggle with generalization to unseen tasks and practical application exploration. To address this, we propose T2W, a diffusion transformer framework that generates task-specific weights conditioned on natural language descriptions. T2W hierarchically processes network parameters into uniform blocks, integrates text embeddings from CLIP via a prior attention mechanism, and employs adversarial training with weight-space augmentation to enhance generalization. Experiments on Cifar100, Caltech256, and TinyImageNet demonstrate T2W’s ability to produce high-quality weights for unseen tasks, outperforming optimization-based initialization and enabling novel applications such as weight enhancement and text-guided model fusion. Our work bridges textual semantics with weight-space dynamics, supported by an open-source dataset of text-weight pairs, advancing the practicality of generative models in neural network parameter synthesis. Our code is available on Github.

[309] Explainable Learning Rate Regimes for Stochastic Optimization

Zhuang Yang

Main category: cs.LG

TL;DR: Automatic learning rate adjustment based on stochastic gradient variations without manual hyperparameter tuning

DetailsMotivation: Existing learning rate regimes are complex, require manual hyperparameter tuning, and incur high computational costs in practice

Method: Developed an explainable learning rate regime using stochastic second-order algorithms that automatically adjusts the LR based on the norm of the stochastic gradients: the LR increases when the gradient norm decreases and decreases when it increases

Result: The proposed LR regime shows efficiency, robustness, and scalability across different stochastic algorithms (SGD, SGDM, SIGNSGD) on machine learning tasks

Conclusion: The method provides a natural, direct, and parameter-free approach to automatic learning rate adjustment that performs comparably to heuristic algorithms without requiring manual tuning

Abstract: Modern machine learning models are trained by stochastic gradient descent (SGD), whose performance critically depends on how the learning rate (LR) is adjusted and decreased over time. Yet existing LR regimes can be intricate or require manually tuning one or more additional hyper-parameters, which incurs substantial computation, time, and power in practice. This work clarifies, in a natural and direct manner, how the LR should be updated automatically based only on the intrinsic variation of the stochastic gradients. An explainable LR regime is developed by leveraging stochastic second-order algorithms; it behaves similarly to heuristic algorithms but is implemented simply, without any parameter-tuning requirement, following an automatic procedure in which the LR increases (decreases) as the norm of the stochastic gradients decreases (increases). The resulting LR regime shows its efficiency, robustness, and scalability with different classical stochastic algorithms, including SGD, SGDM, and SIGNSGD, on machine learning tasks.
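
The core rule, a learning rate that moves inversely with the stochastic gradient norm, can be sketched in a few lines. The toy quadratic, the noise level, and the scaling constant below are assumptions, and the paper's second-order derivation is not reproduced.

```python
# Sketch of a parameter-free-style LR rule: the step size is scaled by the
# inverse of the current stochastic gradient norm, so the LR rises as the
# gradient norm shrinks and falls as it grows. Toy quadratic objective.
import numpy as np

rng = np.random.default_rng(0)
A = np.diag([1.0, 10.0])                 # assumed ill-conditioned quadratic 0.5 * w^T A w
w = np.array([5.0, 5.0])
c = 0.2                                  # assumed scaling constant

for step in range(1, 51):
    g = A @ w + 0.1 * rng.normal(size=2)      # stochastic gradient
    lr = c / (np.linalg.norm(g) + 1e-8)       # LR inversely proportional to ||g||
    w -= lr * g
    if step % 10 == 0:
        print(f"step {step:3d}  ||g|| = {np.linalg.norm(g):7.3f}  lr = {lr:6.3f}  "
              f"loss = {0.5 * w @ A @ w:8.4f}")
```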

[310] Personalized Subgraph Federated Learning with Sheaf Collaboration

Wenfei Liang, Yanan Zhao, Rui She, Yiming Li, Wee Peng Tay

Main category: cs.LG

TL;DR: FedSheafHN is a novel personalized subgraph federated learning framework that uses sheaf collaboration and hypernetworks to address client heterogeneity and improve model performance across diverse local subgraphs.

DetailsMotivation: Performance variation across clients in personalized subgraph FL due to heterogeneous local data distributions remains a key challenge that needs to be addressed.

Method: Embeds client local subgraphs into a server-constructed collaboration graph using graph-level embeddings, employs sheaf diffusion to enrich client representations, and generates customized models via a server-optimized hypernetwork.

Result: Outperforms existing personalized subgraph FL methods on various graph datasets, exhibits fast model convergence, and effectively generalizes to new clients.

Conclusion: FedSheafHN successfully addresses client heterogeneity in subgraph FL through sheaf collaboration and hypernetwork-based personalization, demonstrating superior performance and generalization capabilities.

Abstract: Graph-structured data is prevalent in many applications. In subgraph federated learning (FL), this data is distributed across clients, each with a local subgraph. Personalized subgraph FL aims to develop a customized model for each client to handle diverse data distributions. However, performance variation across clients remains a key issue due to the heterogeneity of local subgraphs. To overcome the challenge, we propose FedSheafHN, a novel framework built on a sheaf collaboration mechanism to unify enhanced client descriptors with efficient personalized model generation. Specifically, FedSheafHN embeds each client’s local subgraph into a server-constructed collaboration graph by leveraging graph-level embeddings and employing sheaf diffusion within the collaboration graph to enrich client representations. Subsequently, FedSheafHN generates customized client models via a server-optimized hypernetwork. Empirical evaluations demonstrate that FedSheafHN outperforms existing personalized subgraph FL methods on various graph datasets. Additionally, it exhibits fast model convergence and effectively generalizes to new clients.

[311] GRAFT: Gradient-Aware Fast MaxVol Technique for Dynamic Data Sampling

Ashish Jha, Anh Huy Phan, Razan Dibo, Valentin Leplat

Main category: cs.LG

TL;DR: GRAFT is an efficient in-training subset selection method that reduces computational costs and environmental impact by selecting diverse, representative examples from low-rank feature subspaces instead of training on full batches.

DetailsMotivation: Training modern neural networks on large datasets is computationally intensive and environmentally costly due to high energy consumption and CO2 emissions. There's a need for methods that can reduce these costs while maintaining training effectiveness.

Method: GRAFT uses a three-step approach: (1) extracts low-rank feature representations for each batch, (2) applies Fast MaxVol sampler to select a small diverse subset spanning the batch’s dominant subspace, and (3) dynamically adjusts subset size using gradient-approximation criterion.

Result: GRAFT matches or exceeds recent selection baselines in both accuracy and efficiency across multiple benchmarks, while significantly reducing wall-clock time, energy consumption, and CO2 emissions.

Conclusion: GRAFT provides a favorable trade-off between accuracy, efficiency, and environmental impact, offering a scalable solution for reducing the computational and environmental costs of neural network training.

Abstract: Training modern neural networks on large datasets is computationally and environmentally costly. We introduce GRAFT, a scalable in-training subset selection method that (i) extracts a low-rank feature representation for each batch, (ii) applies a Fast MaxVol sampler to select a small, diverse subset that spans the batch’s dominant subspace, and (iii) dynamically adjusts the subset size using a gradient-approximation criterion. By operating in low-rank subspaces and training on carefully chosen examples instead of full batches, GRAFT preserves the training trajectory while reducing wall-clock time, energy consumption, and $\mathrm{CO}_2$ emissions. Across multiple benchmarks, GRAFT matches or exceeds recent selection baselines in both accuracy and efficiency, providing a favorable trade-off between accuracy, efficiency, and emissions.
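
A rough sketch of the selection step using standard tools: a truncated SVD provides a low-rank representation of the batch, and QR with column pivoting (used here as a simple stand-in for a MaxVol-style volume maximizer) picks rows that span the dominant subspace. The rank, subset size, and pivoted-QR substitute are assumptions, not the paper's Fast MaxVol or its gradient-based size adjustment.

```python
# Sketch: pick a small, well-spread subset of a batch by (1) projecting features
# to a low-rank subspace and (2) selecting rows via QR with column pivoting,
# used here as a simple stand-in for a MaxVol-style sampler.
import numpy as np
from scipy.linalg import qr

rng = np.random.default_rng(0)
batch = rng.normal(size=(512, 256))          # assumed per-batch feature matrix
rank, subset_size = 32, 16                   # assumed low rank and subset size

# Low-rank representation of the batch (top singular directions).
U, s, Vt = np.linalg.svd(batch, full_matrices=False)
Z = U[:, :rank] * s[:rank]                   # (512, 32) low-rank row embeddings

# Column-pivoted QR on Z^T ranks rows by how much new "volume" they add.
_, _, piv = qr(Z.T, pivoting=True)
selected = piv[:subset_size]                 # indices of the chosen examples

print("selected example indices:", np.sort(selected))
```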

[312] In-Context Decision Making for Optimizing Complex AutoML Pipelines

Amir Rezaei Balef, Katharina Eggensperger

Main category: cs.LG

TL;DR: Extends CASH framework to modern ML pipelines with pre-trained models, proposes PS-PFN using posterior sampling and prior-data fitted networks for efficient pipeline adaptation.

DetailsMotivation: Traditional AutoML systems focus on algorithm selection and hyperparameter optimization (CASH), but modern ML workflows require fine-tuning, ensembling, and adaptation techniques for pre-trained models, demanding new AutoML approaches.

Method: Proposes PS-PFN that extends Posterior Sampling to max k-armed bandit problem, leveraging prior-data fitted networks (PFNs) for efficient posterior distribution estimation via in-context learning. Handles varying costs and individual reward distributions per arm.

Result: Experimental results on one novel and two existing benchmark tasks show superior performance compared to other bandit and AutoML strategies.

Conclusion: PS-PFN successfully extends the CASH framework to handle modern heterogeneous ML pipelines, providing an effective approach for selecting and adapting pre-trained models and complex ML workflows.

Abstract: Combined Algorithm Selection and Hyperparameter Optimization (CASH) has been fundamental to traditional AutoML systems. However, with the advancements of pre-trained models, modern ML workflows go beyond hyperparameter optimization and often require fine-tuning, ensembling, and other adaptation techniques. While the core challenge of identifying the best-performing model for a downstream task remains, the increasing heterogeneity of ML pipelines demands novel AutoML approaches. This work extends the CASH framework to select and adapt modern ML pipelines. We propose PS-PFN to efficiently explore and exploit adapting ML pipelines by extending Posterior Sampling (PS) to the max k-armed bandit problem setup. PS-PFN leverages prior-data fitted networks (PFNs) to efficiently estimate the posterior distribution of the maximal value via in-context learning. We show how to extend this method to consider varying costs of pulling arms and to use different PFNs to model reward distributions individually per arm. Experimental results on one novel and two existing standard benchmark tasks demonstrate the superior performance of PS-PFN compared to other bandit and AutoML strategies. We make our code and data available at https://github.com/amirbalef/CASHPlus.

[313] Heavy-tailed Linear Bandits: Adversarial Robustness, Best-of-both-worlds, and Beyond

Canzhe Zhao, Shinji Ito, Shuai Li

Main category: cs.LG

TL;DR: First framework for adversarial heavy-tailed bandits using FTRL with bonus functions, achieving best-of-both-worlds results for MABs and linear bandits with finite arms, plus new HT-SPM learning rate.

DetailsMotivation: Prior work focused almost exclusively on stochastic heavy-tailed bandits, with few exceptions limited to MABs. There was a gap in adversarial heavy-tailed bandit frameworks, particularly for linear settings.

Method: Follow-the-regularized-leader (FTRL) over loss estimates shifted by a bonus function. Heavy-tailed noise aware stability-penalty matching (HT-SPM) learning rate. Variance-reduced linear loss estimator for linear bandits.

Result: First FTRL-type BOBW algorithm for heavy-tailed MABs: O~(T^(1/ε)) adversarial regret and O~(log T) stochastic regret. First algorithm for adversarial heavy-tailed linear bandits: O~(d^(1/2)T^(1/ε)) regret matching stochastic bounds. First BOBW result for heavy-tailed linear bandits.

Conclusion: The framework successfully addresses adversarial heavy-tailed bandits across both MAB and linear settings, achieving state-of-the-art regret bounds and introducing novel techniques like HT-SPM learning rate.

Abstract: Heavy-tailed bandits have been extensively studied since the seminal work of \citet{Bubeck2012BanditsWH}. In particular, heavy-tailed linear bandits, enabling efficient learning with both a large number of arms and heavy-tailed noises, have recently attracted significant attention \citep{ShaoYKL18,XueWWZ20,ZhongHYW21,Wang2025heavy,tajdini2025improved}. However, prior studies focus almost exclusively on stochastic regimes, with few exceptions limited to the special case of heavy-tailed multi-armed bandits (MABs) \citep{Huang0H22,ChengZ024,Chen2024uniINF}. In this work, we propose a general framework for adversarial heavy-tailed bandit problems, which performs follow-the-regularized-leader (FTRL) over the loss estimates shifted by a bonus function. Via a delicate setup of the bonus function, we devise the first FTRL-type best-of-both-worlds (BOBW) algorithm for heavy-tailed MABs, which does not require the truncated non-negativity assumption and achieves an $\widetilde{O}(T^{\frac{1}{\varepsilon}})$ worst-case regret in the adversarial regime as well as an $\widetilde{O}(\log T)$ gap-dependent regret in the stochastic regime. We then extend our framework to the linear case, proposing the first algorithm for adversarial heavy-tailed linear bandits with finite arm sets. This algorithm achieves an $\widetilde{O}(d^{\frac{1}{2}}T^{\frac{1}{\varepsilon}})$ regret, matching the best-known worst-case regret bound in stochastic regimes. Moreover, we propose a general data-dependent learning rate, termed \textit{heavy-tailed noise aware stability-penalty matching} (HT-SPM). We prove that HT-SPM guarantees BOBW regret bounds for general heavy-tailed bandit problems once certain conditions are satisfied. By using HT-SPM and, in particular, a variance-reduced linear loss estimator, we obtain the first BOBW result for heavy-tailed linear bandits.

[314] Minimizing the Weighted Number of Tardy Jobs: Data-Driven Heuristic for Single-Machine Scheduling

Nikolai Antonov, Přemysl Šůcha, Mikoláš Janota, Jan Hůla

Main category: cs.LG

TL;DR: Novel data-driven heuristic for single-machine scheduling that combines ML with problem-specific constraints to minimize tardy job weights, outperforming state-of-the-art methods.

DetailsMotivation: Existing exact algorithms for single-machine scheduling perform poorly on certain problem regions, while data-driven approaches offer scalable performance when tailored to specific datasets.

Method: Developed a machine learning-based scheduling heuristic that incorporates job weight, duration, due date, and deadline constraints to ensure feasible solutions while minimizing total weight of tardy jobs.

Result: Experimental results show significant outperformance over state-of-the-art methods in optimality gap, number of optimal solutions, and adaptability across varied data scenarios.

Conclusion: The approach provides a flexible and practical solution for single-machine scheduling, with detailed ML model selection process offering insights into optimal model choices for this problem domain.

Abstract: Existing research on single-machine scheduling is largely focused on exact algorithms, which perform well on typical instances but can significantly deteriorate on certain regions of the problem space. In contrast, data-driven approaches provide strong and scalable performance when tailored to the structure of specific datasets. Leveraging this idea, we focus on a single-machine scheduling problem where each job is defined by its weight, duration, due date, and deadline, aiming to minimize the total weight of tardy jobs. We introduce a novel data-driven scheduling heuristic that combines machine learning with problem-specific characteristics, ensuring feasible solutions, which is a common challenge for ML-based algorithms. Experimental results demonstrate that our approach significantly outperforms the state-of-the-art in terms of optimality gap, number of optimal solutions, and adaptability across varied data scenarios, highlighting its flexibility for practical applications. In addition, we conduct a systematic exploration of ML models, addressing a common gap in similar studies by offering a detailed model selection process and providing insights into why the chosen model is the best fit.
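
To pin down the objective and the feasibility notion, the sketch below evaluates a priority order for jobs with weights, durations, due dates, and hard deadlines: a deadline violation makes the schedule infeasible, while a due-date violation adds the job's weight to the objective. The scoring rule standing in for the learned model and the toy jobs are assumptions.

```python
# Sketch: evaluating a single-machine schedule where each job has a weight,
# duration, due date, and hard deadline. A placeholder priority score stands in
# for the learned model; the evaluation reports the total weight of tardy jobs
# (due dates missed) or None if any hard deadline is violated. Not the paper's heuristic.
from dataclasses import dataclass

@dataclass
class Job:
    weight: float
    duration: float
    due: float
    deadline: float

jobs = [Job(3, 2, 4, 9), Job(5, 3, 5, 8), Job(1, 1, 2, 10), Job(4, 2, 6, 7)]

def evaluate(order):
    t, tardy_weight = 0.0, 0.0
    for j in order:
        t += j.duration                      # job completes at time t
        if t > j.deadline:
            return None                      # infeasible: hard deadline violated
        if t > j.due:
            tardy_weight += j.weight         # tardy but still feasible
    return tardy_weight

# Placeholder "learned" priority: favor heavy, short, urgent jobs (an assumption).
order = sorted(jobs, key=lambda j: (-j.weight / j.duration, j.due))
print("total weight of tardy jobs:", evaluate(order))
```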

[315] Trans-XFed: An Explainable Federated Learning for Supply Chain Credit Assessment

Jie Shi, Arno P. J. M. Siebes, Siamak Mehrkanoon

Main category: cs.LG

TL;DR: Trans-XFed combines federated learning with explainable AI for supply chain credit assessment, addressing privacy, data imbalance, and interpretability challenges through performance-based client selection and enhanced FedProx with transformer encoder.

DetailsMotivation: Address privacy concerns, information silos, class imbalance, Non-IID data distribution, and lack of model interpretability in supply chain credit assessment systems.

Method: Performance-based client selection strategy (PBCS) for faster convergence, FedProx architecture with homomorphic encryption, transformer encoder for feature insights, and integrated gradient XAI techniques.

Result: Experimental evaluations on real-world datasets show accurate credit assessments compared to baselines while maintaining transparency and privacy.

Conclusion: Trans-XFed effectively addresses multiple challenges in supply chain credit assessment by combining federated learning with explainable AI, delivering both accuracy and interpretability while preserving data privacy.

Abstract: This paper proposes a Trans-XFed architecture that combines federated learning with explainable AI techniques for supply chain credit assessment. The proposed model aims to address several key challenges, including privacy, information silos, class imbalance, non-identically and independently distributed (Non-IID) data, and model interpretability in supply chain credit assessment. We introduce a performance-based client selection strategy (PBCS) to tackle class imbalance and Non-IID problems. This strategy achieves faster convergence by selecting clients with higher local F1 scores. The FedProx architecture, enhanced with homomorphic encryption, is used as the core model, and further incorporates a transformer encoder. The transformer encoder block provides insights into the learned features. Additionally, we employ the integrated gradient explainable AI technique to offer insights into decision-making. We demonstrate the effectiveness of Trans-XFed through experimental evaluations on real-world supply chain datasets. The obtained results show its ability to deliver accurate credit assessments compared to several baselines, while maintaining transparency and privacy.

[316] DREAMS: Preserving both Local and Global Structure in Dimensionality Reduction

Noël Kury, Dmitry Kobak, Sebastian Damrich

Main category: cs.LG

TL;DR: DREAMS is a new dimensionality reduction method that combines local structure preservation (like t-SNE) with global structure preservation (like PCA) through a regularization term, creating a spectrum of embeddings that balance both aspects effectively.

DetailsMotivation: Existing dimensionality reduction methods preserve either local structure (t-SNE, UMAP) or global structure (MDS, PCA), but no established method can represent both aspects well simultaneously.

Method: DREAMS combines t-SNE’s local structure preservation with PCA’s global structure preservation via a simple regularization term, generating a spectrum of embeddings between these two extremes.

Result: Benchmarked across seven real-world datasets (including single-cell transcriptomics and population genetics), DREAMS shows superior qualitative and quantitative performance in preserving structure across multiple scales compared to previous approaches.

Conclusion: DREAMS effectively bridges the gap between local and global structure preservation in dimensionality reduction, offering a versatile solution for visualizing high-dimensional data while maintaining both local neighborhoods and global relationships.

Abstract: Dimensionality reduction techniques are widely used for visualizing high-dimensional data in two dimensions. Existing methods are typically designed to preserve either local (e.g. $t$-SNE, UMAP) or global (e.g. MDS, PCA) structure of the data, but none of the established methods can represent both aspects well. In this paper, we present DREAMS (Dimensionality Reduction Enhanced Across Multiple Scales), a method that combines the local structure preservation of $t$-SNE with the global structure preservation of PCA via a simple regularization term. Our approach generates a spectrum of embeddings between the locally well-structured $t$-SNE embedding and the globally well-structured PCA embedding, efficiently balancing both local and global structure preservation. We benchmark DREAMS across seven real-world datasets, including five from single-cell transcriptomics and one from population genetics, showcasing qualitatively and quantitatively its superior ability to preserve structure across multiple scales compared to previous approaches.

[317] Order Optimal Regret Bounds for Sharpe Ratio Optimization in the Bandit Setting

Mohammad Taha Shah, Sabrina Khurshid, Gourab Ghatak

Main category: cs.LG

TL;DR: Thompson Sampling algorithm for Sharpe ratio maximization in bandits achieves logarithmic regret with optimal performance bounds.

DetailsMotivation: Sharpe ratio optimization introduces risk-return tradeoff unlike conventional bandit objectives, requiring exploration of both mean and variance of rewards.

Method: Proposed SRTS algorithm based on Thompson Sampling with Gaussian rewards assumption, featuring novel regret decomposition for Sharpe ratio.

Result: Established logarithmic regret upper bound with matching lower bound proving order-optimality, outperforms existing algorithms in simulations.

Conclusion: Thompson Sampling is effective for Sharpe ratio maximization with theoretical guarantees and superior empirical performance.

Abstract: In this paper, we investigate the problem of sequential decision-making for Sharpe ratio (SR) maximization in a stochastic bandit setting. We focus on the Thompson Sampling (TS) algorithm, a Bayesian approach celebrated for its empirical performance and exploration efficiency, under the assumption of Gaussian rewards with unknown parameters. Unlike conventional bandit objectives focusing on maximizing cumulative reward, Sharpe ratio optimization instead introduces an inherent tradeoff between achieving high returns and controlling risk, demanding careful exploration of both mean and variance. Our theoretical contributions include a novel regret decomposition specifically designed for the Sharpe ratio, highlighting the role of information acquisition about the reward distribution in driving learning efficiency. Then, we establish fundamental performance limits for the proposed algorithm \texttt{SRTS} in terms of an upper bound on regret. We also derive the matching lower bound and show the order-optimality. Our results show that Thompson Sampling achieves logarithmic regret over time, with distribution-dependent factors capturing the difficulty of distinguishing arms based on risk-adjusted performance. Empirical simulations show that our algorithm significantly outperforms existing algorithms.
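
A minimal sketch of the sampling rule only (not the SRTS regret analysis): each Gaussian arm keeps a Normal-Inverse-Gamma posterior over its mean and variance; each round one (mean, variance) pair is sampled per arm and the arm with the largest sampled Sharpe ratio is pulled. The prior hyperparameters and arm parameters below are assumptions.

```python
# Sketch: Thompson Sampling for Sharpe-ratio maximization with Gaussian arms.
# Each arm gets a Normal-Inverse-Gamma posterior over (mean, variance); the arm
# with the largest sampled mu / sigma is pulled. This illustrates the sampling
# rule only, not the SRTS regret analysis.
import numpy as np

rng = np.random.default_rng(0)
true_means, true_stds = np.array([1.0, 1.2, 0.8]), np.array([1.0, 2.0, 0.5])
# Best Sharpe ratio is arm 2 (0.8 / 0.5 = 1.6) despite its lower mean.

mu0, kappa0, alpha0, beta0 = 0.0, 1.0, 2.0, 2.0   # assumed NIG prior
rewards = [[] for _ in true_means]

def sample_posterior(obs):
    n = len(obs)
    if n == 0:
        m, k, a, b = mu0, kappa0, alpha0, beta0
    else:
        xbar, ss = np.mean(obs), np.sum((np.array(obs) - np.mean(obs)) ** 2)
        k = kappa0 + n
        m = (kappa0 * mu0 + n * xbar) / k
        a = alpha0 + n / 2
        b = beta0 + 0.5 * ss + kappa0 * n * (xbar - mu0) ** 2 / (2 * k)
    sigma2 = 1.0 / rng.gamma(a, 1.0 / b)          # Inverse-Gamma draw for the variance
    mu = rng.normal(m, np.sqrt(sigma2 / k))
    return mu, np.sqrt(sigma2)

counts = np.zeros(3, dtype=int)
for t in range(2000):
    sharpe = [np.divide(*sample_posterior(obs)) for obs in rewards]
    arm = int(np.argmax(sharpe))
    rewards[arm].append(rng.normal(true_means[arm], true_stds[arm]))
    counts[arm] += 1
print("pull counts per arm:", counts)   # arm 2 should dominate
```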

[318] Depth-Breadth Synergy in RLVR: Unlocking LLM Reasoning Gains with Adaptive Exploration

Zhicheng Yang, Zhijiang Guo, Yinya Huang, Yongxin Wang, Dongchun Xie, Yiwei Wang, Xiaodan Liang, Jing Tang

Main category: cs.LG

TL;DR: RLVR’s potential is limited by depth and breadth constraints. DARS addresses depth neglect by re-weighting hard problems, while large-breadth training enhances performance. DARS-B combines both for optimal results.

DetailsMotivation: Current RLVR approaches suffer from systematic bias where cumulative-advantage disproportionately weights medium-accuracy samples while neglecting low-accuracy instances crucial for pushing reasoning boundaries.

Method: Introduced Difficulty Adaptive Rollout Sampling (DARS) to re-weight hard problems through targeted multi-stage rollouts. Also scaled batch size aggressively and replaced PPO’s mini-batch iterations with full-batch updates over multiple epochs.

Result: DARS delivers consistent Pass@K gains without extra inference cost. Large-breadth training significantly enhances Pass@1 performance and sustains high token-level entropy. DARS-B shows simultaneous gains in both Pass@K and Pass@1.

Conclusion: Breadth and adaptive exploration across depth operate as orthogonal dimensions in RLVR, both being key to unleashing the reasoning power of reinforcement learning with verifiable reward.

Abstract: Reinforcement Learning with Verifiable Reward (RLVR) has emerged as a powerful paradigm for unlocking reasoning capabilities in large language models, yet its full potential is hindered by two under-explored dimensions: Depth-the hardest problem a model can sample; Breadth-the number of instances consumed in a single iteration. We dissect the popular GRPO algorithm and reveal a systematic bias: the cumulative-advantage disproportionately weights samples with medium accuracy, while down-weighting the low-accuracy instances that are crucial for pushing reasoning boundaries. To rectify the depth neglect, we introduce Difficulty Adaptive Rollout Sampling (DARS), which re-weights hard problems through targeted multi-stage rollouts, thereby increasing the number of positive rollouts for hard problems. Empirically, naively enlarging rollout size only accelerates convergence and even hurts Pass@K. Our DARS, in contrast, delivers consistent Pass@K gains without extra inference cost at convergence. Just as we adaptively expanded the depth of exploration, we now ask whether aggressively scaling the breadth of training data can further amplify reasoning gains. To this end, we intensely scale batch size and replace PPO’s mini-batch iterations with full-batch updates over multiple epochs. Increasing breadth significantly enhances Pass@1 performance. Large-breadth training sustains high token-level entropy, indicating continued exploration and reduced gradient noise. We further present DARS-B, which augments DARS with large breadth, and demonstrate simultaneous gains in Pass@K and Pass@1. The results confirm that breadth and adaptive exploration across depth operate as orthogonal dimensions in RLVR, which are key to unleashing the reasoning power of RLVR.

[319] PENGUIN: Enhancing Transformer with Periodic-Nested Group Attention for Long-term Time Series Forecasting

Tian Sun, Yuqi Chen, Weiwei Sun

Main category: cs.LG

TL;DR: PENGUIN introduces a novel attention mechanism with periodic-nested relative attention bias and grouped multi-query attention to explicitly model multiple periodic patterns in time series, outperforming both MLP and Transformer models.

DetailsMotivation: Transformer-based models have shown breakthroughs in time series forecasting but their effectiveness remains debatable. The authors aim to revisit self-attention and better capture periodic patterns that are crucial for time series modeling.

Method: Proposes Periodic-Nested Group Attention (PENGUIN) with: 1) periodic-nested relative attention bias to capture periodic structures directly, 2) grouped attention mechanism where each group targets a specific periodicity using multi-query attention to handle multiple coexisting periodicities.

Result: Extensive experiments across diverse benchmarks demonstrate that PENGUIN consistently outperforms both MLP-based and Transformer-based models in long-term time series forecasting.

Conclusion: The proposed PENGUIN approach effectively models periodic patterns through specialized attention mechanisms, proving more effective than existing approaches for time series forecasting tasks.

Abstract: Long-term time series forecasting (LTSF) is a fundamental task with wide-ranging applications. Although Transformer-based models have made significant breakthroughs in forecasting, their effectiveness for time series forecasting remains debatable. In this paper, we revisit the significance of self-attention and propose a simple yet effective mechanism, Periodic-Nested Group Attention, namely PENGUIN. Our approach highlights the importance of explicitly modeling periodic patterns and incorporating relative attention bias for effective time series modeling. To this end, we introduce a periodic-nested relative attention bias that captures periodic structures directly. To handle multiple coexisting periodicities (e.g., daily and weekly cycles), we design a grouped attention mechanism, where each group targets a specific periodicity using a multi-query attention mechanism. Extensive experiments across diverse benchmarks demonstrate that PENGUIN consistently outperforms both MLP-based and Transformer-based models.
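
The sketch below illustrates only the ingredient highlighted above: a relative attention bias indexed by (i - j) mod P, with one bias table per period group, added to the attention logits. The shapes, the single-head attention, the random stand-ins for learned tables, and the naive averaging of groups are assumptions rather than the paper's architecture.

```python
# Sketch: a periodic relative attention bias. For each period P, positions i and j
# share a bias that depends only on (i - j) mod P, so positions a whole number of
# cycles apart get the same bias as offset zero. Random tables stand in for
# learned parameters.
import numpy as np

rng = np.random.default_rng(0)
L, d = 48, 16                       # sequence length and head dimension (assumed)
periods = [24, 7]                   # e.g., daily and weekly cycles (assumed)

q = rng.normal(size=(L, d))
k = rng.normal(size=(L, d))
v = rng.normal(size=(L, d))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

outputs = []
for P in periods:                   # one attention "group" per periodicity
    bias_table = 0.5 * rng.normal(size=P)          # stand-in for a learned table
    i, j = np.meshgrid(np.arange(L), np.arange(L), indexing="ij")
    bias = bias_table[(i - j) % P]                 # periodic relative bias
    logits = q @ k.T / np.sqrt(d) + bias
    outputs.append(softmax(logits) @ v)

out = np.mean(outputs, axis=0)      # naive merge of the per-period groups
print("output shape:", out.shape)
```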

[320] Communication-Efficient Federated Learning with Adaptive Number of Participants

Sergey Skorik, Vladislav Dorofeev, Gleb Molodtsov, Aram Avetisyan, Dmitry Bylinkin, Daniil Medyakov, Aleksandr Beznosikov

Main category: cs.LG

TL;DR: ISP is an adaptive mechanism that dynamically selects the optimal number of clients per round in federated learning to improve communication efficiency by up to 30% without sacrificing model accuracy.

DetailsMotivation: Communication efficiency remains a key bottleneck in federated learning, especially under heterogeneous and dynamic client participation. Existing methods don't adequately address how to choose the optimal number of clients per training round.

Method: Intelligent Selection of Participants (ISP) - an adaptive mechanism that dynamically determines the optimal number of clients per round to enhance communication efficiency while maintaining model quality.

Result: ISP achieves consistent communication savings of up to 30% without compromising final model accuracy across diverse setups including vision transformers, real-world ECG classification, and training with gradient compression.

Conclusion: The selection of the number of clients should be treated as a separate and important task in federated learning, and ISP provides an effective adaptive solution for this challenge.

Abstract: Rapid scaling of deep learning models has enabled performance gains across domains, yet it introduced several challenges. Federated Learning (FL) has emerged as a promising framework to address these concerns by enabling decentralized training. Nevertheless, communication efficiency remains a key bottleneck in FL, particularly under heterogeneous and dynamic client participation. Existing methods, such as FedAvg and FedProx, or other approaches, including client selection strategies, attempt to mitigate communication costs. However, the problem of choosing the number of clients in a training round remains extremely underexplored. We introduce Intelligent Selection of Participants (ISP), an adaptive mechanism that dynamically determines the optimal number of clients per round to enhance communication efficiency without compromising model accuracy. We validate the effectiveness of ISP across diverse setups, including vision transformers, real-world ECG classification, and training with gradient compression. Our results show consistent communication savings of up to 30% without losing the final quality. Applying ISP to different real-world ECG classification setups highlighted the selection of the number of clients as a separate task of federated learning.

[321] Reinforcement Learning-based Adaptive Path Selection for Programmable Networks

José Eduardo Zerna Torres, Marios Avgeris, Chrysa Papagianni, Gergely Pongrácz, István Gódor, Paola Grosso

Main category: cs.LG

TL;DR: Distributed in-network reinforcement learning framework using Stochastic Learning Automata and In-Band Network Telemetry for adaptive path selection in programmable networks.

DetailsMotivation: To enable local, data-driven forwarding decisions that can dynamically adapt to network congestion conditions in programmable networks.

Method: Combines Stochastic Learning Automata (SLA) with real-time telemetry data from In-Band Network Telemetry (INT) for path selection. Implemented on Mininet-based testbed using P4-programmable BMv2 switches.

Result: The SLA-based mechanism converges to effective path selections and adapts to shifting network conditions at line rate.

Conclusion: Proof-of-concept demonstrates successful implementation of distributed in-network reinforcement learning for adaptive path selection in programmable networks.

Abstract: This work presents a proof-of-concept implementation of a distributed, in-network reinforcement learning (IN-RL) framework for adaptive path selection in programmable networks. By combining Stochastic Learning Automata (SLA) with real-time telemetry data collected via In-Band Network Telemetry (INT), the proposed system enables local, data-driven forwarding decisions that adapt dynamically to congestion conditions. The system is evaluated on a Mininet-based testbed using P4-programmable BMv2 switches, demonstrating how our SLA-based mechanism converges to effective path selections and adapts to shifting network conditions at line rate.
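
Stochastic Learning Automata follow a standard update, so a minimal linear reward-inaction sketch for path selection is shown below. The delay model (standing in for INT telemetry), the reward mapping, and the learning rate are assumptions; the P4/BMv2 integration is not reproduced.

```python
# Sketch: linear reward-inaction (L_R-I) learning automaton choosing among paths.
# On a "reward", probability mass shifts toward the chosen path; on a "penalty",
# probabilities are left unchanged. Delay model and learning rate are assumed.
import numpy as np

rng = np.random.default_rng(0)
paths = ["path_A", "path_B", "path_C"]
mean_delay = np.array([8.0, 5.0, 12.0])       # ms, pretend INT-measured (assumed)
p = np.ones(len(paths)) / len(paths)          # action probabilities
a = 0.05                                      # learning rate (assumed)

for step in range(3000):
    i = rng.choice(len(paths), p=p)
    delay = rng.normal(mean_delay[i], 1.0)    # telemetry sample for the chosen path
    rewarded = rng.random() < np.exp(-delay / 10.0)   # lower delay -> more reward
    if rewarded:                              # L_R-I: update only on reward
        p = p - a * p
        p[i] += a
print(dict(zip(paths, np.round(p, 3))))       # mass should concentrate on path_B
```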

[322] Assessing Trustworthiness of AI Training Dataset using Subjective Logic – A Use Case on Bias

Koffi Ismael Ouattara, Ioannis Krontiris, Theo Dimitrakos, Frank Kargl

Main category: cs.LG

TL;DR: First formal framework for assessing AI training dataset trustworthiness using Subjective Logic, enabling uncertainty-aware evaluation of global properties like bias in both centralized and federated learning contexts.

DetailsMotivation: AI systems increasingly rely on training data, making dataset trustworthiness assessment critical, especially for properties like fairness and bias that emerge at the dataset level rather than individual data points.

Method: Built on Subjective Logic to support trust propositions and quantify uncertainty in scenarios with incomplete, distributed, and/or conflicting evidence. Framework instantiated specifically for bias assessment and evaluated on traffic sign recognition dataset.

Result: Method successfully captures class imbalance and remains interpretable and robust in both centralized and federated learning contexts, demonstrating effective uncertainty-aware evaluation of dataset-level properties.

Conclusion: Provides a formal, uncertainty-aware framework for assessing dataset trustworthiness properties like bias, addressing the gap in evaluating emergent dataset-level characteristics that individual data point assessments cannot capture.

Abstract: As AI systems increasingly rely on training data, assessing dataset trustworthiness has become critical, particularly for properties like fairness or bias that emerge at the dataset level. Prior work has used Subjective Logic to assess trustworthiness of individual data, but not to evaluate trustworthiness properties that emerge only at the level of the dataset as a whole. This paper introduces the first formal framework for assessing the trustworthiness of AI training datasets, enabling uncertainty-aware evaluations of global properties such as bias. Built on Subjective Logic, our approach supports trust propositions and quantifies uncertainty in scenarios where evidence is incomplete, distributed, and/or conflicting. We instantiate this framework on the trustworthiness property of bias, and we experimentally evaluate it based on a traffic sign recognition dataset. The results demonstrate that our method captures class imbalance and remains interpretable and robust in both centralized and federated contexts.
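
Subjective Logic's basic building block is easy to show: a binomial opinion (belief, disbelief, uncertainty, base rate) formed from positive and negative evidence counts, with projected probability belief + base_rate * uncertainty. The sketch covers only this standard mapping; the paper's dataset-level trust propositions and fusion operators are not reproduced, and the evidence counts are made up.

```python
# Sketch: forming a binomial Subjective Logic opinion from evidence counts.
# belief = r / (r + s + W), disbelief = s / (r + s + W), uncertainty = W / (r + s + W),
# projected probability = belief + base_rate * uncertainty. Standard mapping only;
# the paper's dataset-level trust propositions and fusion are not reproduced.
def opinion(r, s, base_rate=0.5, W=2.0):
    """r: evidence supporting the proposition, s: evidence against it."""
    total = r + s + W
    b, d, u = r / total, s / total, W / total
    return {"belief": round(b, 3), "disbelief": round(d, 3),
            "uncertainty": round(u, 3),
            "projected_prob": round(b + base_rate * u, 3)}

# e.g. proposition "the class distribution of this dataset is balanced":
print(opinion(r=3, s=15))    # much counter-evidence, little support -> low trust
print(opinion(r=0, s=0))     # no evidence at all -> maximal uncertainty
```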

[323] Disentangled Deep Smoothed Bootstrap for Fair Imbalanced Regression

Samuel Stocksieker, Denys Pommeret, Arthur Charpentier

Main category: cs.LG

TL;DR: Novel method combining disentangled VAE with smoothed bootstrap in latent space to address imbalanced regression problems in tabular data.

DetailsMotivation: Imbalanced distribution learning reduces performance of standard algorithms, with most approaches focused on classification rather than regression problems.

Method: Uses Variational Autoencoders (VAEs) to model latent data representations, combined with disentangled VAE and smoothed bootstrap applied in latent space for data generation.

Result: Evaluated through numerical comparisons with competitors on benchmark datasets for Imbalanced Regression.

Conclusion: Proposes an innovative approach to improve learning on imbalanced tabular data within the IR framework, addressing limitations of standard methods.

Abstract: Imbalanced distribution learning is a common and significant challenge in predictive modeling, often reducing the performance of standard algorithms. Although various approaches address this issue, most are tailored to classification problems, with a limited focus on regression. This paper introduces a novel method to improve learning on tabular data within the Imbalanced Regression (IR) framework, which is a critical problem. We propose using Variational Autoencoders (VAEs) to model and define a latent representation of data distributions. However, VAEs can be inefficient with imbalanced data like other standard approaches. To address this, we develop an innovative data generation method that combines a disentangled VAE with a Smoothed Bootstrap applied in the latent space. We evaluate the efficiency of this method through numerical comparisons with competitors on benchmark datasets for IR.
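
The generation step can be sketched with stand-in parts: a PCA plays the role of the (disentangled) VAE encoder/decoder, samples from the rare target region are bootstrapped in the latent space with Gaussian kernel noise (the smoothed bootstrap), and decoded back to the input space. The encoder choice, bandwidth, and rarity threshold are assumptions.

```python
# Sketch: smoothed bootstrap in a latent space to oversample a rare target region.
# PCA stands in for the paper's disentangled VAE; bandwidth and the definition of
# "rare" samples are illustrative assumptions.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 10))
y = X[:, 0] * 2 + rng.normal(scale=0.5, size=n)       # regression target: large y is rare

rare_mask = y > np.quantile(y, 0.95)                  # assumed rarity threshold
encoder = PCA(n_components=4).fit(X)                  # stand-in for a VAE encoder
Z_rare = encoder.transform(X[rare_mask])

# Smoothed bootstrap: resample rare latents with replacement and add kernel noise.
n_new, h = 200, 0.3                                   # synthetic rows, bandwidth (assumed)
idx = rng.integers(0, len(Z_rare), size=n_new)
Z_new = Z_rare[idx] + h * Z_rare.std(axis=0) * rng.normal(size=(n_new, 4))

X_synth = encoder.inverse_transform(Z_new)            # "decode" back to input space
print("synthetic samples for the rare region:", X_synth.shape)
```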

[324] One Shot vs. Iterative: Rethinking Pruning Strategies for Model Compression

Mikołaj Janusz, Tomasz Wojnar, Yawei Li, Luca Benini, Kamil Adamczewski

Main category: cs.LG

TL;DR: Systematic comparison shows one-shot pruning better at low ratios, iterative pruning superior at high ratios, leading to proposed hybrid approach

DetailsMotivation: To provide rigorous empirical comparison between one-shot and iterative pruning methods, challenging the assumed preference for iterative pruning without systematic testing

Method: Comprehensive benchmarking across structured and unstructured pruning settings with different pruning criteria and modalities, followed by development of a hybrid approach

Result: Each method has specific advantages: one-shot pruning more effective at lower pruning ratios, iterative pruning performs better at higher ratios

Conclusion: Advocates for patience-based pruning and introduces hybrid approach that can outperform traditional methods, providing practical guidance for pruning strategy selection

Abstract: Pruning is a core technique for compressing neural networks to improve computational efficiency. This process is typically approached in two ways: one-shot pruning, which involves a single pass of training and pruning, and iterative pruning, where pruning is performed over multiple cycles for potentially finer network refinement. Although iterative pruning has historically seen broader adoption, this preference is often assumed rather than rigorously tested. Our study presents one of the first systematic and comprehensive comparisons of these methods, providing rigorous definitions, benchmarking both across structured and unstructured settings, and applying different pruning criteria and modalities. We find that each method has specific advantages: one-shot pruning proves more effective at lower pruning ratios, while iterative pruning performs better at higher ratios. Building on these findings, we advocate for patience-based pruning and introduce a hybrid approach that can outperform traditional methods in certain scenarios, providing valuable insights for practitioners selecting a pruning strategy tailored to their goals and constraints. Source code is available at https://github.com/janumiko/pruning-benchmark.
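
The distinction being benchmarked can be sketched with plain global magnitude pruning: one-shot removes the full target fraction in a single pass, while iterative removes a growing fraction over several cycles (with fine-tuning between cycles in real pipelines, omitted here). The weight matrix and schedule are toy assumptions.

```python
# Sketch: one-shot vs. iterative global magnitude pruning on a toy weight matrix.
# Real pipelines fine-tune between iterative pruning cycles; that step is omitted.
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256))
target_sparsity = 0.9

def magnitude_mask(weights, sparsity):
    """Zero out the `sparsity` fraction of smallest-magnitude weights."""
    thresh = np.quantile(np.abs(weights), sparsity)
    return (np.abs(weights) > thresh).astype(weights.dtype)

# One-shot: a single pruning pass at the full target ratio.
w_oneshot = w * magnitude_mask(w, target_sparsity)

# Iterative: prune gradually over several cycles (fine-tuning would go in the loop).
w_iter, cycles = w.copy(), 5
for c in range(1, cycles + 1):
    sparsity_c = target_sparsity * c / cycles         # e.g. 18%, 36%, ..., 90%
    w_iter *= magnitude_mask(w_iter, sparsity_c)
    # ... fine-tune the pruned network here in a real pipeline ...

for name, mat in [("one-shot", w_oneshot), ("iterative", w_iter)]:
    print(f"{name:9s} sparsity = {np.mean(mat == 0):.3f}")
```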

[325] FedUP: Efficient Pruning-based Federated Unlearning for Model Poisoning Attacks

Nicolò Romandini, Cristian Borcea, Rebecca Montanari, Luca Foschini

Main category: cs.LG

TL;DR: FedUP is a lightweight federated unlearning algorithm that efficiently removes malicious clients’ influence by pruning specific model connections using only the last training round weights, achieving effective unlearning while being faster and more storage-efficient than state-of-the-art solutions.

DetailsMotivation: Federated Learning is vulnerable to model poisoning attacks, and existing Federated Unlearning approaches assume trusted clients. However, when dealing with malicious and potentially colluding clients who won't cooperate in unlearning, a new approach is needed to selectively remove their influence without complete retraining.

Method: FedUP identifies and zeroes the highest magnitude weights that diverge most between benign and malicious clients’ latest updates. It prunes specific connections in the attacked model using only clients’ weights from the last training round before unlearning, carefully preserving benign information while isolating malicious influence.

Result: FedUP effectively reduces malicious influence, lowering accuracy on malicious data to match models retrained from scratch while preserving performance on benign data. It works under strong adversarial conditions (up to 50% malicious clients with full knowledge of aggregation) across IID and Non-IID data, and against label-flipping and backdoor attacks.

Conclusion: FedUP provides an efficient, robust solution for federated unlearning with malicious clients, consistently outperforming state-of-the-art methods in speed and storage efficiency while effectively mitigating poisoning attacks without requiring complete model retraining.

Abstract: Federated Learning (FL) can be vulnerable to attacks, such as model poisoning, where adversaries send malicious local weights to compromise the global model. Federated Unlearning (FU) is emerging as a solution to address such vulnerabilities by selectively removing the influence of detected malicious contributors on the global model without complete retraining. However, unlike typical FU scenarios where clients are trusted and cooperative, applying FU with malicious and possibly colluding clients is challenging because their collaboration in unlearning their data cannot be assumed. This work presents FedUP, a lightweight FU algorithm designed to efficiently mitigate malicious clients’ influence by pruning specific connections within the attacked model. Our approach achieves efficiency by relying only on clients’ weights from the last training round before unlearning to identify which connections to inhibit. Isolating malicious influence is non-trivial due to overlapping updates from benign and malicious clients. FedUP addresses this by carefully selecting and zeroing the highest magnitude weights that diverge the most between the latest updates from benign and malicious clients while preserving benign information. FedUP is evaluated under a strong adversarial threat model, where up to 50%-1 of the clients could be malicious and have full knowledge of the aggregation process. We demonstrate the effectiveness, robustness, and efficiency of our solution through experiments across IID and Non-IID data, under label-flipping and backdoor attacks, and by comparing it with state-of-the-art (SOTA) FU solutions. In all scenarios, FedUP reduces malicious influence, lowering accuracy on malicious data to match that of a model retrained from scratch while preserving performance on benign data. FedUP achieves effective unlearning while consistently being faster and saving storage compared to the SOTA.
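The sketch below gives one rough reading of the pruning step: score each connection by how strongly the last-round benign and malicious updates diverge, weighted by the global weight's magnitude, and zero the top-scoring connections. The scoring rule and prune fraction are illustrative assumptions, not FedUP's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 1000
benign_updates = rng.normal(size=(8, dim))            # last-round weights from benign clients
malicious_updates = rng.normal(size=(2, dim)) + 0.5   # last-round weights from flagged clients
global_weights = rng.normal(size=dim)

def fedup_like_prune(global_w, benign, malicious, prune_frac=0.05):
    """Zero high-magnitude weights that diverge most between benign and
    malicious clients' latest updates (a rough reading of the FedUP idea)."""
    divergence = np.abs(benign.mean(axis=0) - malicious.mean(axis=0))
    score = divergence * np.abs(global_w)             # large and divergent connections
    k = int(prune_frac * global_w.size)
    prune_idx = np.argsort(score)[-k:]                # top-k suspicious connections
    unlearned = global_w.copy()
    unlearned[prune_idx] = 0.0
    return unlearned

unlearned_model = fedup_like_prune(global_weights, benign_updates, malicious_updates)
print((unlearned_model == 0).sum(), "connections zeroed")
```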

[326] A Comprehensive Re-Evaluation of Biometric Modality Properties in the Modern Era

Rouqaiah Al-Refai, Pankaja Priya Ramasamy, Ragini Ramesh, Patricia Arias-Cabarcos, Philipp Terhörst

Main category: cs.LG

TL;DR: This paper presents an updated framework for evaluating biometric modalities through expert surveys, revealing significant shifts in property ratings compared to the outdated 1998 framework and showing strong alignment with empirical dataset analysis.

DetailsMotivation: The existing 1998 evaluation framework for biometric modalities is outdated and fails to capture recent technological developments and emerging vulnerabilities in biometric systems, necessitating a modern assessment approach.

Method: Conducted an expert survey with 24 biometric specialists to evaluate biometric modalities, analyzed expert agreement levels, and compared expert assessments with dataset-level uncertainty across 55 biometric datasets.

Result: Substantial shifts in property ratings were found - face recognition improved due to technological progress while fingerprint reliability decreased due to emerging vulnerabilities. Strong alignment was observed between expert evaluations and empirical dataset evidence.

Conclusion: The study provides a reliable updated evaluation framework, highlights key open challenges through expert disagreements, and demonstrates the importance of integrating expert insight with empirical evidence to guide future biometric research.

Abstract: The rapid advancement of authentication systems and their increasing reliance on biometrics for a faster and more accurate user verification experience highlight the critical need for a reliable framework to evaluate the suitability of biometric modalities for specific applications. Currently, the most widely known evaluation framework is a comparative table from 1998, which no longer adequately captures recent technological developments or emerging vulnerabilities in biometric systems. To address these challenges, this work revisits the evaluation of biometric modalities through an expert survey involving 24 biometric specialists. The findings indicate substantial shifts in property ratings across modalities. For example, face recognition shows improved ratings due to technological progress, while fingerprint shows decreased reliability because of emerging vulnerabilities and attacks. Further analysis of expert agreement levels across rated properties highlighted the consistency of the provided evaluations and ensured the reliability of the ratings. Finally, expert assessments are compared with dataset-level uncertainty across 55 biometric datasets, revealing strong alignment in most modalities and underscoring the importance of integrating empirical evidence with expert insight. Moreover, the identified expert disagreements reveal key open challenges and help guide future research toward resolving them.

[327] Fisher-Orthogonal Projection Methods for Natural Gradient Descent with Large Batches

Yishun Lu, Wesley Armour

Main category: cs.LG

TL;DR: FOP enables effective second-order optimization at very large batch sizes by constructing variance-aware update directions using orthogonal gradient projections under the Fisher metric.

DetailsMotivation: Existing optimizers struggle with large batch sizes - first-order methods lose escape ability from poor minima, while second-order methods require excessive damping that washes out curvature information.

Method: Fisher-Orthogonal Projection (FOP) leverages gradients from two sub-batches to enhance the average gradient with an orthogonal component of the gradient difference under the Fisher-metric.

Result: FOP restores second-order method effectiveness at very large batch sizes, enabling scalable training with improved generalization and faster convergence.

Conclusion: FOP provides a novel solution to overcome optimization challenges at extreme batch sizes by maintaining curvature information through orthogonal gradient projections.

Abstract: Modern GPUs are equipped with large amounts of high-bandwidth memory, enabling them to support mini-batch sizes of up to tens of thousands of training samples. However, most existing optimizers struggle to perform effectively at such a large batch size. As batch size increases, gradient noise decreases due to averaging over many samples, limiting the ability of first-order methods to escape sharp or suboptimal minima and reach the global minimum. Meanwhile, second-order methods like the natural gradient with Kronecker-Factored Approximate Curvature (KFAC) often require excessively high damping to remain stable at large batch sizes. This high damping effectively washes out the curvature information that gives these methods their advantage, reducing their performance to that of simple gradient descent. In this paper, we introduce Fisher-Orthogonal Projection (FOP), a novel technique that restores the effectiveness of the second-order method at very large batch sizes, enabling scalable training with improved generalization and faster convergence. FOP constructs a variance-aware update direction by leveraging gradients from two sub-batches, enhancing the average gradient with a component of the gradient difference that is orthogonal to the average under the Fisher-metric.
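A small NumPy sketch of the projection idea under a diagonal Fisher approximation: split the batch into two sub-batches, keep the average gradient, and add the component of their difference that is orthogonal to the average under the Fisher inner product. The diagonal Fisher and the unit weighting of the orthogonal term are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 100
g1, g2 = rng.normal(size=dim), rng.normal(size=dim)   # gradients from two sub-batches
fisher_diag = rng.uniform(0.1, 1.0, size=dim)          # diagonal Fisher approximation

def fop_direction(g1, g2, fisher_diag, eps=1e-12):
    """Average gradient plus the component of the sub-batch gradient difference
    that is orthogonal to the average under the (diagonal) Fisher metric."""
    g_avg = 0.5 * (g1 + g2)
    d = g1 - g2
    coef = np.dot(d * fisher_diag, g_avg) / (np.dot(g_avg * fisher_diag, g_avg) + eps)
    d_orth = d - coef * g_avg              # <d_orth, g_avg>_F = 0 by construction
    return g_avg + d_orth, d_orth, g_avg

update, d_orth, g_avg = fop_direction(g1, g2, fisher_diag)
print(np.dot(d_orth * fisher_diag, g_avg))   # ~0: orthogonality under the Fisher metric
```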

[328] Revisiting Diffusion Q-Learning: From Iterative Denoising to One-Step Action Generation

Thanh Nguyen, Chang D. Yoo

Main category: cs.LG

TL;DR: OFQL enables efficient one-step action generation in offline RL by reformulating Diffusion Q-Learning with Flow Matching, eliminating multi-step denoising while improving performance and speed.

DetailsMotivation: Diffusion Q-Learning (DQL) achieves strong results but is limited by multi-step denoising requirements during training and inference, which is computationally expensive. One-step denoising in DQL causes performance degradation.

Method: Proposes One-Step Flow Q-Learning (OFQL) that reformulates DQL within the Flow Matching framework. Learns an average velocity field instead of curved generative trajectories to enable direct one-step action generation without auxiliary models or distillation.

Result: Extensive experiments on D4RL benchmark show OFQL outperforms DQL and other diffusion-based baselines while substantially reducing both training and inference time compared to DQL.

Conclusion: OFQL successfully addresses DQL’s limitations by enabling efficient one-step generation through Flow Matching, achieving better performance with significantly reduced computational costs.

Abstract: The generative power of diffusion models (DMs) has recently enabled high-performing decision-making algorithms in offline reinforcement learning (RL), achieving state-of-the-art results across standard benchmarks. Among them, Diffusion Q-Learning (DQL) stands out as a leading method for its consistently strong performance. Nevertheless, DQL remains limited in practice due to its reliance on multi-step denoising for action generation during both training and inference. Although one-step denoising is desirable, simply applying it to DQL leads to a drastic performance drop. In this work, we revisit DQL and identify its core limitations. We then propose One-Step Flow Q-Learning (OFQL), a novel framework that enables efficient one-step action generation during both training and inference, without requiring auxiliary models, distillation, or multi-phase training. Specifically, OFQL reformulates DQL within the sample-efficient Flow Matching (FM) framework. While conventional FM induces curved generative trajectories that impede one-step generation, OFQL instead learns an average velocity field that facilitates direct, accurate action generation. Collectively, OFQL eliminates the need for multi-step sampling and recursive gradient updates in DQL, resulting in faster and more robust training and inference. Extensive experiments on the D4RL benchmark demonstrate that OFQL outperforms DQL and other diffusion-based baselines, while substantially reducing both training and inference time compared to DQL.

[329] Automated Energy-Aware Time-Series Model Deployment on Embedded FPGAs for Resilient Combined Sewer Overflow Management

Tianheng Ling, Vipin Singh, Chao Qian, Felix Biessmann, Gregor Schiele

Main category: cs.LG

TL;DR: Edge-based AI framework for sewer overflow forecasting using compressed Transformer/LSTM models on FPGAs, achieving energy-efficient local inference without cloud dependency.

DetailsMotivation: Climate change intensifies extreme weather events, challenging aging sewer systems and increasing overflow risks. Cloud-based AI solutions are unreliable during communication outages, necessitating local edge computing solutions.

Method: Lightweight Transformer and LSTM models compressed via integer-only quantization, deployed on AMD Spartan-7 FPGA through automated hardware-aware pipeline that optimizes for both prediction accuracy and energy consumption.

Result: 8-bit Transformer achieved MSE 0.0376 at 0.370 mJ per inference, while 8-bit LSTM used 0.009 mJ (40x less energy) but had 14.89% worse accuracy (MSE 0.0432) and longer training time.

Conclusion: Trade-off between energy efficiency and accuracy enables model selection based on deployment priorities - LSTM for ultra-low energy consumption, Transformer for higher accuracy. Enables resilient local forecasting for sewer systems.

Abstract: Extreme weather events, intensified by climate change, increasingly challenge aging combined sewer systems, raising the risk of untreated wastewater overflow. Accurate forecasting of sewer overflow basin filling levels can provide actionable insights for early intervention, helping mitigate uncontrolled discharge. In recent years, AI-based forecasting methods have offered scalable alternatives to traditional physics-based models, but their reliance on cloud computing limits their reliability during communication outages. To address this, we propose an end-to-end forecasting framework that enables energy-efficient inference directly on edge devices. Our solution integrates lightweight Transformer and Long Short-Term Memory (LSTM) models, compressed via integer-only quantization for efficient on-device execution. Moreover, an automated hardware-aware deployment pipeline is used to search for optimal model configurations by jointly minimizing prediction error and energy consumption on an AMD Spartan-7 XC7S15 FPGA. Evaluated on real-world sewer data, the selected 8-bit Transformer model, trained on 24 hours of historical measurements, achieves high accuracy (MSE 0.0376) at an energy cost of 0.370 mJ per inference. In contrast, the optimal 8-bit LSTM model requires significantly less energy (0.009 mJ, over 40x lower) but yields 14.89% worse accuracy (MSE 0.0432) and much longer training time. This trade-off highlights the need to align model selection with deployment priorities, favoring LSTM for ultra-low energy consumption or Transformer for higher predictive accuracy. In general, our work enables local, energy-efficient forecasting, contributing to more resilient combined sewer systems. All code can be found in the GitHub Repository (https://github.com/tianheng-ling/EdgeOverflowForecast).
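For reference, the sketch below shows a generic symmetric 8-bit quantization of a weight tensor; the paper's integer-only, hardware-aware pipeline on the FPGA is considerably more involved, so this only illustrates the basic quantize/dequantize arithmetic.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric 8-bit quantization: map floats to int8 with a per-tensor scale."""
    max_abs = np.abs(x).max()
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=(16, 16)).astype(np.float32)   # a toy weight matrix
q, s = quantize_int8(w)
print("max abs error:", np.abs(w - dequantize(q, s)).max())
```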

[330] Categorical Policies: Multimodal Policy Learning and Exploration in Continuous Control

SM Mazharul Islam, Manfred Huber

Main category: cs.LG

TL;DR: Categorical Policies replace standard Gaussian policies in deep RL with multimodal categorical distributions for better exploration and performance in continuous control tasks.

DetailsMotivation: Standard Gaussian policies in deep RL are limited to unimodal behavior, which hinders exploration and performance in complex environments with sparse rewards, complex dynamics, or varying contexts.

Method: Introduces Categorical Policies that use an intermediate categorical distribution to model multimodal behavior modes, with differentiable sampling schemes that maintain efficient gradient-based optimization while enabling discrete latent structure.

Result: The multimodal policies converge faster and outperform standard Gaussian policies on DeepMind Control Suite environments through better exploration capabilities.

Conclusion: Categorical distributions serve as a powerful tool for structured exploration and multimodal behavior representation in continuous control, addressing limitations of traditional Gaussian policies.

Abstract: A policy in deep reinforcement learning (RL), either deterministic or stochastic, is commonly parameterized as a Gaussian distribution alone, limiting the learned behavior to be unimodal. However, the nature of many practical decision-making problems favors a multimodal policy that facilitates robust exploration of the environment and thus helps address learning challenges arising from sparse rewards, complex dynamics, or the need for strategic adaptation to varying contexts. This issue is exacerbated in continuous control domains where exploration usually takes place in the vicinity of the predicted optimal action, either through additive Gaussian noise or the sampling process of a stochastic policy. In this paper, we introduce Categorical Policies to model multimodal behavior modes with an intermediate categorical distribution, and then generate an output action that is conditioned on the sampled mode. We explore two sampling schemes that ensure differentiable discrete latent structure while maintaining efficient gradient-based optimization. By utilizing a latent categorical distribution to select the behavior mode, our approach naturally expresses multimodality while remaining fully differentiable via the sampling tricks. We evaluate our multimodal policy on a set of DeepMind Control Suite environments, demonstrating that through better exploration, our learned policies converge faster and outperform standard Gaussian policies. Our results indicate that the Categorical distribution serves as a powerful tool for structured exploration and multimodal behavior representation in continuous control.
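One plausible way to wire such a policy head is sketched below: a straight-through Gumbel-softmax sample selects a behavior mode, and a Gaussian head conditioned on that mode emits the action, keeping the whole path differentiable. The layer sizes and the choice of Gumbel-softmax (one of several possible sampling tricks) are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CategoricalModePolicy(nn.Module):
    """Toy multimodal policy: a categorical head picks a behavior mode,
    and a Gaussian head conditioned on the sampled mode emits the action."""
    def __init__(self, obs_dim, act_dim, n_modes=4, hidden=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.mode_logits = nn.Linear(hidden, n_modes)
        self.mean = nn.Linear(hidden + n_modes, act_dim)
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs):
        h = self.trunk(obs)
        # Differentiable discrete mode sample (straight-through Gumbel-softmax).
        mode = F.gumbel_softmax(self.mode_logits(h), tau=1.0, hard=True)
        mu = self.mean(torch.cat([h, mode], dim=-1))
        action = mu + self.log_std.exp() * torch.randn_like(mu)
        return action, mode

policy = CategoricalModePolicy(obs_dim=8, act_dim=2)
act, mode = policy(torch.randn(5, 8))
print(act.shape, mode.argmax(dim=-1))   # actions plus the sampled behavior modes
```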

[331] How Usable is Automated Feature Engineering for Tabular Data?

Bastian Schäfer, Lennart Purucker, Maciej Janowski, Frank Hutter

Main category: cs.LG

TL;DR: Survey of 53 automated feature engineering methods reveals they are generally hard to use, poorly documented, lack community support, and don’t allow time/memory constraints, highlighting the need for more usable AutoFE solutions.

DetailsMotivation: Automated feature engineering (AutoFE) is essential for machine learning performance but manual feature engineering is expensive and time-consuming. However, existing AutoFE methods have never been evaluated for practical usability.

Method: Conducted a comprehensive survey and analysis of 53 different automated feature engineering methods to assess their usability, documentation quality, community support, and practical constraints.

Result: Found that AutoFE methods are generally difficult to use, lack proper documentation, have no active user communities, and critically lack time and memory constraint settings that are necessary for practical deployment.

Conclusion: There is a significant need for future development of more usable, well-engineered automated feature engineering methods that address practical constraints and provide better user support.

Abstract: Tabular data, consisting of rows and columns, is omnipresent across various machine learning applications. Each column represents a feature, and features can be combined or transformed to create new, more informative features. Such feature engineering is essential to achieve peak performance in machine learning. Since manual feature engineering is expensive and time-consuming, a substantial effort has been put into automating it. Yet, existing automated feature engineering (AutoFE) methods have never been investigated regarding their usability for practitioners. Thus, we investigated 53 AutoFE methods. We found that these methods are, in general, hard to use, lack documentation, and have no active communities. Furthermore, no method allows users to set time and memory constraints, which we see as a necessity for usable automation. Our survey highlights the need for future work on usable, well-engineered AutoFE methods.

[332] Convergent Reinforcement Learning Algorithms for Stochastic Shortest Path Problem

Soumyajit Guin, Shalabh Bhatnagar

Main category: cs.LG

TL;DR: Novel algorithms for Stochastic Shortest Path problems in both tabular and function approximation settings with proven convergence and superior performance.

DetailsMotivation: SSP problems are fundamental in Reinforcement Learning as they can represent various cost-criteria, but existing algorithms need improvement in convergence guarantees and performance.

Method: Proposed two tabular algorithms and one function approximation algorithm specifically designed for Stochastic Shortest Path problems with asymptotic almost-sure convergence proofs.

Result: Tabular algorithms outperformed other convergent RL algorithms, and the function approximation algorithm demonstrated reliable performance compared to existing methods in that setting.

Conclusion: The proposed algorithms provide effective solutions for SSP problems with strong theoretical convergence guarantees and practical performance advantages in both tabular and function approximation scenarios.

Abstract: In this paper we propose two algorithms in the tabular setting and an algorithm for the function approximation setting for the Stochastic Shortest Path (SSP) problem. SSP problems form an important class of problems in Reinforcement Learning (RL), as other types of cost-criteria in RL can be formulated in the setting of SSP. We show asymptotic almost-sure convergence for all our algorithms. We observe superior performance of our tabular algorithms compared to other well-known convergent RL algorithms. We further observe reliable performance of our function approximation algorithm compared to other algorithms in the function approximation setting.

[333] AutoScale: Linear Scalarization Guided by Multi-Task Optimization Metrics

Yi Yang, Kei Ikemura, Qingwen Zhang, Xiaomeng Zhu, Ci Li, Nazre Batool, Sina Sharif Mansouri, John Folkesson

Main category: cs.LG

TL;DR: AutoScale uses multi-task optimization metrics to automatically find optimal weights for linear scalarization, eliminating expensive hyperparameter search while achieving superior performance.

DetailsMotivation: Linear scalarization with fixed task weights can match complex MTO methods, but determining optimal weights requires exhaustive search. The paper aims to understand why certain weights work well and how to find them efficiently.

Method: Two-phase AutoScale framework that uses MTO metrics (like gradient magnitude similarity) to guide weight selection for linear scalarization without expensive search.

Result: AutoScale consistently achieves superior performance with high efficiency across diverse datasets, including a new large-scale benchmark.

Conclusion: The study establishes a connection between linear scalarization and MTO methods, showing that well-performing weights exhibit specific MTO metric trends, enabling efficient automated weight selection.

Abstract: Recent multi-task learning studies suggest that linear scalarization, when using well-chosen fixed task weights, can achieve comparable to or even better performance than complex multi-task optimization (MTO) methods. It remains unclear why certain weights yield optimal performance and how to determine these weights without relying on exhaustive hyperparameter search. This paper establishes a direct connection between linear scalarization and MTO methods, revealing through extensive experiments that well-performing scalarization weights exhibit specific trends in key MTO metrics, such as high gradient magnitude similarity. Building on this insight, we introduce AutoScale, a simple yet effective two-phase framework that uses these MTO metrics to guide weight selection for linear scalarization, without expensive weight search. AutoScale consistently shows superior performance with high efficiency across diverse datasets including a new large-scale benchmark.
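As a concrete example of the kind of MTO metric AutoScale relies on, one common definition of gradient magnitude similarity between two task gradients is sketched below; it equals 1 when the gradients have equal norms and shrinks as they become imbalanced.

```python
import numpy as np

def gradient_magnitude_similarity(g_i, g_j, eps=1e-12):
    """GMS in [0, 1]: 1 when the two task gradients have equal norms."""
    ni, nj = np.linalg.norm(g_i), np.linalg.norm(g_j)
    return 2.0 * ni * nj / (ni**2 + nj**2 + eps)

rng = np.random.default_rng(0)
g_task_a = rng.normal(size=1000)
g_task_b = 3.0 * rng.normal(size=1000)   # a task with much larger gradients
print(gradient_magnitude_similarity(g_task_a, g_task_b))   # well below 1: imbalanced
```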

[334] Multi-User Contextual Cascading Bandits for Personalized Recommendation

Jiho Park, Huiwen Jia

Main category: cs.LG

TL;DR: A new combinatorial bandit framework called Multi-User Contextual Cascading Bandit (MCCB) that models realistic online advertising with multiple users interacting with sequential items simultaneously, featuring two algorithms with improved regret bounds.

DetailsMotivation: To capture realistic online advertising scenarios where multiple users interact with sequentially displayed items simultaneously, addressing limitations of classical contextual bandits by integrating cascading feedback, parallel context sessions, and heterogeneous rewards.

Method: Proposed two algorithms: Upper Confidence Bound with Backward Planning (UCBBP) and Active Upper Confidence Bound with Backward Planning (AUCBBP), both designed for the MCCB framework with cascading feedback and parallel user sessions.

Result: UCBBP achieves a regret bound of Õ(√(THN)) and AUCBBP shows a strict efficiency improvement with a regret bound of Õ(√(T+HN)), validated through numerical experiments demonstrating empirical effectiveness.

Conclusion: The MCCB framework successfully models multi-user sequential interactions in online advertising, with both proposed algorithms providing strong theoretical guarantees and practical performance improvements over classical approaches.

Abstract: We introduce a Multi-User Contextual Cascading Bandit model, a new combinatorial bandit framework that captures realistic online advertising scenarios where multiple users interact with sequentially displayed items simultaneously. Unlike classical contextual bandits, MCCB integrates three key structural elements: (i) cascading feedback based on sequential arm exposure, (ii) parallel context sessions enabling selective exploration, and (iii) heterogeneous arm-level rewards. We first propose Upper Confidence Bound with Backward Planning (UCBBP), a UCB-style algorithm tailored to this setting, and prove that it achieves a regret bound of $\widetilde{O}(\sqrt{THN})$ over $T$ episodes, $H$ session steps, and $N$ contexts per episode. Motivated by the fact that many users interact with the system simultaneously, we introduce a second algorithm, termed Active Upper Confidence Bound with Backward Planning (AUCBBP), which shows a strict efficiency improvement in context scaling, i.e., user scaling, with a regret bound of $\widetilde{O}(\sqrt{T+HN})$. We validate our theoretical findings via numerical experiments, demonstrating the empirical effectiveness of both algorithms under various settings.

[335] Quiet Feature Learning in Algorithmic Tasks

Prudhviraj Naidu, Zixian Wang, Leon Bergen, Ramamohan Paturi

Main category: cs.LG

TL;DR: Transformer models show hidden algorithmic learning phases where internal representations develop before visible loss improvements, challenging cross-entropy as a complete training metric.

DetailsMotivation: To understand why Transformer language models exhibit unexpected phase transitions in loss curves during algorithmic task training, and to investigate what learning occurs during apparently flat loss periods.

Method: Trained Transformer-based language models on 10 foundational algorithmic tasks, analyzed loss curves, probed internal representations to identify ‘quiet features’, and conducted ablation experiments to test causal necessity.

Result: Discovered pronounced phase transitions where validation loss remains flat then abruptly drops. Identified ‘quiet features’ - intermediate algorithmic computations learned during flat loss periods that are causally necessary for task performance but don’t immediately improve output loss.

Conclusion: Substantial representational progress can occur hidden beneath flat loss curves, challenging cross-entropy as a reliable proxy for learning and motivating development of richer training diagnostics.

Abstract: We train Transformer-based language models on ten foundational algorithmic tasks and observe pronounced phase transitions in their loss curves that deviate from established power-law scaling trends. Over large ranges of compute, the validation loss barely improves, then abruptly decreases. Probing the models’ internal representations reveals that quiet features are learned prior to any decrease in task loss. These quiet features represent intermediate algorithmic computations that do not by themselves improve the output loss. Ablation experiments demonstrate that individual quiet features are causally necessary for task performance. Our results demonstrate that substantial representational progress can remain hidden beneath an apparently flat loss curve, challenging the prevailing use of cross-entropy as a proxy for learning and motivating richer diagnostics for monitoring model training.
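Probing for such hidden features typically amounts to fitting a simple linear classifier on frozen hidden states; the sketch below uses synthetic activations and a synthetic latent feature purely to illustrate the recipe, not the paper's tasks or models.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-ins for hidden activations collected from a checkpoint and binary labels
# for an intermediate quantity (e.g., a carry bit in an addition task).
hidden = rng.normal(size=(2000, 128))
latent_feature = (hidden[:, :4].sum(axis=1) > 0).astype(int)   # synthetic "quiet feature"

X_tr, X_te, y_tr, y_te = train_test_split(hidden, latent_feature, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))   # high accuracy => feature is linearly decodable
```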

[336] Formal Algorithms for Model Efficiency

Naman Tyagi, Srishti Das, Kunal, Vatsal Gupta

Main category: cs.LG

TL;DR: The Knob-Meter-Rule (KMR) framework provides a unified mathematical formalism for representing and reasoning about diverse model efficiency techniques like pruning, quantization, and knowledge distillation through consistent knobs, rules, and meters.

DetailsMotivation: To address the fragmentation in model efficiency research by creating a unified framework that can systematically represent and combine diverse efficiency techniques, enabling better composition, analysis, and optimization.

Method: Abstracts efficiency methods into three components: controllable knobs (parameters to adjust), deterministic rules (how to apply changes), and measurable meters (metrics to evaluate). Introduces Budgeted-KMR algorithm for iterative optimization and provides algorithmic templates for instantiating known methods.

Result: Demonstrates that well-known efficiency methods can be instantiated as KMR triples, reveals underlying relationships between different techniques, and enables systematic composition of multiple efficiency methods into hybrid pipelines.

Conclusion: KMR offers both conceptual unification and practical tools for advancing model efficiency research, providing foundations for automated policy learning, dynamic adaptation, and theoretical analysis of cost-quality trade-offs.

Abstract: We introduce the Knob-Meter-Rule (KMR) framework, a unified formalism for representing and reasoning about model efficiency techniques in deep learning. By abstracting diverse methods, including pruning, quantization, knowledge distillation, and parameter-efficient architectures, into a consistent set of controllable knobs, deterministic rules, and measurable meters, KMR provides a mathematically precise and modular perspective on efficiency optimization. The framework enables systematic composition of multiple techniques, flexible policy-driven application, and iterative budgeted optimization through the Budgeted-KMR algorithm. We demonstrate how well-known efficiency methods can be instantiated as KMR triples and present concise algorithmic templates for each. The framework highlights underlying relationships between methods, facilitates hybrid pipelines, and lays the foundation for future research in automated policy learning, dynamic adaptation, and theoretical analysis of cost-quality trade-offs. Overall, KMR offers both a conceptual and practical tool for unifying and advancing model efficiency research.
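A minimal sketch of how a KMR triple and a budgeted loop might look in code, under my own reading of the formalism (the knob as a named parameter, the rule as a function applying a setting, the meter as a function returning metrics); the greedy search and the toy "model" are illustrative only.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Any

@dataclass
class KMR:
    """One efficiency technique as a (knob, rule, meter) triple."""
    knob: str                                   # e.g. "sparsity" or "bit_width"
    rule: Callable[[Any, float], Any]           # apply a knob setting to a model
    meter: Callable[[Any], Dict[str, float]]    # measure cost/quality afterwards

def budgeted_kmr(model, kmr: KMR, settings, budget: float, cost_key="flops"):
    """Greedy sketch of a budgeted loop: tighten the knob until the budget is met."""
    for s in settings:                          # e.g. increasing sparsity levels
        candidate = kmr.rule(model, s)
        metrics = kmr.meter(candidate)
        if metrics[cost_key] <= budget:
            return candidate, s, metrics
    return model, None, kmr.meter(model)

# Toy instantiation: the "model" is a dict with a flop count; the rule scales it down.
prune = KMR(
    knob="sparsity",
    rule=lambda m, s: {**m, "flops": m["flops"] * (1 - s)},
    meter=lambda m: {"flops": m["flops"]},
)
model = {"flops": 1e9}
compressed, setting, metrics = budgeted_kmr(model, prune, [0.25, 0.5, 0.75], budget=4e8)
print(setting, metrics)   # first knob setting that satisfies the budget
```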

[337] GDNSQ: Gradual Differentiable Noise Scale Quantization for Low-bit Neural Networks

Sergey Salishev, Ian Akhremchik

Main category: cs.LG

TL;DR: A novel quantization method using differentiable STE with learnable parameters achieves competitive accuracy even at extreme W1A1 quantization while maintaining efficiency.

DetailsMotivation: Quantization reduces neural network capacity as bit-width decreases, creating bottlenecks that limit performance in low-bit settings.

Method: Fully differentiable Straight-Through Estimator with learnable bit-width, noise scale and clamp bounds, using exterior-point penalty for target bit-width and metric smoothing via distillation for stability.

Result: Achieves competitive accuracy down to extreme W1A1 quantization setting while retaining the efficiency of STE-based methods.

Conclusion: The proposed differentiable quantization approach effectively addresses quantization bottlenecks and enables high performance even at extremely low bit-widths with maintained training efficiency.

Abstract: Quantized neural networks can be viewed as a chain of noisy channels, where rounding in each layer reduces capacity as bit-width shrinks; the floating-point (FP) checkpoint sets the maximum input rate. We track capacity dynamics as the average bit-width decreases and identify resulting quantization bottlenecks by casting fine-tuning as a smooth, constrained optimization problem. Our approach employs a fully differentiable Straight-Through Estimator (STE) with learnable bit-width, noise scale and clamp bounds, and enforces a target bit-width via an exterior-point penalty; mild metric smoothing (via distillation) stabilizes training. Despite its simplicity, the method attains competitive accuracy down to the extreme W1A1 setting while retaining the efficiency of STE.
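The core building block, a straight-through quantizer with learnable parameters, can be sketched as below; for brevity the bit-width is fixed and only the scale is learnable here, so this illustrates the STE mechanics rather than the paper's full GDNSQ scheme.

```python
import torch
import torch.nn as nn

class LearnableSTEQuantizer(nn.Module):
    """Uniform quantizer with a learnable scale and clamp range; round() is
    passed through with a straight-through estimator so gradients flow."""
    def __init__(self, bits=2, init_scale=0.1):
        super().__init__()
        self.levels = 2 ** bits - 1
        self.scale = nn.Parameter(torch.tensor(init_scale))

    def forward(self, x):
        s = self.scale.abs() + 1e-8
        x_clamped = torch.clamp(x / s, 0, self.levels)
        x_rounded = torch.round(x_clamped)
        # Straight-through: forward uses the rounded value, backward sees identity.
        x_q = x_clamped + (x_rounded - x_clamped).detach()
        return x_q * s

q = LearnableSTEQuantizer(bits=2)
x = torch.randn(8, requires_grad=True)
loss = q(x).pow(2).sum()
loss.backward()
print(q.scale.grad is not None, x.grad is not None)   # gradients reach scale and input
```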

[338] ASDFormer: A Transformer with Mixtures of Pooling-Classifier Experts for Robust Autism Diagnosis and Biomarker Discovery

Mohammad Izadi, Mehran Safayani

Main category: cs.LG

TL;DR: ASDFormer: Transformer-based model with Mixture of Experts for ASD diagnosis using fMRI connectivity patterns, achieving state-of-the-art accuracy and interpretable biomarker discovery.

DetailsMotivation: Autism Spectrum Disorder involves disrupted brain connectivity patterns that fMRI can capture. Current methods need better ways to identify ASD-related connectivity alterations within and between functional brain communities for improved diagnosis and biomarker discovery.

Method: ASDFormer architecture combines Transformer-based design with Mixture of Pooling-Classifier Experts (MoE). Uses attention mechanisms to adaptively emphasize different brain regions and connectivity patterns relevant to autism through multiple specialized expert branches.

Result: Achieves state-of-the-art diagnostic accuracy on the ABIDE dataset. Provides interpretable identification of disorder-related biomarkers and reveals robust insights into functional connectivity disruptions linked to ASD.

Conclusion: ASDFormer demonstrates strong potential as both a diagnostic tool and biomarker discovery platform for Autism Spectrum Disorder, effectively capturing neural signatures through its Transformer-MoE architecture.

Abstract: Autism Spectrum Disorder (ASD) is a complex neurodevelopmental condition marked by disruptions in brain connectivity. Functional MRI (fMRI) offers a non-invasive window into large-scale neural dynamics by measuring blood-oxygen-level-dependent (BOLD) signals across the brain. These signals can be modeled as interactions among Regions of Interest (ROIs), which are grouped into functional communities based on their underlying roles in brain function. Emerging evidence suggests that connectivity patterns within and between these communities are particularly sensitive to ASD-related alterations. Effectively capturing these patterns and identifying interactions that deviate from typical development is essential for improving ASD diagnosis and enabling biomarker discovery. In this work, we introduce ASDFormer, a Transformer-based architecture that incorporates a Mixture of Pooling-Classifier Experts (MoE) to capture neural signatures associated with ASD. By integrating multiple specialized expert branches with attention mechanisms, ASDFormer adaptively emphasizes different brain regions and connectivity patterns relevant to autism. This enables both improved classification performance and more interpretable identification of disorder-related biomarkers. Applied to the ABIDE dataset, ASDFormer achieves state-of-the-art diagnostic accuracy and reveals robust insights into functional connectivity disruptions linked to ASD, highlighting its potential as a tool for biomarker discovery.

[339] Typed Topological Structures Of Datasets

Wanjun Hu

Main category: cs.LG

TL;DR: Introduces typed topological spaces with special type sets for datasets, enabling structural analysis through tracks, components, and pseudotrees for applications like convex hull calculation and clustering.

DetailsMotivation: To provide a new topological perspective for analyzing finite datasets beyond statistical and algebraic topological methods, focusing on inner structural organization.

Method: Develops typed topological spaces where open sets have assigned types, creating a natural quotient space that organizes datasets into ordered tracks and components represented by integer sequences.

Result: Dataset structures can be represented as typed-II pseudotrees showing component relationships across tracks, enabling new algorithms for geometric and clustering problems.

Conclusion: Typed topology offers a powerful framework for structural analysis of datasets with applications in computational geometry and pattern recognition.

Abstract: A dataset $X$ on $R^2$ is a finite topological space. Current research on datasets focuses on statistical methods and the algebraic topological method [Carlsson]. In [Hu], the concept of a typed topological space was introduced and shown to have the potential for studying finite topological spaces, such as a dataset. It is a new method from the general topology perspective. A typed topological space is a topological space whose open sets are assigned types. Topological concepts and methods can be redefined using open sets of certain types. In this article, we develop a special set of types and its related typed topology on a dataset $X$. Using it, we can investigate the inner structure of $X$. In particular, $R^2$ has a natural quotient space, in which $X$ is organized into tracks, and each track is split into components. Those components are in an order. Further, they can be represented by an integer sequence. Components crossing tracks form branches, and the relationship can be well represented by a type of pseudotree (called a typed-II pseudotree). Such structures provide a platform for new algorithms for problems such as calculating convex hulls, holes, clustering and anomaly detection.

[340] Efficient Knowledge Graph Unlearning with Zeroth-order Information

Yang Xiao, Ruimeng Ye, Bohan Liu, Xiaolong Ma, Bo Hui

Main category: cs.LG

TL;DR: Efficient knowledge graph unlearning algorithm using Taylor expansion and Fisher matrices to approximate parameter changes without expensive derivative computations, outperforming state-of-the-art methods.

DetailsMotivation: Growing demand for machine unlearning due to regulations like Right to be Forgotten, with KG unlearning being particularly challenging due to KG structure and semantic relations between entities.

Method: Define influence function for KG unlearning, use Taylor expansion to estimate parameter changes, approximate inverse-Hessian vector product using Fisher matrices and zeroth-order optimization without constructing computational graphs.

Result: Outperforms other state-of-the-art graph unlearning baselines significantly in both unlearning efficiency and unlearning quality.

Conclusion: Proposed method provides an efficient and effective solution for knowledge graph unlearning that avoids expensive computational overhead while maintaining high unlearning quality.

Abstract: Due to regulations like the Right to be Forgotten, there is growing demand for removing training data and its influence from models. Since full retraining is costly, various machine unlearning methods have been proposed. In this paper, we firstly present an efficient knowledge graph (KG) unlearning algorithm. We remark that KG unlearning is nontrivial due to the distinctive structure of KG and the semantic relations between entities. Also, unlearning by estimating the influence of removed components incurs significant computational overhead when applied to large-scale knowledge graphs. To this end, we define an influence function for KG unlearning and propose to approximate the model’s sensitivity without expensive computation of first-order and second-order derivatives for parameter updates. Specifically, we use Taylor expansion to estimate the parameter changes caused by data removal. Given that the first-order gradients and second-order derivatives dominate the computational load, we use the Fisher matrices and zeroth-order optimization to approximate the inverse-Hessian vector product without constructing the computational graphs. Our experimental results demonstrate that the proposed method outperforms other state-of-the-art graph unlearning baselines significantly in terms of unlearning efficiency and unlearning quality. Our code is released at https://github.com/NKUShaw/ZOWFKGIF.

[341] BLIPs: Bayesian Learned Interatomic Potentials

Dario Coscia, Pim de Haan, Max Welling

Main category: cs.LG

TL;DR: BLIPs is a Bayesian framework for machine learning interatomic potentials that provides well-calibrated uncertainty estimates and improved accuracy, especially in data-scarce or out-of-distribution scenarios.

DetailsMotivation: Standard MLIPs struggle with out-of-distribution data, data-scarce regimes, and lack uncertainty estimates needed for active learning and ensuring simulation accuracy compared to quantum calculations.

Method: BLIP uses a scalable, architecture-agnostic variational Bayesian framework built on adaptive Variational Dropout, integrating seamlessly with equivariant message-passing architectures.

Result: Empirical results show improved predictive accuracy over standard MLIPs, trustworthy uncertainty estimates in data-scarce and out-of-distribution regimes, and consistent performance gains when fine-tuning pretrained models.

Conclusion: BLIP provides a practical solution for uncertainty-aware MLIPs with minimal computational overhead, enabling more reliable simulation-based chemistry with calibrated uncertainty estimates.

Abstract: Machine Learning Interatomic Potentials (MLIPs) are becoming a central tool in simulation-based chemistry. However, like most deep learning models, MLIPs struggle to make accurate predictions on out-of-distribution data or when trained in a data-scarce regime, both common scenarios in simulation-based chemistry. Moreover, MLIPs do not provide uncertainty estimates by construction, which are fundamental to guide active learning pipelines and to ensure the accuracy of simulation results compared to quantum calculations. To address this shortcoming, we propose BLIPs: Bayesian Learned Interatomic Potentials. BLIP is a scalable, architecture-agnostic variational Bayesian framework for training or fine-tuning MLIPs, built on an adaptive version of Variational Dropout. BLIP delivers well-calibrated uncertainty estimates and minimal computational overhead for energy and forces prediction at inference time, while integrating seamlessly with (equivariant) message-passing architectures. Empirical results on simulation-based computational chemistry tasks demonstrate improved predictive accuracy with respect to standard MLIPs, and trustworthy uncertainty estimates, especially in data-scarce or heavily out-of-distribution regimes. Moreover, fine-tuning pretrained MLIPs with BLIP yields consistent performance gains and calibrated uncertainties.

[342] Learning from Preferences and Mixed Demonstrations in General Settings

Jason R Brown, Carl Henrik Ek, Robert D Mullins

Main category: cs.LG

TL;DR: LEOPARD is a new algorithm that learns reward functions from both preference feedback and expert demonstrations, outperforming existing methods when limited feedback is available.

DetailsMotivation: Reinforcement learning often struggles with complex tasks where specifying good reward functions is difficult. Existing approaches using both preferences and demonstrations are often ad-hoc, domain-specific, or don't scale well.

Method: Developed reward-rational partial orderings over observations framework, then created LEOPARD algorithm that can learn from various data types including negative demonstrations to efficiently learn reward functions.

Result: LEOPARD significantly outperforms existing baselines when limited preference and demonstration feedback is available. Combining multiple feedback types proves beneficial compared to using just one type.

Conclusion: LEOPARD provides a flexible and scalable approach for learning from human data, demonstrating that combining different types of feedback (preferences and demonstrations) leads to better performance in reward learning.

Abstract: Reinforcement learning is a general method for learning in sequential settings, but it can often be difficult to specify a good reward function when the task is complex. In these cases, preference feedback or expert demonstrations can be used instead. However, existing approaches utilising both together are often ad-hoc, rely on domain-specific properties, or won’t scale. We develop a new framing for learning from human data, reward-rational partial orderings over observations, designed to be flexible and scalable. Based on this we introduce a practical algorithm, LEOPARD: Learning Estimated Objectives from Preferences And Ranked Demonstrations. LEOPARD can learn from a broad range of data, including negative demonstrations, to efficiently learn reward functions across a wide range of domains. We find that when a limited amount of preference and demonstration feedback is available, LEOPARD outperforms existing baselines by a significant margin. Furthermore, we use LEOPARD to investigate learning from many types of feedback compared to just a single one, and find that combining feedback types is often beneficial.
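LEOPARD's framing generalizes beyond this, but the standard Bradley-Terry preference loss it builds on is easy to sketch: a reward network scores trajectories, and preference pairs are fit so the preferred trajectory receives the higher return. The dimensions and training details below are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# A tiny reward model trained on trajectory preference pairs with the
# Bradley-Terry / logistic loss commonly used in preference-based RL.
reward_net = nn.Sequential(nn.Linear(6, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(reward_net.parameters(), lr=1e-3)

def trajectory_return(traj):
    # traj: (T, obs_dim); sum of per-step predicted rewards.
    return reward_net(traj).sum()

preferred = torch.randn(10, 20, 6)   # 10 preferred trajectories of length 20 (toy data)
rejected = torch.randn(10, 20, 6)    # 10 dispreferred trajectories

for _ in range(100):
    r_pref = torch.stack([trajectory_return(t) for t in preferred])
    r_rej = torch.stack([trajectory_return(t) for t in rejected])
    # P(preferred beats rejected) modeled as a sigmoid of the return gap.
    loss = -F.logsigmoid(r_pref - r_rej).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
print("final preference loss:", float(loss))
```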

[343] Mask and Restore: Blind Backdoor Defense at Test Time with Masked Autoencoder

Tao Sun, Lu Pang, Weimin Lyu, Chao Chen, Haibin Ling

Main category: cs.LG

TL;DR: BDMAE is a blind backdoor defense method that uses Masked AutoEncoder to purify test images by detecting and removing local triggers while preserving semantic content, without requiring model access or validation data.

DetailsMotivation: Existing backdoor defense methods require access to validation data and model parameters, which is impractical for cloud-based models. There's a need for blind defense that works at test time without these requirements.

Method: Leverages Masked AutoEncoder’s reconstruction power to detect local triggers using structural similarity and label consistency between test images and MAE restorations. Refines detection with trigger topology and adaptively fuses restorations into purified images.

Result: Extensive experiments show BDMAE is effective and generalizable across different backdoor settings, successfully defending against local attacks on black-box models.

Conclusion: BDMAE provides a practical solution for blind backdoor defense at test time, overcoming limitations of existing methods by using generative models for trigger detection and purification without requiring model access.

Abstract: Deep neural networks are vulnerable to backdoor attacks, where an adversary manipulates the model behavior through overlaying images with special triggers. Existing backdoor defense methods often require accessing a few validation data and model parameters, which is impractical in many real-world applications, e.g., when the model is provided as a cloud service. In this paper, we address the practical task of blind backdoor defense at test time, in particular for local attacks and black-box models. The true label of every test image needs to be recovered on the fly from a suspicious model regardless of image benignity. We consider test-time image purification that incapacitates local triggers while keeping semantic contents intact. Due to diverse trigger patterns and sizes, the heuristic trigger search can be unscalable. We circumvent this barrier by leveraging the strong reconstruction power of generative models, and propose Blind Defense with Masked AutoEncoder (BDMAE). BDMAE detects possible local triggers using image structural similarity and label consistency between the test image and MAE restorations. The detection results are then refined by considering trigger topology. Finally, we fuse MAE restorations adaptively into a purified image for making predictions. Extensive experiments under different backdoor settings validate its effectiveness and generalizability.

[344] Disentangled Representation Learning with the Gromov-Monge Gap

Théo Uscidda, Luca Eyring, Karsten Roth, Fabian Theis, Zeynep Akata, Marco Cuturi

Main category: cs.LG

TL;DR: Novel disentangled representation learning approach using quadratic optimal transport and Gromov-Monge maps to preserve geometric features while matching prior distributions.

DetailsMotivation: Learning disentangled representations from unlabelled data is fundamental but challenging. Solving it could enable better generalization, interpretability, and fairness. Prior matching approaches work in practice but struggle to preserve geometric features like distances and angles.

Method: Proposes Gromov-Monge maps that transport distributions with minimal geometric distortion. Introduces Gromov-Monge-Gap (GMG) regularizer to quantify geometry preservation. Uses quadratic optimal transport framework.

Result: Outperforms other geometry-preserving methods across four standard benchmarks for disentanglement tasks.

Conclusion: The approach effectively addresses the challenge of matching priors while preserving geometric features through optimal transport theory, demonstrating superior disentanglement performance.

Abstract: Learning disentangled representations from unlabelled data is a fundamental challenge in machine learning. Solving it may unlock other problems, such as generalization, interpretability, or fairness. Although remarkably challenging to solve in theory, disentanglement is often achieved in practice through prior matching. Furthermore, recent works have shown that prior matching approaches can be enhanced by leveraging geometrical considerations, e.g., by learning representations that preserve geometric features of the data, such as distances or angles between points. However, matching the prior while preserving geometric features is challenging, as a mapping that fully preserves these features while aligning the data distribution with the prior does not exist in general. To address these challenges, we introduce a novel approach to disentangled representation learning based on quadratic optimal transport. We formulate the problem using Gromov-Monge maps that transport one distribution onto another with minimal distortion of predefined geometric features, preserving them as much as can be achieved. To compute such maps, we propose the Gromov-Monge-Gap (GMG), a regularizer quantifying whether a map moves a reference distribution with minimal geometry distortion. We demonstrate the effectiveness of our approach for disentanglement across four standard benchmarks, outperforming other methods leveraging geometric considerations.
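The actual GMG is defined through Gromov-Monge optimal transport; a much cruder relative, shown below, simply penalizes mismatch between pairwise distances in data space and in the learned representation, which conveys the "minimal geometric distortion" intent as an add-on regularizer. The encoder and batch sizes are placeholders.

```python
import torch

def pairwise_sq_dists(x):
    # Smooth (no sqrt) pairwise squared Euclidean distances, shape (n, n).
    return ((x.unsqueeze(1) - x.unsqueeze(0)) ** 2).sum(dim=-1)

def geometry_distortion(x, z):
    """Mean squared mismatch between pairwise (squared) distances in data space
    and in the representation: a simple distance-preservation penalty in the
    spirit of, but much cruder than, the Gromov-Monge Gap."""
    return ((pairwise_sq_dists(x) - pairwise_sq_dists(z)) ** 2).mean()

x = torch.randn(64, 10)                    # data batch
encoder = torch.nn.Linear(10, 3)           # stand-in for an encoder network
penalty = geometry_distortion(x, encoder(x))
penalty.backward()                         # add to the usual prior-matching objective
print(float(penalty))
```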

[345] Correlations Are Ruining Your Gradient Descent

Nasir Ahmad

Main category: cs.LG

TL;DR: This paper connects natural gradient descent, data decorrelation, and backpropagation approximations, showing that decorrelating inputs at each neural network layer addresses fundamental issues identified by natural gradients and significantly improves training speed and approximation methods.

DetailsMotivation: To address the problem illuminated by natural gradient descent - that data correlations cause non-orthonormal parameter relationships in neural networks, and to improve both standard backpropagation and previously failed approximation methods.

Method: Proposes decorrelation and whitening methods for node outputs at each layer of neural networks, including a novel method specifically designed for distributed computing and computational neuroscience applications.

Result: Implementation shows a significant speedup in backpropagation training, along with dramatic improvements in accuracy and convergence speed for backpropagation approximations that previously failed catastrophically.

Conclusion: Decorrelating inputs at each layer provides a viable path forward for previously discarded gradient descent approximations, enables training on analogue/neuromorphic hardware, and offers insights into brain decorrelation processes.

Abstract: Herein the topics of (natural) gradient descent, data decorrelation, and approximate methods for backpropagation are brought into a common discussion. Natural gradient descent illuminates how gradient vectors, pointing at directions of steepest descent, can be improved by considering the local curvature of loss landscapes. We extend this perspective and show that to fully solve the problem illuminated by natural gradients in neural networks, one must recognise that correlations in the data at any linear transformation, including node responses at every layer of a neural network, cause a non-orthonormal relationship between the model’s parameters. To solve this requires a method for decorrelating inputs at each individual layer of a neural network. We describe a range of methods which have been proposed for decorrelation and whitening of node output, and expand on these to provide a novel method specifically useful for distributed computing and computational neuroscience. Implementing decorrelation within multi-layer neural networks, we can show that not only is training via backpropagation sped up significantly but also existing approximations of backpropagation, which have failed catastrophically in the past, benefit significantly in their accuracy and convergence speed. This has the potential to provide a route forward for approximate gradient descent methods which have previously been discarded, training approaches for analogue and neuromorphic hardware, and potentially insights as to the efficacy and utility of decorrelation processes in the brain.
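One standard decorrelation choice the paper discusses is whitening of layer inputs; the ZCA sketch below decorrelates a batch of activations before they feed the next layer. The paper's own distributed-friendly method differs from this batch eigendecomposition, which is shown only to make the operation concrete.

```python
import numpy as np

def zca_whiten(x, eps=1e-5):
    """Decorrelate (and unit-variance) the columns of a batch of layer inputs."""
    x_centered = x - x.mean(axis=0, keepdims=True)
    cov = x_centered.T @ x_centered / len(x)
    eigvals, eigvecs = np.linalg.eigh(cov)
    w = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T   # ZCA matrix
    return x_centered @ w

rng = np.random.default_rng(0)
raw = rng.normal(size=(512, 16)) @ rng.normal(size=(16, 16))   # correlated activations
white = zca_whiten(raw)
print(np.round(np.cov(white.T), 2))   # ~identity: inputs to the next layer are decorrelated
```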

[346] FDR-SVM: A Federated Distributionally Robust Support Vector Machine via a Mixture of Wasserstein Balls Ambiguity Set

Michael Ibrahim, Heraldo Rozas, Nagi Gebraeel, Weijun Xie

Main category: cs.LG

TL;DR: Federated Distributionally Robust SVM for classification with data uncertainty and heterogeneity across clients, using novel Mixture of Wasserstein Balls ambiguity set with theoretical guarantees and efficient algorithms.

DetailsMotivation: Address federated classification with private client data subject to uncertainty in both features and labels, and handle data heterogeneity across clients where each has unique unknown true distribution.

Method: Develop Federated Distributionally Robust SVM (FDR-SVM) with Mixture of Wasserstein Balls (MoWB) ambiguity set to robustify against local data perturbations. Derive two algorithms with convergence analysis and time complexity.

Result: Established theoretical guarantees including out-of-sample performance bound and separability preservation. Algorithms outperform state-of-the-art approaches on industrial data and UCI datasets.

Conclusion: The proposed FDR-SVM with MoWB ambiguity set effectively handles federated classification with data uncertainty and heterogeneity, providing robust performance with theoretical guarantees and practical efficiency.

Abstract: We study a federated classification problem over a network of multiple clients and a central server, in which each client’s local data remains private and is subject to uncertainty in both the features and labels. To address these uncertainties, we develop a novel Federated Distributionally Robust Support Vector Machine (FDR-SVM), robustifying the classification boundary against perturbations in local data distributions. Specifically, the data at each client is governed by a unique true distribution that is unknown. To handle this heterogeneity, we develop a novel Mixture of Wasserstein Balls (MoWB) ambiguity set, naturally extending the classical Wasserstein ball to the federated setting. We then establish theoretical guarantees for our proposed MoWB, deriving an out-of-sample performance bound and showing that its design preserves the separability of the FDR-SVM optimization problem. Next, we rigorously derive two algorithms that solve the FDR-SVM problem and analyze their convergence behavior as well as their worst-case time complexity. We evaluate our algorithms on industrial data and various UCI datasets, whereby we demonstrate that they frequently outperform existing state-of-the-art approaches.

[347] SSD-TS: Exploring the Potential of Linear State Space Models for Diffusion Models in Time Series Imputation

Hongfan Gao, Wangmeng Shen, Xiangfei Qiu, Ronghui Xu, Jilin Hu, Bin Yang

Main category: cs.LG

TL;DR: SSD-TS: A novel probabilistic time series imputation method using Mamba state space models as denoising backbone in diffusion models, achieving SOTA results with improved efficiency and dependency handling.

DetailsMotivation: Current DDPM-based time series imputation methods suffer from high time complexity in sequence modeling and ineffective handling of time series dependencies, limiting their practical application.

Method: Proposes using Mamba state space model as the denoising backbone for DDPMs, with carefully designed SSM-based blocks specifically optimized for time series data modeling to capture dependencies effectively.

Result: Achieves state-of-the-art time series imputation results on multiple real-world datasets, demonstrating superior performance compared to existing methods.

Conclusion: The SSD-TS framework successfully addresses the limitations of current DDPM approaches by leveraging Mamba’s efficient sequence modeling capabilities and specialized SSM blocks, providing an effective solution for probabilistic time series imputation with uncertainty estimation.

Abstract: Probabilistic time series imputation has been widely applied in real-world scenarios due to its ability for uncertainty estimation, and denoising diffusion probabilistic models (DDPMs) have achieved great success in probabilistic time series imputation tasks thanks to their power to model complex distributions. However, current DDPM-based probabilistic time series imputation methodologies are confronted with two types of challenges: 1) the backbone modules of the denoising parts are not capable of achieving sequence modeling with low time complexity; 2) the architecture of the denoising modules cannot handle the dependencies in the time series data effectively. To address the first challenge, we explore the potential of a state space model, namely Mamba, as the backbone denoising module for DDPMs. To tackle the second challenge, we carefully devise several SSM-based blocks for time series data modeling. Experimental results demonstrate that our approach can achieve state-of-the-art time series imputation results on multiple real-world datasets. Our datasets and code are available at https://github.com/decisionintelligence/SSD-TS/
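For orientation, a hedged sketch of one DDPM training step for masked time-series imputation. The paper's denoiser is built from Mamba/SSM blocks; here a toy GRU stands in for that backbone, and the masking and conditioning scheme are illustrative assumptions.

```python
# Minimal sketch (not the paper's code): one denoising-diffusion training step
# for time-series imputation, with a placeholder backbone where the paper uses
# Mamba/SSM blocks.
import torch
import torch.nn as nn

T_steps, L, C = 100, 48, 4                       # diffusion steps, seq length, channels
betas = torch.linspace(1e-4, 0.02, T_steps)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

class ToyDenoiser(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.rnn = nn.GRU(2 * c + 1, 64, batch_first=True)   # noisy + observed + step feature
        self.out = nn.Linear(64, c)
    def forward(self, x_noisy, x_obs, mask, t):
        t_feat = (t.float() / T_steps).view(-1, 1, 1).expand(-1, x_noisy.size(1), 1)
        h, _ = self.rnn(torch.cat([x_noisy, x_obs * mask, t_feat], dim=-1))
        return self.out(h)                                    # predicted noise

model = ToyDenoiser(C)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x0 = torch.randn(8, L, C)                         # toy ground-truth series
mask = (torch.rand(8, L, C) > 0.3).float()        # 1 = observed, 0 = missing

t = torch.randint(0, T_steps, (8,))
noise = torch.randn_like(x0)
a_bar = alphas_bar[t].view(-1, 1, 1)
x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise          # forward diffusion

pred = model(x_t, x0, mask, t)
loss = (((pred - noise) ** 2) * (1 - mask)).sum() / (1 - mask).sum().clamp(min=1)
loss.backward(); opt.step(); opt.zero_grad()
print(float(loss))
```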

[348] A Causal Graph-Enhanced Gaussian Process Regression for Modeling Engine-out NOx

Shrenik Zinage, Ilias Bilionis, Peter Meckl

Main category: cs.LG

TL;DR: Probabilistic Gaussian process models with deep kernels and causal graphs outperform traditional deterministic approaches for NOx emission prediction in diesel engines.

DetailsMotivation: Stringent NOx emission regulations require accurate real-time monitoring, but existing deterministic models lack uncertainty quantification and robustness for diagnostics.

Method: Three Gaussian process regression variants: standard RBF kernel with input window, deep kernel with CNN for temporal dependencies, and deep kernel enhanced with causal graph from graph convolutional networks.

Result: Models show improved predictive performance over virtual ECM sensors, with the causal graph-enhanced deep kernel providing the most significant enhancement.

Conclusion: Probabilistic frameworks with deep learning and physics-informed causal structures offer superior NOx emission prediction capabilities for robust engine diagnostics.

Abstract: The stringent regulatory requirements on nitrogen oxides (NOx) emissions from diesel compression ignition engines require accurate and reliable models for real-time monitoring and diagnostics. Although traditional methods such as physical sensors and virtual engine control module (ECM) sensors provide essential data, they are only used for estimation. Ubiquitous literature primarily focuses on deterministic models with little emphasis on capturing the various uncertainties. The lack of probabilistic frameworks restricts the applicability of these models for robust diagnostics. The objective of this paper is to develop and validate a probabilistic model to predict engine-out NOx emissions using Gaussian process regression. Our approach is as follows. We employ three variants of Gaussian process models: the first with a standard radial basis function kernel with input window, the second incorporating a deep kernel using convolutional neural networks to capture temporal dependencies, and the third enriching the deep kernel with a causal graph derived via graph convolutional networks. The causal graph embeds physics knowledge into the learning process. All models are compared against a virtual ECM sensor using both quantitative and qualitative metrics. We conclude that our model provides an improvement in predictive performance when using an input window and a deep kernel structure. Even more compelling is the further enhancement achieved by the incorporation of a causal graph into the deep kernel. These findings are corroborated across different verification and validation datasets.
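A hedged sketch of the simplest of the three variants above: a Gaussian process with an RBF kernel over a sliding window of inputs. The deep-kernel and causal-graph variants are not reproduced; the toy "engine" features, the window length, and the synthetic NOx target are assumptions for illustration.

```python
# Minimal sketch: windowed-input GP regression with predictive uncertainty.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
n, window = 500, 5

# Toy stand-ins for engine inputs (e.g. fuel rate, EGR fraction) and NOx output.
u = rng.normal(size=(n, 2))
nox = 0.8 * u[:, 0] - 0.5 * u[:, 1] + 0.2 * np.roll(u[:, 0], 1) + 0.05 * rng.normal(size=n)

# Build windowed features: each row stacks the last `window` input samples.
X = np.stack([u[i - window:i].ravel() for i in range(window, n)])
y = nox[window:]

split = int(0.8 * len(X))
gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gp.fit(X[:split], y[:split])

mean, std = gp.predict(X[split:], return_std=True)   # predictive mean + uncertainty
print("RMSE:", float(np.sqrt(np.mean((mean - y[split:]) ** 2))))
```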

[349] Rethinking Weight-Averaged Model-merging

Hu Wang, Congbo Ma, Ibrahim Almakky, Ian Reid, Gustavo Carneiro, Mohammad Yaqub

Main category: cs.LG

TL;DR: This paper provides interpretability analysis of weight-averaged model merging, examining why it works through weight structure analysis, space comparison (weight vs feature), and parameter scaling effects.

DetailsMotivation: Model merging through weight averaging has shown effectiveness but lacks clear interpretability - the authors want to understand why and how this technique works to improve transparency and reliability.

Method: Three-pronged approach: 1) Analyze learned weight structures to understand compatibility, 2) Compare averaging in weight space vs feature space across CNNs and ViTs on diverse datasets, 3) Study parameter scaling effects on prediction stability.

Result: The analysis reveals that model weights encode structured representations enabling compatibility, identifies circumstances where different combination paradigms work effectively, and shows weight averaging acts as regularization for robustness.

Conclusion: Framing model merging through interpretability provides systematic understanding for safer and more reliable untrained model combination methods, contributing to transparency in the field.

Abstract: Model merging, particularly through weight averaging, has shown surprising effectiveness in saving computations and improving model performance without any additional training. However, the interpretability of why and how this technique works remains unclear. In this work, we reinterpret weight-averaged model merging through the lens of interpretability and provide empirical insights into the underlying mechanisms that govern its behavior. We approach the problem from three perspectives: (1) we analyze the learned weight structures and demonstrate that model weights encode structured representations that help explain the compatibility of weight averaging; (2) we compare averaging in weight space and feature space across diverse model architectures (CNNs and ViTs) and datasets, aiming to expose under which circumstances what combination paradigm will work more effectively; (3) we study the effect of parameter scaling on prediction stability, highlighting how weight averaging acts as a form of regularization that contributes to robustness. By framing these analyses in an interpretability context, our work contributes to a more transparent and systematic understanding of model merging for stakeholders interested in the safety and reliability of untrained model combination methods. The code is available at https://github.com/billhhh/Rethink-Merge.
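For readers unfamiliar with the mechanics, a minimal sketch of weight-space averaging of two models that share an architecture. The interpretability analyses from the paper are not reproduced; `model_a` and `model_b` are illustrative placeholders rather than actual fine-tuned checkpoints.

```python
# Minimal sketch: parameter-wise weighted average of two models' state dicts.
import copy
import torch
import torch.nn as nn

def average_state_dicts(models, weights=None):
    """Return a parameter-wise weighted average of the models' state dicts."""
    weights = weights or [1.0 / len(models)] * len(models)
    avg = copy.deepcopy(models[0].state_dict())
    for key in avg:
        avg[key] = sum(w * m.state_dict()[key] for w, m in zip(weights, models))
    return avg

model_a = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
model_b = copy.deepcopy(model_a)
for p in model_b.parameters():                # pretend model_b was fine-tuned elsewhere
    p.data += 0.01 * torch.randn_like(p)

merged = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
merged.load_state_dict(average_state_dicts([model_a, model_b]))
print(merged(torch.randn(2, 16)).shape)       # torch.Size([2, 4])
```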

[350] Understanding and Mitigating Memorization in Generative Models via Sharpness of Probability Landscapes

Dongjae Jeon, Dueun Kim, Albert No

Main category: cs.LG

TL;DR: Geometric framework analyzes diffusion model memorization through log probability density sharpness, validates existing metric, proposes new early-stage metric, and develops mitigation strategy with sharpness-aware regularization.

DetailsMotivation: To mathematically understand and quantify memorization in diffusion models through geometric analysis of probability density sharpness, enabling early detection and mitigation of memorization issues.

Method: Developed geometric framework analyzing log probability density sharpness, validated score-difference-based metric, proposed novel metric for early-stage detection in latent diffusion models, and created mitigation strategy using sharpness-aware regularization on initial noise optimization.

Result: Mathematically justified existing memorization metric, demonstrated effectiveness in quantifying sharpness, and developed working mitigation approach that optimizes generation process to reduce memorization.

Conclusion: The geometric framework provides rigorous mathematical foundation for memorization analysis in diffusion models, with practical metrics and mitigation strategies that can detect and reduce memorization early in the generation process.

Abstract: In this paper, we introduce a geometric framework to analyze memorization in diffusion models through the sharpness of the log probability density. We mathematically justify a previously proposed score-difference-based memorization metric by demonstrating its effectiveness in quantifying sharpness. Additionally, we propose a novel memorization metric that captures sharpness at the initial stage of image generation in latent diffusion models, offering early insights into potential memorization. Leveraging this metric, we develop a mitigation strategy that optimizes the initial noise of the generation process using a sharpness-aware regularization term. The code is publicly available at https://github.com/Dongjae0324/sharpness_memorization_diffusion.

[351] DDD-GenDT: Dynamic Data-driven Generative Digital Twin Framework

Yu-Zheng Lin, Qinxuan Shi, Zhanglong Yang, Banafsheh Saber Latibari, Shalaka Satam, Sicong Shao, Soheil Salehi, Pratik Satam

Main category: cs.LG

TL;DR: DDD-GenDT is a dynamic data-driven generative digital twin framework that uses LLM ensembles for zero-shot predictive inference, reducing data requirements while maintaining accuracy and adaptability to system aging.

DetailsMotivation: Digital twin technology faces challenges with high data requirements, proprietary data constraints, and limited adaptability to evolving conditions in physical systems.

Method: The framework includes Physical Twin Observation Graph for operational states, Observation Window Extraction for temporal sequences, Data Preprocessing Pipeline for sensor structuring, and LLM ensemble for zero-shot predictive inference using generative AI.

Result: In zero-shot testing with NASA CNC milling dataset, GPT-4-based DT achieved average RMSE of 0.479A (4.79% of 10A spindle current), accurately modeling nonlinear dynamics and system aging without retraining.

Conclusion: DDD-GenDT provides a generalizable, data-efficient, and adaptive digital twin modeling approach that bridges generative AI with industrial performance and reliability requirements.

Abstract: Digital twin (DT) technology enables real-time simulation, prediction, and optimization of physical systems, but practical deployment faces challenges from high data requirements, proprietary data constraints, and limited adaptability to evolving conditions. This work introduces DDD-GenDT, a dynamic data-driven generative digital twin framework grounded in the Dynamic Data-Driven Application Systems (DDDAS) paradigm. The architecture comprises the Physical Twin Observation Graph (PTOG) to represent operational states, an Observation Window Extraction process to capture temporal sequences, a Data Preprocessing Pipeline for sensor structuring and filtering, and an LLM ensemble for zero-shot predictive inference. By leveraging generative AI, DDD-GenDT reduces reliance on extensive historical datasets, enabling DT construction in data-scarce settings while maintaining industrial data privacy. The DDDAS feedback mechanism allows the DT to autonomically adapt predictions to physical twin (PT) wear and degradation, supporting DT-aging, which ensures progressive synchronization of DT with PT evolution. The framework is validated using the NASA CNC milling dataset, with spindle current as the monitored variable. In a zero-shot setting, the GPT-4-based DT achieves an average RMSE of 0.479 A (4.79% of the 10 A spindle current), accurately modeling nonlinear process dynamics and PT aging without retraining. These results show that DDD-GenDT provides a generalizable, data-efficient, and adaptive DT modeling approach, bridging generative AI with the performance and reliability requirements of industrial DT applications.

[352] High-Order Tensor Regression in Sparse Convolutional Neural Networks

Roberto Dias Algarte

Main category: cs.LG

TL;DR: A novel tensor-based convolution approach that redefines backpropagation for sparse CNNs

DetailsMotivation: To develop a more mathematically clear and concise convolution methodology, especially for high-order tensors, and create a rational framework for sparse convolutional neural networks

Method: Generic tensor-based convolution approach that differs from conventional ML methods, developing a rational theory of regression in neural networks as a framework for sparse CNNs

Result: The approach proved mathematically clear and concise for high-order tensors, and enabled redefinition of the classic Backpropagation Algorithm into its simplest, most generic form

Conclusion: The study presents a significant departure from conventional convolution methodologies, offering a rational tensor-based framework that simplifies backpropagation and provides a generic view of sparse convolutional neural networks

Abstract: This article presents a generic approach to convolution that significantly differs from conventional methodologies in the current Machine Learning literature. The approach, in its mathematical aspects, proved to be clear and concise, particularly when high-order tensors are involved. In this context, a rational theory of regression in neural networks is developed, as a framework for a generic view of sparse convolutional neural networks, the primary focus of this study. As a direct outcome, the classic Backpropagation Algorithm is redefined to align with this rational tensor-based approach and presented in its simplest, most generic form.

[353] Environmental Feature Engineering and Statistical Validation for ML-Based Path Loss Prediction

Jonathan Ethier, Mathieu Chateauvert, Ryan G. Dempsey, Alexis Bose

Main category: cs.LG

TL;DR: Machine learning-based path loss modeling using extended geographic features improves prediction accuracy and demonstrates model generalization through statistical validation.

DetailsMotivation: Traditional path loss modeling lacks physical environmental details, but increasingly available high-resolution geographic data enables more accurate wireless coverage and interference predictions.

Method: Extended feature set for machine learning-based propagation modeling, with rigorous statistical assessment and test set holdouts to prove generalization.

Result: Improved prediction accuracy in path loss modeling while demonstrating strong model generalization capabilities.

Conclusion: Feature-based machine learning approaches with extended geographic features enable accurate, efficient, and scalable wireless propagation modeling with proven generalization.

Abstract: Wireless communications rely on path loss modeling, which is most effective when it includes the physical details of the propagation environment. Acquiring this data has historically been challenging, but geographic information systems data is becoming increasingly available with higher resolution and accuracy. Access to such details enables propagation models to more accurately predict coverage and account for interference in wireless deployments. Machine learning-based modeling can significantly support this effort, with feature based approaches allowing for accurate, efficient, and scalable propagation modeling. Building on previous work, we introduce an extended set of features that improves prediction accuracy while, most importantly, proving model generalization through rigorous statistical assessment and the use of test set holdouts.
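A hedged sketch of the general workflow the abstract describes: feature-based path loss regression evaluated on a held-out test set. The feature names, the synthetic path loss formula, and the gradient-boosting learner are assumptions for illustration, not the paper's feature set or model.

```python
# Minimal sketch: ML path loss prediction with environmental features + holdout.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000

# Toy environmental features: link distance (m), obstruction depth (m),
# clutter height (m), terrain variation (m).
dist = rng.uniform(50, 5000, n)
obstruction = rng.uniform(0, 200, n)
clutter = rng.uniform(0, 30, n)
terrain = rng.uniform(0, 50, n)
X = np.column_stack([dist, obstruction, clutter, terrain])

# Synthetic path loss: log-distance trend plus environment terms and noise (dB).
y = 32.4 + 20 * np.log10(dist) + 0.15 * obstruction + 0.3 * clutter + rng.normal(0, 3, n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
model = GradientBoostingRegressor().fit(X_tr, y_tr)
rmse = np.sqrt(np.mean((model.predict(X_te) - y_te) ** 2))
print(f"held-out RMSE: {rmse:.2f} dB")
```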

[354] Closed-Form Feedback-Free Learning with Forward Projection

Robert O’Shea, Bipin Rajendran

Main category: cs.LG

TL;DR: Forward Projection (FP) is a novel backpropagation-free training method that uses single forward pass without retrograde communication, achieving comparable performance to gradient descent methods with significant speedup and improved interpretability.

DetailsMotivation: To address the restrictive setting where retrograde communication from neuronal outputs is unavailable for pre-synaptic weight optimisation, enabling training without feedback mechanisms.

Method: Layer-wise nonlinear projections of pre-synaptic inputs and labels generate target values for pre-activation membrane potentials. Local loss functions are optimised using closed-form regression without feedback from neuronal outputs or downstream layers.

Result: FP demonstrated effectiveness across four biomedical datasets, yielding more generalisable models in few-shot learning and comparable generalization in large-sample tasks with significant speedup. Interpretation functions successfully identified clinically salient features.

Conclusion: Forward Projection is a computationally efficient approach that yields interpretable neural network models without retrograde communication during training, offering advantages in speed, generalization, and clinical interpretability.

Abstract: State-of-the-art methods for backpropagation-free learning employ local error feedback to direct iterative optimisation via gradient descent. In this study, we examine the more restrictive setting where retrograde communication from neuronal outputs is unavailable for pre-synaptic weight optimisation. To address this challenge, we propose Forward Projection (FP). This novel randomised closed-form training method requires only a single forward pass over the entire dataset for model fitting, without retrograde communication. Target values for pre-activation membrane potentials are generated layer-wise via nonlinear projections of pre-synaptic inputs and the labels. Local loss functions are optimised over pre-synaptic inputs using closed-form regression, without feedback from neuronal outputs or downstream layers. Interpretability is a key advantage of FP training; membrane potentials of hidden neurons in FP-trained networks encode information which is interpretable layer-wise as label predictions. We demonstrate the effectiveness of FP across four biomedical datasets. In few-shot learning tasks, FP yielded more generalisable models than those optimised via backpropagation. In large-sample tasks, FP-based models achieve generalisation comparable to gradient descent-based local learning methods while requiring only a single forward propagation step, achieving significant speed up for training. Interpretation functions defined on local neuronal activity in FP-based models successfully identified clinically salient features for diagnosis in two biomedical datasets. Forward Projection is a computationally efficient machine learning approach that yields interpretable neural network models without retrograde communication of neuronal activity during training.
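A hedged, minimal sketch of the feedback-free idea: layer targets come from fixed nonlinear random projections of the layer inputs and the labels, and each weight matrix is fit by closed-form ridge regression, with no backward pass. The target construction and projection choices below are illustrative assumptions rather than the paper's exact recipe.

```python
# Minimal sketch: layer-wise closed-form fitting without retrograde signals.
import numpy as np

rng = np.random.default_rng(0)

def ridge(X, T, lam=1e-2):
    """Closed-form ridge solution W minimising ||X W - T||^2 + lam ||W||^2."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ T)

X = rng.normal(size=(1000, 20))                          # toy inputs
y = (X[:, 0] + X[:, 1] > 0).astype(float)                # toy binary labels
Y = np.column_stack([y, 1 - y])                          # one-hot labels

widths, H, weights = [32, 16], X, []
for width in widths:
    R_in = rng.normal(size=(H.shape[1], width))          # fixed random projections
    R_lab = rng.normal(size=(Y.shape[1], width))
    targets = np.tanh(H @ R_in + Y @ R_lab)              # layer-wise targets
    W = ridge(H, targets)                                 # closed-form local fit
    weights.append(W)
    H = np.tanh(H @ W)                                    # single forward pass to next layer

W_out = ridge(H, Y)                                       # closed-form readout
acc = np.mean((H @ W_out).argmax(1) == Y.argmax(1))
print(f"train accuracy: {acc:.2f}")
```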

[355] Joint Learning of Energy-based Models and their Partition Function

Michael E. Sander, Vincent Roulet, Tianlin Liu, Mathieu Blondel

Main category: cs.LG

TL;DR: A novel method for learning energy-based models in discrete spaces by jointly learning energy and log-partition functions as neural networks, providing tractable optimization without MCMC sampling.

DetailsMotivation: Learning EBMs by exact maximum likelihood estimation is intractable due to partition function computation, especially in combinatorially-large discrete spaces like sets or permutations.

Method: Jointly learn both an energy model and its log-partition function as neural networks, enabling tractable optimization via stochastic gradient descent without MCMC sampling.

Result: The approach recovers optimal MLE solution in continuous function space, extends to Fenchel-Young losses, and enables tractable sparsemax optimization in large combinatorial spaces.

Conclusion: The proposed method provides an effective framework for learning EBMs in discrete spaces with theoretical guarantees and practical applications in multilabel classification and label ranking.

Abstract: Energy-based models (EBMs) offer a flexible framework for parameterizing probability distributions using neural networks. However, learning EBMs by exact maximum likelihood estimation (MLE) is generally intractable, due to the need to compute the partition function (normalization constant). In this paper, we propose a novel formulation for approximately learning probabilistic EBMs in combinatorially-large discrete spaces, such as sets or permutations. Our key idea is to jointly learn both an energy model and its log-partition, both parameterized as a neural network. Our approach not only provides a novel tractable objective criterion to learn EBMs by stochastic gradient descent (without relying on MCMC), but also a novel means to estimate the log-partition function on unseen data points. On the theoretical side, we show that our approach recovers the optimal MLE solution when optimizing in the space of continuous functions. Furthermore, we show that our approach naturally extends to the broader family of Fenchel-Young losses, allowing us to obtain the first tractable method for optimizing the sparsemax loss in combinatorially-large spaces. We demonstrate our approach on multilabel classification and label ranking.

[356] Augmented Adversarial Trigger Learning

Zhe Wang, Yanjun Qi

Main category: cs.LG

TL;DR: ATLA improves adversarial trigger learning by using weighted loss to focus on response format tokens and suppress evasive responses, achieving near 100% success with 80% fewer queries.

DetailsMotivation: Previous gradient optimization-based adversarial attack methods use negative log-likelihood loss, which may not effectively optimize towards response format tokens or handle evasive responses.

Method: Proposes ATLA with weighted loss formulation that emphasizes response format tokens and includes auxiliary loss to suppress evasive responses. Learns from single query-response pair.

Result: Achieves nearly 100% attack success rate, requires 80% fewer queries, demonstrates high generalization to unseen queries and transferability to new LLMs.

Conclusion: ATLA significantly outperforms state-of-the-art techniques in jailbreaking LLMs and extracting hidden system prompts through improved optimization objectives.

Abstract: Gradient optimization-based adversarial attack methods automate the learning of adversarial triggers to generate jailbreak prompts or leak system prompts. In this work, we take a closer look at the optimization objective of adversarial trigger learning and propose ATLA: Adversarial Trigger Learning with Augmented objectives. ATLA improves the negative log-likelihood loss used by previous studies into a weighted loss formulation that encourages the learned adversarial triggers to optimize more towards response format tokens. This enables ATLA to learn an adversarial trigger from just one query-response pair and the learned trigger generalizes well to other similar queries. We further design a variation to augment trigger optimization with an auxiliary loss that suppresses evasive responses. We showcase how to use ATLA to learn adversarial suffixes jailbreaking LLMs and to extract hidden system prompts. Empirically we demonstrate that ATLA consistently outperforms current state-of-the-art techniques, achieving nearly 100% success in attacking while requiring 80% fewer queries. ATLA learned jailbreak suffixes demonstrate high generalization to unseen queries and transfer well to new LLMs. We released our code https://github.com/QData/ALTA_Augmented_Adversarial_Trigger_Learning
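A hedged sketch of the loss shaping described above: a per-token weighted negative log-likelihood that up-weights response-format tokens, plus an auxiliary term that penalises probability mass on evasive tokens. The specific weights, masks, and token sets are illustrative assumptions, not the paper's values.

```python
# Minimal sketch of an ATLA-style weighted objective (illustrative only).
import torch
import torch.nn.functional as F

def atla_style_loss(logits, target_ids, format_mask, evasive_ids,
                    format_weight=3.0, aux_weight=1.0):
    """logits: (T, V); target_ids, format_mask: (T,); evasive_ids: (K,)."""
    log_probs = F.log_softmax(logits, dim=-1)
    nll = -log_probs[torch.arange(len(target_ids)), target_ids]     # per-token NLL
    weights = torch.where(format_mask.bool(),
                          torch.full_like(nll, format_weight),
                          torch.ones_like(nll))
    weighted_nll = (weights * nll).mean()
    # Auxiliary term: suppress probability mass on evasive/refusal tokens.
    evasive_penalty = log_probs[:, evasive_ids].exp().sum(dim=-1).mean()
    return weighted_nll + aux_weight * evasive_penalty

# Toy usage with random logits over a 50-token vocabulary.
T, V = 12, 50
logits = torch.randn(T, V, requires_grad=True)
target_ids = torch.randint(0, V, (T,))
format_mask = torch.zeros(T); format_mask[:4] = 1       # first tokens = response format
loss = atla_style_loss(logits, target_ids, format_mask, evasive_ids=torch.tensor([7, 8]))
loss.backward()
print(float(loss))
```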

[357] Enhancing Cost Efficiency in Active Learning with Candidate Set Query

Yeho Gwon, Sehyun Hwang, Hoyoung Kim, Jungseul Ok, Suha Kwak

Main category: cs.LG

TL;DR: A cost-efficient active learning framework using candidate set queries that reduces labeling costs by 48% on ImageNet64x64 by narrowing down possible classes for oracle examination.

DetailsMotivation: Traditional active learning requires oracles to examine all possible classes, which is expensive and inefficient. The goal is to reduce labeling costs while maintaining model performance.

Method: Uses candidate set queries that narrow down likely ground-truth classes, leverages conformal prediction for reliable candidate sets, and employs an acquisition function prioritizing high information gain at lower cost.

Result: Empirical evaluations on CIFAR-10, CIFAR-100, and ImageNet64x64 show effectiveness and scalability, with 48% reduction in labeling cost on ImageNet64x64.

Conclusion: The proposed framework significantly reduces labeling costs in active learning while maintaining performance through efficient candidate set queries and adaptive conformal prediction.

Abstract: This paper introduces a cost-efficient active learning (AL) framework for classification, featuring a novel query design called candidate set query. Unlike traditional AL queries requiring the oracle to examine all possible classes, our method narrows down the set of candidate classes likely to include the ground-truth class, significantly reducing the search space and labeling cost. Moreover, we leverage conformal prediction to dynamically generate small yet reliable candidate sets, adapting to model enhancement over successive AL rounds. To this end, we introduce an acquisition function designed to prioritize data points that offer high information gain at lower cost. Empirical evaluations on CIFAR-10, CIFAR-100, and ImageNet64x64 demonstrate the effectiveness and scalability of our framework. Notably, it reduces labeling cost by 48% on ImageNet64x64. The project page can be found at https://yehogwon.github.io/csq-al.
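A hedged sketch of how split conformal prediction can produce the candidate sets the abstract describes: calibrate a score threshold on held-out data, then hand the oracle only the classes whose softmax probability clears it. The toy "classifier" and the nonconformity score are assumptions for illustration; the acquisition function and AL loop are not reproduced.

```python
# Minimal sketch: split conformal prediction sets as candidate sets for labeling.
import numpy as np

rng = np.random.default_rng(0)
n_cal, n_new, n_classes, alpha = 500, 5, 10, 0.1

# Toy "softmax" outputs: Dirichlet samples peaked on the true class.
def toy_probs(labels):
    p = rng.dirichlet(np.ones(n_classes), size=len(labels))
    p[np.arange(len(labels)), labels] += 2.0
    return p / p.sum(axis=1, keepdims=True)

cal_labels = rng.integers(0, n_classes, n_cal)
cal_probs = toy_probs(cal_labels)

# Nonconformity score: 1 - probability assigned to the true class.
scores = 1.0 - cal_probs[np.arange(n_cal), cal_labels]
q_level = np.ceil((n_cal + 1) * (1 - alpha)) / n_cal
qhat = np.quantile(scores, q_level)

# Candidate set for a new point: all classes with probability >= 1 - qhat.
new_probs = toy_probs(rng.integers(0, n_classes, n_new))
for p in new_probs:
    candidates = np.where(p >= 1 - qhat)[0]
    print("candidate classes for the oracle:", candidates)
```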

[358] Parameter-Efficient Continual Fine-Tuning: A Survey

Eric Nuertey Coleman, Luigi Quarantiello, Ziyue Liu, Qinwen Yang, Samrat Mukherjee, Julio Hurtado, Vincenzo Lomonaco

Main category: cs.LG

TL;DR: Survey paper on Parameter-Efficient Continual Fine-Tuning (PECFT) that bridges Continual Learning and Parameter-Efficient Fine-Tuning to address catastrophic forgetting in large pre-trained models.

DetailsMotivation: Large pre-trained models struggle with dynamic learning scenarios due to their dependence on i.i.d. assumptions and suffer from catastrophic forgetting when adapting to multiple tasks sequentially.

Method: Comprehensive review and analysis of CL algorithms, PEFT methods, and state-of-the-art PECFT approaches, including evaluation metrics and research directions.

Result: Identifies synergies between Continual Learning and Parameter-Efficient Fine-Tuning, providing guidance for researchers and outlining future research pathways.

Conclusion: PECFT represents a promising direction for enabling lifelong learning in large-scale models by combining efficient adaptation with continual learning capabilities to overcome catastrophic forgetting.

Abstract: The emergence of large pre-trained networks has revolutionized the AI field, unlocking new possibilities and achieving unprecedented performance. However, these models inherit a fundamental limitation from traditional Machine Learning approaches: their strong dependence on the i.i.d. assumption hinders their adaptability to dynamic learning scenarios. We believe the next breakthrough in AI lies in enabling efficient adaptation to evolving environments – such as the real world – where new data and tasks arrive sequentially. This challenge defines the field of Continual Learning (CL), a Machine Learning paradigm focused on developing lifelong learning neural models. One alternative for efficiently adapting these large-scale models is known as Parameter-Efficient Fine-Tuning (PEFT). These methods tackle the issue of adapting the model to particular data or scenarios by performing small and efficient modifications, achieving similar performance to full fine-tuning. However, these techniques still lack the ability to adjust the model to multiple tasks continually, as they suffer from the issue of Catastrophic Forgetting. In this survey, we first provide an overview of CL algorithms and PEFT methods before reviewing the state-of-the-art on Parameter-Efficient Continual Fine-Tuning (PECFT). We examine various approaches, discuss evaluation metrics, and explore potential future research directions. Our goal is to highlight the synergy between CL and Parameter-Efficient Fine-Tuning, guide researchers in this field, and pave the way for novel future research directions.

[359] Recommendations with Sparse Comparison Data: Provably Fast Convergence for Nonconvex Matrix Factorization

Suryanarayana Sankagiri, Jalal Etesami, Matthias Grossglauser

Main category: cs.LG

TL;DR: Theoretical analysis shows gradient-based methods can efficiently learn personalized recommendations from pairwise comparison data, even with sparse user feedback.

DetailsMotivation: Traditional recommender systems rely on individual item ratings, but users often provide feedback through pairwise comparisons. This paper aims to theoretically analyze whether learning from comparison data is computationally and statistically efficient.

Method: The approach assumes comparisons stem from latent user and item features, reducing preference prediction to learning these features. The analysis extends concentration inequalities from matrix completion to show the loss function exhibits restricted strong convexity near the true solution.

Result: The analysis demonstrates that gradient-based methods converge exponentially when given an appropriate warm start, even in sparse data regimes where each user compares only a few item pairs.

Conclusion: Learning personalized recommendations from comparison data is both computationally efficient (through gradient methods) and statistically efficient (works with sparse data), providing theoretical foundation for comparison-based recommender systems.

Abstract: This paper provides a theoretical analysis of a new learning problem for recommender systems where users provide feedback by comparing pairs of items instead of rating them individually. We assume that comparisons stem from latent user and item features, which reduces the task of predicting preferences to learning these features from comparison data. Similar to the classical matrix factorization problem, the main challenge in this learning task is that the resulting loss function is nonconvex. Our analysis shows that the loss function exhibits (restricted) strong convexity near the true solution, which ensures gradient-based methods converge exponentially, given an appropriate warm start. Importantly, this result holds in a sparse data regime, where each user compares only a few pairs of items. Our main technical contribution is to extend certain concentration inequalities commonly used in matrix completion to our model. Our work demonstrates that learning personalized recommendations from comparison data is computationally and statistically efficient.
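A hedged sketch of the learning problem itself: latent user and item factors fit to sparse pairwise comparisons with gradient descent on a logistic (Bradley–Terry style) loss. The warm-start and the concentration/strong-convexity analysis from the paper are omitted; dimensions, learning rate, and the random initialisation are illustrative assumptions.

```python
# Minimal sketch: learn user/item factors from pairwise comparison data.
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, d, n_comp = 50, 40, 5, 2000

U_true = rng.normal(size=(n_users, d))
V_true = rng.normal(size=(n_items, d))

# Sparse comparison data: (user, item_i, item_j, i_preferred_over_j).
users = rng.integers(0, n_users, n_comp)
items_i = rng.integers(0, n_items, n_comp)
items_j = rng.integers(0, n_items, n_comp)
margin = np.einsum("nd,nd->n", U_true[users], V_true[items_i] - V_true[items_j])
prefs = (rng.random(n_comp) < 1 / (1 + np.exp(-margin))).astype(float)

U = rng.normal(scale=0.1, size=(n_users, d))   # estimates (random, not a warm start)
V = rng.normal(scale=0.1, size=(n_items, d))
lr = 0.1
for _ in range(300):
    diff = V[items_i] - V[items_j]
    p = 1 / (1 + np.exp(-np.einsum("nd,nd->n", U[users], diff)))
    g = (p - prefs)[:, None]                   # gradient of logistic loss wrt margin
    gU = np.zeros_like(U); gV = np.zeros_like(V)
    np.add.at(gU, users, g * diff)
    np.add.at(gV, items_i, g * U[users])
    np.add.at(gV, items_j, -g * U[users])
    U -= lr * gU / n_comp
    V -= lr * gV / n_comp

acc = np.mean((p > 0.5) == (prefs > 0.5))
print(f"training comparison accuracy: {acc:.2f}")
```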

[360] POPri: Private Federated Learning using Preference-Optimized Synthetic Data

Charlie Hou, Mei-Yu Wang, Yige Zhu, Daniel Lazar, Giulia Fanti

Main category: cs.LG

TL;DR: POPri uses reinforcement learning (policy optimization) to generate high-quality differentially private synthetic data for federated learning, outperforming previous DP-FL and synthetic data methods by significantly closing the utility gap between private and non-private settings.

DetailsMotivation: Current DP-FL methods may be enhanced by DP synthetic data approaches, but existing methods require careful prompt engineering and iterative private client feedback. The authors recognize that this feedback can be treated as RL rewards to optimize synthetic data generation.

Method: POPri (Policy Optimization for Private Data) uses policy optimization algorithms like Direct Preference Optimization (DPO) to fine-tune LLMs, harnessing client feedback as RL rewards to generate high-quality DP synthetic data. Also introduces LargeFedBench benchmark for evaluation.

Result: POPri substantially improves DP synthetic data utility, closing the gap between next-token prediction accuracy in fully-private vs non-private settings by up to 58% (vs 28% for prior synthetic data methods and 3% for state-of-the-art DP-FL methods).

Conclusion: Reinforcement learning-based policy optimization effectively leverages client feedback to generate high-quality differentially private synthetic data, significantly outperforming existing approaches and narrowing the privacy-utility tradeoff in federated learning.

Abstract: In practical settings, differentially private Federated learning (DP-FL) is the dominant method for training models from private, on-device client data. Recent work has suggested that DP-FL may be enhanced or outperformed by methods that use DP synthetic data (Wu et al., 2024; Hou et al., 2024). The primary algorithms for generating DP synthetic data for FL applications require careful prompt engineering based on public information and/or iterative private client feedback. Our key insight is that the private client feedback collected by prior DP synthetic data methods (Hou et al., 2024; Xie et al., 2024) can be viewed as an RL (reinforcement learning) reward. Our algorithm, Policy Optimization for Private Data (POPri) harnesses client feedback using policy optimization algorithms such as Direct Preference Optimization (DPO) to fine-tune LLMs to generate high-quality DP synthetic data. To evaluate POPri, we release LargeFedBench, a new federated text benchmark for uncontaminated LLM evaluations on federated client data. POPri substantially improves the utility of DP synthetic data relative to prior work on LargeFedBench datasets and an existing benchmark from Xie et al. (2024). POPri closes the gap between next-token prediction accuracy in the fully-private and non-private settings by up to 58%, compared to 28% for prior synthetic data methods, and 3% for state-of-the-art DP federated learning methods. The code and data are available at https://github.com/meiyuw/POPri.
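Since POPri's core move is to treat client feedback as a preference signal, a hedged sketch of the standard DPO objective it builds on is shown below. The federated and differential-privacy machinery is not reproduced; the sequence log-probabilities are toy placeholders.

```python
# Minimal sketch of the Direct Preference Optimization (DPO) loss.
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss on sequence log-probs of preferred vs rejected samples."""
    chosen_ratio = policy_logp_chosen - ref_logp_chosen
    rejected_ratio = policy_logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy usage: pretend we already summed per-token log-probs for 4 preference pairs.
policy_c = torch.tensor([-12.3, -9.8, -15.1, -11.0], requires_grad=True)
policy_r = torch.tensor([-11.9, -10.5, -14.0, -13.2], requires_grad=True)
ref_c = torch.tensor([-12.5, -10.0, -15.0, -11.5])
ref_r = torch.tensor([-12.0, -10.2, -14.2, -12.9])

loss = dpo_loss(policy_c, policy_r, ref_c, ref_r)
loss.backward()
print(float(loss))
```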

[361] A kinetic-based regularization method for data science applications

Abhisek Ganguly, Alessandro Gabbana, Vybhav Rao, Sauro Succi, Santosh Ansumali

Main category: cs.LG

TL;DR: Physics-based regularization technique using statistical mechanics analogy to improve function learning accuracy without empirical parameter tuning.

DetailsMotivation: To minimize discrepancy between discrete and continuum data representations and access more favorable energy landscapes for better interpolation/regression performance.

Method: Drawing analogy between parameter optimization and energy minimization, introducing corrections that constrain lower-order moments of data distribution.

Result: Improved performance in interpolation and regression tasks, even in high dimensions; effective for noisy data without parameter tuning; computationally efficient for large datasets.

Conclusion: Physics-inspired regularization provides accurate, efficient function learning with advantages over traditional methods, particularly for noisy data and large-scale problems.

Abstract: We propose a physics-based regularization technique for function learning, inspired by statistical mechanics. By drawing an analogy between optimizing the parameters of an interpolator and minimizing the energy of a system, we introduce corrections that impose constraints on the lower-order moments of the data distribution. This minimizes the discrepancy between the discrete and continuum representations of the data, in turn allowing to access more favorable energy landscapes, thus improving the accuracy of the interpolator. Our approach improves performance in both interpolation and regression tasks, even in high-dimensional spaces. Unlike traditional methods, it does not require empirical parameter tuning, making it particularly effective for handling noisy data. We also show that thanks to its local nature, the method offers computational and memory efficiency advantages over Radial Basis Function interpolators, especially for large datasets.

[362] Performance Comparisons of Reinforcement Learning Algorithms for Sequential Experimental Design

Yasir Zubayr Barlas, Kizito Salako

Main category: cs.LG

TL;DR: RL algorithms for sequential experimental design show varying performance, with dropout and ensemble methods demonstrating better generalization.

DetailsMotivation: To address the lack of understanding about which reinforcement learning algorithms work best for training agents in sequential experimental design problems that require good generalization across changing statistical properties.

Method: Investigated several reinforcement learning algorithms for training agents to navigate design spaces and select informative designs sequentially, focusing on algorithms with dropout and ensemble approaches.

Result: Agent performance varies significantly depending on the training algorithm used, with dropout and ensemble methods empirically showing attractive generalization properties.

Conclusion: The choice of reinforcement learning algorithm critically impacts agent performance in sequential experimental design, and specific approaches like dropout and ensembles offer superior generalization capabilities.

Abstract: Recent developments in sequential experimental design look to construct a policy that can efficiently navigate the design space, in a way that maximises the expected information gain. Whilst there is work on achieving tractable policies for experimental design problems, there is significantly less work on obtaining policies that are able to generalise well - i.e. able to give good performance despite a change in the underlying statistical properties of the experiments. Conducting experiments sequentially has recently brought about the use of reinforcement learning, where an agent is trained to navigate the design space to select the most informative designs for experimentation. However, there is still a lack of understanding about the benefits and drawbacks of using certain reinforcement learning algorithms to train these agents. In our work, we investigate several reinforcement learning algorithms and their efficacy in producing agents that take maximally informative design decisions in sequential experimental design scenarios. We find that agent performance is impacted depending on the algorithm used for training, and that particular algorithms, using dropout or ensemble approaches, empirically showcase attractive generalisation properties.

[363] Position: We Need Responsible, Application-Driven (RAD) AI Research

Sarah Hartman, Cheng Soon Ong, Julia Powles, Petra Kuhnert

Main category: cs.LG

TL;DR: RAD-AI proposes a responsible, application-driven approach to AI research that focuses on real-world contexts, ethical considerations, and community needs through transdisciplinary collaboration and staged testing.

DetailsMotivation: As AI becomes increasingly integrated into society, researchers need to engage with specific application contexts and address ethical, legal, and societal considerations to achieve meaningful scientific and societal advances.

Method: Three-staged approach: (1) building transdisciplinary teams and people-centred studies, (2) addressing context-specific methods, ethical commitments, assumptions, and metrics, and (3) testing and sustaining efficacy through staged testbeds and a community of practice.

Result: The paper presents a vision for application-driven AI research that can unlock new value through technically feasible methods adaptive to contextual needs and community values.

Conclusion: A responsible, application-driven approach (RAD-AI) is necessary to ensure AI research delivers meaningful advances that serve community needs while addressing ethical and societal constraints.

Abstract: This position paper argues that achieving meaningful scientific and societal advances with artificial intelligence (AI) requires a responsible, application-driven approach (RAD) to AI research. As AI is increasingly integrated into society, AI researchers must engage with the specific contexts where AI is being applied. This includes being responsive to ethical and legal considerations, technical and societal constraints, and public discourse. We present the case for RAD-AI to drive research through a three-staged approach: (1) building transdisciplinary teams and people-centred studies; (2) addressing context-specific methods, ethical commitments, assumptions, and metrics; and (3) testing and sustaining efficacy through staged testbeds and a community of practice. We present a vision for the future of application-driven AI research to unlock new value through technically feasible methods that are adaptive to the contextual needs and values of the communities they ultimately serve.

[364] Langevin Monte-Carlo Provably Learns Depth Two Neural Nets at Any Size and Data

Dibyakanti Kumar, Samyak Jha, Anirbit Mukherjee

Main category: cs.LG

TL;DR: Langevin Monte-Carlo algorithm can learn depth-2 neural networks of any size with non-asymptotic convergence rates, requiring regularization independent of network size.

DetailsMotivation: To establish theoretical guarantees for Langevin Monte-Carlo in learning neural networks and provide non-asymptotic convergence rates for depth-2 networks of arbitrary size.

Method: Analyze convergence of Langevin Monte-Carlo iterates in q-Renyi divergence to Gibbs distribution of Frobenius norm regularized losses for depth-2 neural nets with smooth activations in classification and regression settings.

Result: LMC converges to Gibbs distribution with regularization amount independent of network size, satisfying Villani conditions and Poincare inequality for the Gibbs measures.

Conclusion: The work synthesizes recent observations about isoperimetry conditions for LMC convergence and demonstrates that two-layer neural loss functions can be regularized by a constant amount to satisfy necessary conditions for convergence.

Abstract: In this work, we will establish that the Langevin Monte-Carlo algorithm can learn depth-2 neural nets of any size and for any data and we give non-asymptotic convergence rates for it. We achieve this via showing that in q-Renyi divergence, the iterates of Langevin Monte Carlo converge to the Gibbs distribution of Frobenius norm regularized losses for any of these nets, when using smooth activations and in both classification and regression settings. Most critically, the amount of regularization needed for our results is independent of the size of the net. This result achieves a synthesis of several recent observations about isoperimetry conditions under which LMC converges and that two-layer neural loss functions can always be regularized by a certain constant amount such that they satisfy the Villani conditions, and thus their Gibbs measures satisfy a Poincare inequality.
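A hedged, minimal sketch of running Langevin Monte-Carlo on the Frobenius-norm regularised loss of a depth-2 network, as analysed above. The step size, inverse temperature, regularisation strength, and toy regression data are illustrative assumptions.

```python
# Minimal sketch: LMC iterates = gradient step + Gaussian noise on a depth-2 net.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(256, 8)
y = torch.sin(X[:, :1]) + 0.1 * torch.randn(256, 1)      # toy regression targets

net = nn.Sequential(nn.Linear(8, 32), nn.Tanh(), nn.Linear(32, 1))
eta, beta, lam = 1e-3, 1e3, 1e-2                          # step size, inverse temp, reg

for step in range(2000):
    pred = net(X)
    # Frobenius-regularised loss: squared error + lam * sum of squared weights.
    loss = ((pred - y) ** 2).mean() + lam * sum((p ** 2).sum() for p in net.parameters())
    net.zero_grad()
    loss.backward()
    with torch.no_grad():
        for p in net.parameters():
            # LMC update: gradient step plus noise scaled by sqrt(2 * eta / beta).
            p -= eta * p.grad
            p += (2 * eta / beta) ** 0.5 * torch.randn_like(p)

print("final loss:", float(loss))
```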

[365] Incorporating Attributes and Multi-Scale Structures for Heterogeneous Graph Contrastive Learning

Ruobing Jiang, Yacong Li, Haobing Liu, Yanwei Yu

Main category: cs.LG

TL;DR: ASHGCL is a novel contrastive learning framework for heterogeneous graphs that uses three distinct views to capture attribute, high-order, and low-order structural information, with an attribute-enhanced positive sample selection strategy to address sampling bias.

DetailsMotivation: Heterogeneous graphs effectively capture real-world complex relational structures, but labeled data is often scarce in real scenarios, limiting semi-supervised approaches. Self-supervised learning can address the challenge of limited labeled data.

Method: Proposes ASHGCL framework with three distinct views focusing on node attributes, high-order structural information, and low-order structural information. Introduces an attribute-enhanced positive sample selection strategy that combines structural and attribute information.

Result: Extensive experiments on four real-world datasets show ASHGCL outperforms state-of-the-art unsupervised baselines and even surpasses some supervised benchmarks.

Conclusion: The proposed ASHGCL framework effectively addresses the challenge of limited labeled data in heterogeneous graphs through innovative contrastive learning with multiple views and enhanced sampling strategies, achieving superior performance compared to existing methods.

Abstract: Heterogeneous graphs (HGs) are composed of multiple types of nodes and edges, making it more effective in capturing the complex relational structures inherent in the real world. However, in real-world scenarios, labeled data is often difficult to obtain, which limits the applicability of semi-supervised approaches. Self-supervised learning aims to enable models to automatically learn useful features from data, effectively addressing the challenge of limited labeling data. In this paper, we propose a novel contrastive learning framework for heterogeneous graphs (ASHGCL), which incorporates three distinct views, each focusing on node attributes, high-order and low-order structural information, respectively, to effectively capture attribute information, high-order structures, and low-order structures for node representation learning. Furthermore, we introduce an attribute-enhanced positive sample selection strategy that combines both structural information and attribute information, effectively addressing the issue of sampling bias. Extensive experiments on four real-world datasets show that ASHGCL outperforms state-of-the-art unsupervised baselines and even surpasses some supervised benchmarks.
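As a reference point for the contrastive objective underlying multi-view frameworks like this one, a hedged sketch of an InfoNCE loss between two embeddings of the same nodes is given below. The heterogeneous-graph encoders, the three specific views, and the attribute-enhanced positive sampling from the paper are not reproduced.

```python
# Minimal sketch: InfoNCE contrastive loss between two views of node embeddings.
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.5):
    """Treat (z1[i], z2[i]) as positives and all other rows as negatives."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature              # (N, N) similarity matrix
    labels = torch.arange(z1.size(0))
    return F.cross_entropy(logits, labels)

# Toy usage: two "views" of 128 node embeddings.
z_view_a = torch.randn(128, 64, requires_grad=True)
z_view_b = z_view_a.detach() + 0.1 * torch.randn(128, 64)
loss = info_nce(z_view_a, z_view_b)
loss.backward()
print(float(loss))
```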

[366] Sample Complexity of Diffusion Model Training Without Empirical Risk Minimizer Access

Mudit Gaur, Prashant Trivedi, Sasidhar Kunapuli, Amrit Singh Bedi, Vaneet Aggarwal

Main category: cs.LG

TL;DR: First sample complexity analysis for diffusion models that achieves O(ε⁻⁶) bound without assuming access to exact empirical risk minimizers, eliminating exponential dependence on neural network parameters.

DetailsMotivation: Prior theoretical analyses of diffusion models suffered from poor scaling with input dimension or relied on unrealistic assumptions about access to exact empirical risk minimizers.

Method: Leverages a structured decomposition of score estimation error into statistical, approximation, and optimization errors to analyze sample complexity.

Result: Establishes a sample complexity bound of O(ε⁻⁶) for score estimation in diffusion models, the first such result without assuming access to empirical risk minimizers.

Conclusion: Provides a principled theoretical foundation for diffusion models with practical sample complexity bounds that scale better than previous analyses.

Abstract: Diffusion models have demonstrated state-of-the-art performance across vision, language, and scientific domains. Despite their empirical success, prior theoretical analyses of the sample complexity suffer from poor scaling with input data dimension or rely on unrealistic assumptions such as access to exact empirical risk minimizers. In this work, we provide a principled analysis of score estimation, establishing a sample complexity bound of Õ(ε⁻⁶). Our approach leverages a structured decomposition of the score estimation error into statistical, approximation, and optimization errors, enabling us to eliminate the exponential dependence on neural network parameters that arises in prior analyses. It is the first such result which achieves sample complexity bounds without assuming access to the empirical risk minimizer of score function estimation loss.

[367] Reinforcement Learning for Solving the Pricing Problem in Column Generation: Applications to Vehicle Routing

Abdo Abouelrous, Laurens Bliek, Adriana F. Gabor, Yaoxin Wu, Yingqian Zhang

Main category: cs.LG

TL;DR: RL-based attention model for column generation pricing problem in vehicle routing, achieving faster solving times compared to DP heuristics.

DetailsMotivation: Address the column generation problem using reinforcement learning to find columns with most negative reduced cost in pricing problems without relying on heuristics.

Method: End-to-end reinforcement learning model with attention-mechanism architecture to independently solve pricing problems in column generation framework.

Result: Method solves linear relaxation with reasonable objective gap in significantly shorter running times compared to Dynamic Programming-based heuristics.

Conclusion: RL-based approach provides efficient alternative to traditional heuristics for column generation pricing problems, particularly in vehicle routing applications.

Abstract: In this paper, we address the problem of Column Generation (CG) using Reinforcement Learning (RL). Specifically, we use a RL model based on the attention-mechanism architecture to find the columns with most negative reduced cost in the Pricing Problem (PP). Unlike previous Machine Learning (ML) applications for CG, our model deploys an end-to-end mechanism as it independently solves the pricing problem without the help of any heuristic. We consider a variant of Vehicle Routing Problem (VRP) as a case study for our method. Through a set of experiments where our method is compared against a Dynamic Programming (DP)-based heuristic for solving the PP, we show that our method solves the linear relaxation up to a reasonable objective gap in significantly shorter running times.

[368] G1: Teaching LLMs to Reason on Graphs with Reinforcement Learning

Xiaojun Guo, Ang Li, Yifei Wang, Stefanie Jegelka, Yisen Wang

Main category: cs.LG

TL;DR: G1 uses reinforcement learning on synthetic graph tasks to significantly improve LLMs’ graph reasoning abilities, with a 3B model outperforming much larger models and showing strong generalization.

DetailsMotivation: LLMs have limited proficiency in graph-related tasks despite general progress, and previous approaches face challenges with scarce large-scale graph data.

Method: Reinforcement Learning on Erdös dataset - 50 diverse graph-theoretic tasks with 100k training data derived from real-world graphs.

Result: 3B model outperforms Qwen2.5-72B-Instruct (24x larger), shows strong zero-shot generalization to unseen tasks/domains, and maintains general reasoning abilities.

Conclusion: RL on synthetic graph-theoretic tasks provides an efficient, scalable path to build strong graph reasoners by eliciting LLMs’ latent graph understanding capabilities.

Abstract: Although Large Language Models (LLMs) have demonstrated remarkable progress, their proficiency in graph-related tasks remains notably limited, hindering the development of truly general-purpose models. Previous attempts, including pretraining graph foundation models or employing supervised fine-tuning, often face challenges such as the scarcity of large-scale, universally represented graph data. We introduce G1, a simple yet effective approach demonstrating that Reinforcement Learning (RL) on synthetic graph-theoretic tasks can significantly scale LLMs’ graph reasoning abilities. To enable RL training, we curate Erdős, the largest graph reasoning dataset to date comprising 50 diverse graph-theoretic tasks of varying difficulty levels, 100k training data and 5k test data, all derived from real-world graphs. With RL on Erdős, G1 obtains substantial improvements in graph reasoning, where our finetuned 3B model even outperforms Qwen2.5-72B-Instruct (24x size). RL-trained models also show strong zero-shot generalization to unseen tasks, domains, and graph encoding schemes, including other graph-theoretic benchmarks as well as real-world node classification and link prediction tasks, without compromising general reasoning abilities. Our findings offer an efficient, scalable path for building strong graph reasoners by finetuning LLMs with RL on graph-theoretic tasks, which combines the strengths of pretrained LLM capabilities with abundant, automatically generated synthetic data, suggesting that LLMs possess graph understanding abilities that RL can elicit successfully. Our implementation is open-sourced at https://github.com/PKU-ML/G1, with models and datasets hosted on Hugging Face collections https://huggingface.co/collections/PKU-ML/g1-683d659e992794fc99618cf2 for broader accessibility.

[369] MEGA: Second-Order Gradient Alignment for Catastrophic Forgetting Mitigation in GFSCIL

Jinhui Pang, Changqing Lin, Hao Lin, Zhihui Zhang, Long Chen, Weiping Ding, Yu Liu, Xiaoshuai Hao

Main category: cs.LG

TL;DR: Proposes MEGA, a model-agnostic meta graph continual learning method for GFSCIL that excludes query sets during incremental training and uses second-order gradients to learn high-quality priors, achieving state-of-the-art results.

DetailsMotivation: Existing GFSCIL methods oversimplify learning through query set fine-tuning and fail to integrate Graph Continual Learning techniques due to architectural constraints, requiring a more rigorous and practical setting.

Method: Introduces Model-Agnostic Meta Graph Continual Learning (MEGA) that calculates incremental second-order gradients during meta-training to learn high-quality priors that align model behaviors across meta-training and incremental learning stages.

Result: Extensive experiments on four mainstream graph datasets demonstrate that MEGA achieves state-of-the-art results and enhances the effectiveness of various GCL methods in GFSCIL.

Conclusion: MEGA serves as a model-agnostic GFSCIL paradigm that effectively alleviates catastrophic forgetting and paves the way for future research in graph few-shot class-incremental learning.

Abstract: Graph Few-Shot Class-Incremental Learning (GFSCIL) enables models to continually learn from limited samples of novel tasks after initial training on a large base dataset. Existing GFSCIL approaches typically utilize Prototypical Networks (PNs) for metric-based class representations and fine-tune the model during the incremental learning stage. However, these PN-based methods oversimplify learning via novel query set fine-tuning and fail to integrate Graph Continual Learning (GCL) techniques due to architectural constraints. To address these challenges, we propose a more rigorous and practical setting for GFSCIL that excludes query sets during the incremental training phase. Building on this foundation, we introduce Model-Agnostic Meta Graph Continual Learning (MEGA), aimed at effectively alleviating catastrophic forgetting for GFSCIL. Specifically, by calculating the incremental second-order gradient during the meta-training stage, we endow the model to learn high-quality priors that enhance incremental learning by aligning its behaviors across both the meta-training and incremental learning stages. Extensive experiments on four mainstream graph datasets demonstrate that MEGA achieves state-of-the-art results and enhances the effectiveness of various GCL methods in GFSCIL. We believe that our proposed MEGA serves as a model-agnostic GFSCIL paradigm, paving the way for future research.

[370] Always Skip Attention

Yiping Ji, Hemanth Saratchandran, Peyman Moghadam, Simon Lucey

Main category: cs.LG

TL;DR: Vision Transformers’ self-attention fails to train without skip connections, unlike other components or CNNs. The paper shows self-attention is fundamentally ill-conditioned and proposes Token Graying to improve conditioning.

DetailsMotivation: To understand why self-attention in Vision Transformers catastrophically fails to train without skip connections, unlike other architectural components or previous deep learning models like CNNs.

Method: Theoretical characterization of self-attention’s ill-conditioned nature, plus proposing Token Graying - a simple complement to skip connections that improves input token conditioning.

Result: Validated the approach in both supervised and self-supervised training methods, showing improved conditioning and performance.

Conclusion: Self-attention in ViTs is uniquely dependent on skip connections due to fundamental ill-conditioning, and Token Graying provides an effective complementary regularization technique.

Abstract: We highlight a curious empirical result within modern Vision Transformers (ViTs). Specifically, self-attention catastrophically fails to train unless it is used in conjunction with a skip connection. This is in contrast to other elements of a ViT that continue to exhibit good performance (albeit suboptimal) when skip connections are removed. Further, we show that this critical dependence on skip connections is a relatively new phenomenon, with previous deep architectures (e.g., CNNs) exhibiting good performance in their absence. In this paper, we theoretically characterize that the self-attention mechanism is fundamentally ill-conditioned and is, therefore, uniquely dependent on skip connections for regularization. Additionally, we propose Token Graying – a simple yet effective complement (to skip connections) that further improves the condition of input tokens. We validate our approach in both supervised and self-supervised training methods.
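
As a toy numerical illustration of the conditioning claim (not the paper's formal analysis, and without Token Graying), the snippet below builds a softmax attention matrix from random tokens and compares its condition number with and without the identity (skip) path:

```python
# Toy illustration: the softmax attention matrix is close to low-rank, while
# adding the identity (skip) path restores conditioning.
import numpy as np

rng = np.random.default_rng(0)
n, d = 64, 32
X = rng.standard_normal((n, d))
Wq = rng.standard_normal((d, d)) / np.sqrt(d)
Wk = rng.standard_normal((d, d)) / np.sqrt(d)

scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(d)
A = np.exp(scores - scores.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)                       # row-stochastic attention matrix

print("cond(A)     =", np.linalg.cond(A))               # typically very large
print("cond(I + A) =", np.linalg.cond(np.eye(n) + A))   # much smaller with the skip path
```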

[371] Recipes for Pre-training LLMs with MXFP8

Asit Mishra, Dusan Stosic, Simon Layton, Paulius Micikevicius

Main category: cs.LG

TL;DR: Microscaling (MX) formats enable efficient 8-bit training that matches BF16 accuracy by combining narrow floating-point types with per-block scaling, tested on models up to 8B parameters.

DetailsMotivation: Improving GPU efficiency during pre-training by using fewer bits to represent model parameters without sacrificing accuracy, addressing the need for more efficient quantization techniques.

Method: Using MXFP8-E4M3 datatype with specific number conversion algorithm and careful parameter choices for microscaling formats, enabling quantization of more tensors and efficient execution.

Result: Training sessions using MX formats match those carried out in BF16, demonstrated on models with up to 8B parameters trained on datasets of up to 15T tokens.

Conclusion: Microscaling formats represent a major advancement in efficient model training, making 8-bit quantization practical while maintaining accuracy equivalent to BF16 precision.

Abstract: Using fewer bits to represent model parameters and related tensors during pre-training has become a required technique for improving GPU efficiency without sacrificing accuracy. Microscaling (MX) formats introduced in NVIDIA Blackwell generation of GPUs represent a major advancement of this technique, making it practical to combine narrow floating-point data types with finer granularity per-block scaling factors. In turn, this enables both quantization of more tensors than previous approaches and more efficient execution of operations on those tensors. Effective use of MX-formats requires careful choices of various parameters. In this paper we review these choices and show how MXFP8-E4M3 datatype and a specific number conversion algorithm result in training sessions that match those carried out in BF16. We present results using models with up to 8B parameters, trained on high-quality datasets of up to 15T tokens.
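
A minimal NumPy sketch of the microscaling idea: each block of 32 values shares a power-of-two scale chosen so the block maximum fits within the FP8 E4M3 range (maximum magnitude 448). The mantissa rounding below is an approximation (subnormals are ignored), and the paper's specific conversion algorithm is not reproduced.

```python
# Simplified microscaling (MX) sketch: blocks of 32 values share a power-of-two
# scale so that the block maximum stays within the FP8 E4M3 range.
import numpy as np

BLOCK = 32
E4M3_MAX = 448.0

def round_to_3_mantissa_bits(v: np.ndarray) -> np.ndarray:
    """Approximate E4M3 rounding: snap each value to a 3-bit-mantissa grid."""
    out = np.zeros_like(v)
    nz = v != 0
    e = np.floor(np.log2(np.abs(v[nz])))        # per-element exponent
    quantum = 2.0 ** (e - 3)                    # spacing with a 3-bit mantissa
    out[nz] = np.round(v[nz] / quantum) * quantum
    return out

def mx_quantize(x: np.ndarray):
    """Quantize a 1-D tensor whose length is a multiple of BLOCK."""
    xb = x.reshape(-1, BLOCK)
    amax = np.abs(xb).max(axis=1, keepdims=True)
    # Smallest power-of-two scale that maps the block maximum inside E4M3 range.
    scale = 2.0 ** np.ceil(np.log2(np.maximum(amax, 1e-30) / E4M3_MAX))
    q = round_to_3_mantissa_bits(np.clip(xb / scale, -E4M3_MAX, E4M3_MAX))
    return q, scale

def mx_dequantize(q, scale):
    return (q * scale).reshape(-1)

x = np.random.default_rng(0).standard_normal(4 * BLOCK)
q, s = mx_quantize(x)
print("max abs quantization error:", np.abs(mx_dequantize(q, s) - x).max())
```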

[372] Epistemic Wrapping for Uncertainty Quantification

Maryam Sultana, Neil Yorke-Smith, Kaizheng Wang, Shireen Kudukkil Manchingal, Muhammad Mubashar, Fabio Cuzzolin

Main category: cs.LG

TL;DR: Novel Epistemic Wrapping method improves uncertainty estimation in classification by transforming BNN outputs into belief function posteriors, enhancing generalization and uncertainty quantification across multiple datasets.

DetailsMotivation: Uncertainty estimation is crucial for improving robustness and reliability of machine learning models in classification tasks, but existing methods may not adequately capture epistemic uncertainty.

Method: Uses Bayesian Neural Networks as baseline and transforms their outputs into belief function posteriors to capture epistemic uncertainty more effectively. Tested with BNN baseline and Interval Neural Network on MNIST, Fashion-MNIST, CIFAR-10, and CIFAR-100 datasets.

Result: The Epistemic Wrapper significantly enhances generalization and uncertainty quantification across all tested datasets.

Conclusion: The proposed Epistemic Wrapping methodology provides an efficient and general approach for improved uncertainty estimation in classification tasks, effectively capturing epistemic uncertainty.

Abstract: Uncertainty estimation is pivotal in machine learning, especially for classification tasks, as it improves the robustness and reliability of models. We introduce a novel 'Epistemic Wrapping' methodology aimed at improving uncertainty estimation in classification. Our approach uses Bayesian Neural Networks (BNNs) as a baseline and transforms their outputs into belief function posteriors, effectively capturing epistemic uncertainty and offering an efficient and general methodology for uncertainty quantification. Comprehensive experiments employing a Bayesian Neural Network (BNN) baseline and an Interval Neural Network for inference on the MNIST, Fashion-MNIST, CIFAR-10 and CIFAR-100 datasets demonstrate that our Epistemic Wrapper significantly enhances generalisation and uncertainty quantification.

[373] ConTextTab: A Semantics-Aware Tabular In-Context Learner

Marco Spinaci, Marek Polewczyk, Maximilian Schambach, Sam Thelin

Main category: cs.LG

TL;DR: ConTextTab combines table-native ICL efficiency with LLM semantic understanding by using specialized embeddings and training on real-world tabular data, achieving SOTA performance across benchmarks.

DetailsMotivation: Current table-native ICL models are efficient but limited by synthetic training data, while LLM-based tabular ICL models have semantic understanding but poor context utilization. The goal is to combine both advantages.

Method: Integrates semantic understanding into table-native ICL framework using specialized embeddings for different data modalities and training on large-scale real-world tabular data.

Result: Competitive with state-of-the-art across broad benchmarks and sets new standard on semantically rich CARTE benchmark.

Conclusion: ConTextTab successfully bridges the gap between efficient table-native architectures and deep semantic understanding, demonstrating superior performance on real-world tabular tasks.

Abstract: Tabular in-context learning (ICL) has recently achieved state-of-the-art (SOTA) performance on several tabular prediction tasks. Previously restricted to classification problems on small tables, recent advances such as TabPFN and TabICL have extended its use to larger datasets. Although current table-native ICL architectures are architecturally efficient and well-adapted to tabular data structures, their exclusive training on synthetic data limits their ability to fully leverage the rich semantics and world knowledge contained in real-world tabular data. At the other end of the spectrum, tabular ICL models based on pretrained large language models such as TabuLa-8B integrate deep semantic understanding and world knowledge but are only able to make use of a small amount of context due to inherent architectural limitations. With the aim to combine the best of both these worlds, we introduce ConTextTab, integrating semantic understanding and alignment into a table-native ICL framework. By employing specialized embeddings for different data modalities and by training on large-scale real-world tabular data, our model is competitive with SOTA across a broad set of benchmarks while setting a new standard on the semantically rich CARTE benchmark. Code and model checkpoints are available at: https://github.com/SAP-samples/contexttab

[374] Good Things Come in Pairs: Paired Autoencoders for Inverse Problems

Matthias Chung, Bas Peters, Michael Solomon

Main category: cs.LG

TL;DR: The paired autoencoder framework combines data-driven and model-based methods for solving inverse problems by projecting data and quantities of interest into latent spaces and mapping between them, enabling high-quality reconstructions even with noisy data.

DetailsMotivation: To develop a powerful approach for solving inverse problems in scientific computing that leverages both data-driven and model-based methods, addressing challenges like data noise and uncertainty quantification.

Method: Uses a paired autoencoder framework that projects both input data and target quantities into latent spaces, creates surrogate forward and inverse mappings, and enables latent-space refinement to fit observed data. Also introduces variational variants for uncertainty analysis.

Result: Numerical experiments demonstrate high-quality estimates even when data noise exceeds training levels, with successful applications in seismic imaging and inpainting problems. The framework provides multiple reconstruction metrics and enables uncertainty analysis.

Conclusion: The paired autoencoder framework is an effective likelihood-free approach for inverse problems that combines data-driven and model-based advantages, provides reconstruction assessment metrics, and supports uncertainty quantification through novel variational variants.

Abstract: In this book chapter, we discuss recent advances in data-driven approaches for inverse problems. In particular, we focus on the paired autoencoder framework, which has proven to be a powerful tool for solving inverse problems in scientific computing. The paired autoencoder framework is a novel approach that leverages the strengths of both data-driven and model-based methods by projecting both the data and the quantity of interest into a latent space and mapping these latent spaces to provide surrogate forward and inverse mappings. We illustrate the advantages of this approach through numerical experiments, including seismic imaging and classical inpainting: nonlinear and linear inverse problems, respectively. Although the paired autoencoder framework is likelihood-free, it generates multiple data- and model-based reconstruction metrics that help assess whether examples are in or out of distribution. In addition to direct model estimates from data, the paired autoencoder enables latent-space refinement to fit the observed data accurately. Numerical experiments show that this procedure, combined with the latent-space initial guess, is essential for high-quality estimates, even when data noise exceeds the training regime. We also introduce two novel variants that combine variational and paired autoencoder ideas, maintaining the original benefits while enabling sampling for uncertainty analysis.
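
A minimal PyTorch sketch of the paired-autoencoder idea: the data d and the quantity of interest m each get their own autoencoder, and two linear maps between the latent codes serve as surrogate forward and inverse operators. Dimensions, architectures, and the joint loss below are placeholders rather than the chapter's actual setup.

```python
# Paired autoencoder sketch: encode data d and model m into latent spaces and
# learn maps between the latents as surrogate forward/inverse operators.
import torch
import torch.nn as nn

d_dim, m_dim, z_dim = 128, 64, 16   # placeholder sizes

enc_d = nn.Sequential(nn.Linear(d_dim, z_dim), nn.ReLU(), nn.Linear(z_dim, z_dim))
dec_d = nn.Linear(z_dim, d_dim)
enc_m = nn.Sequential(nn.Linear(m_dim, z_dim), nn.ReLU(), nn.Linear(z_dim, z_dim))
dec_m = nn.Linear(z_dim, m_dim)
fwd_map = nn.Linear(z_dim, z_dim)   # latent surrogate of the forward operator
inv_map = nn.Linear(z_dim, z_dim)   # latent surrogate of the inverse operator

def surrogate_inverse(d):
    """Estimate the quantity of interest directly from observed data."""
    return dec_m(inv_map(enc_d(d)))

params = [p for mod in (enc_d, dec_d, enc_m, dec_m, fwd_map, inv_map)
          for p in mod.parameters()]
opt = torch.optim.Adam(params, lr=1e-3)

d, m = torch.randn(32, d_dim), torch.randn(32, m_dim)         # toy paired samples
loss = (nn.functional.mse_loss(dec_d(enc_d(d)), d)            # data autoencoder
        + nn.functional.mse_loss(dec_m(enc_m(m)), m)          # model autoencoder
        + nn.functional.mse_loss(inv_map(enc_d(d)), enc_m(m)) # latent inverse map
        + nn.functional.mse_loss(fwd_map(enc_m(m)), enc_d(d)))  # latent forward map
opt.zero_grad(); loss.backward(); opt.step()
print(surrogate_inverse(d).shape)   # torch.Size([32, 64])
```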

[375] Bidirectional Information Flow (BIF) – A Sample Efficient Hierarchical Gaussian Process for Bayesian Optimization

Juan D. Guerra, Thomas Garbay, Guillaume Lajoie, Marco Bonizzato

Main category: cs.LG

TL;DR: Bidirectional Information Flow (BIF) framework enhances Hierarchical Gaussian Processes by enabling two-way information exchange between parent and child models, significantly improving sample efficiency and performance in online learning scenarios.

DetailsMotivation: Traditional H-GP models only allow one-way information flow (either bottom-up or top-down), which limits sample efficiency and slows convergence. The authors aim to leverage the full hierarchical structure by establishing bidirectional communication.

Method: Proposed BIF framework that maintains modular hierarchical structure while introducing top-down feedback to continually refine child models during online learning, enabling mutual information exchange between parent and children GPs.

Result: BIF outperforms conventional H-GP Bayesian Optimization methods, achieving up to 4x higher R² scores for parent models and 3x higher for children models on both synthetic and real-world neurostimulation optimization tasks.

Conclusion: The bidirectional information exchange in H-GPs significantly improves sample efficiency, enables robust training, and allows modular reuse of learned subtask models, making it superior to traditional one-way coupling approaches.

Abstract: Hierarchical Gaussian Process (H-GP) models divide problems into different subtasks, allowing for different models to address each part, making them well-suited for problems with inherent hierarchical structure. However, typical H-GP models do not fully take advantage of this structure, only sending information up or down the hierarchy. This one-way coupling limits sample efficiency and slows convergence. We propose Bidirectional Information Flow (BIF), an efficient H-GP framework that establishes bidirectional information exchange between parent and child models in H-GPs for online training. BIF retains the modular structure of hierarchical models - the parent combines subtask knowledge from children GPs - while introducing top-down feedback to continually refine children models during online learning. This mutual exchange improves sample efficiency, enables robust training, and allows modular reuse of learned subtask models. BIF outperforms conventional H-GP Bayesian Optimization methods, achieving up to 4x and 3x higher $R^2$ scores for the parent and children respectively, on synthetic and real-world neurostimulation optimization tasks.

[376] Flexible Operator Fusion for Fast Sparse Transformer with Diverse Masking on GPU

Wenhao Dai, Haodong Deng, Mengfei Rong, Xinyu Yang, Hongyu Liu, Fangxin Liu, Hailong Yang, Qianwen Cao, Qingxiao Sun

Main category: cs.LG

TL;DR: STOF is a framework that optimizes Sparse Transformer performance on GPUs through flexible masking and operator fusion, achieving up to 1.7x speedup in MHA computation and 1.5x in end-to-end inference.

DetailsMotivation: Previous works rarely focus on performance optimization of sparse Transformers, and rule-based mechanisms ignore fusion opportunities of mixed-type operators while failing to adapt to various sequence lengths.

Method: Unify storage format and kernel implementation for multi-head attention, map fusion schemes to compilation templates, and use a two-stage search engine to determine optimal parameter settings.

Result: STOF achieves maximum speedups of 1.7x in MHA computation and 1.5x in end-to-end inference compared to state-of-the-art work.

Conclusion: The proposed STOF framework effectively addresses performance optimization challenges in sparse Transformers through systematic kernel unification and intelligent fusion scheme selection.

Abstract: Large language models are popular around the world due to their powerful understanding capabilities. As the core component of LLMs, accelerating Transformer through parallelization has gradually become a hot research topic. Mask layers introduce sparsity into Transformer to reduce calculations. However, previous works rarely focus on the performance optimization of sparse Transformer. Moreover, rule-based mechanisms ignore the fusion opportunities of mixed-type operators and fail to adapt to various sequence lengths. To address the above problems, we propose STOF, a framework that incorporates optimizations for Sparse Transformer via flexible masking and operator fusion on GPU. We firstly unify the storage format and kernel implementation for the multi-head attention. Then, we map fusion schemes to compilation templates and determine the optimal parameter setting through a two-stage search engine. The experimental results show that compared to the state-of-the-art work, STOF achieves maximum speedups of 1.7x in MHA computation and 1.5x in end-to-end inference.

[377] Tensor Program Optimization for the RISC-V Vector Extension Using Probabilistic Programs

Federico Nicolas Peccia, Frederik Haxel, Oliver Bringmann

Main category: cs.LG

TL;DR: TVM compiler workflow with RISC-V Vector Extension integration achieves 46% faster execution than GCC autovectorization and 29% better than muRISCV-NN, with smaller code footprint.

DetailsMotivation: RISC-V RVV extension shows promise for AI acceleration, but lacks efficient autotuning frameworks, requiring programmers to rely on limited compiler autovectorization or hand-crafted libraries.

Method: Integrated RISC-V RVV extension into TVM’s MetaSchedule framework, a probabilistic program framework for tensor operation tuning, and implemented different RISC-V SoCs on FPGA for evaluation.

Result: 46% mean improvement in execution latency vs GCC autovectorization, 29% vs muRISCV-NN, smaller code memory footprint, and 35% faster than LLVM on commercial RISC-V SoC.

Conclusion: The TVM-based workflow successfully enables efficient mapping of AI workloads onto RISC-V vector units without hand-crafted libraries, demonstrating significant performance improvements and suitability for embedded devices.

Abstract: RISC-V provides a flexible and scalable platform for applications ranging from embedded devices to high-performance computing clusters. Particularly, its RISC-V Vector Extension (RVV) becomes of interest for the acceleration of AI workloads. But writing software that efficiently utilizes the vector units of RISC-V CPUs without expert knowledge requires the programmer to rely on the autovectorization features of compilers or hand-crafted libraries like muRISCV-NN. Smarter approaches, like autotuning frameworks, have been missing the integration with the RISC-V RVV extension, thus heavily limiting the efficient deployment of complex AI workloads. In this paper, we present a workflow based on the TVM compiler to efficiently map AI workloads onto RISC-V vector units. Instead of relying on hand-crafted libraries, we integrated the RVV extension into TVM’s MetaSchedule framework, a probabilistic program framework for tensor operation tuning. We implemented different RISC-V SoCs on an FPGA and tuned a wide range of AI workloads on them. We found that our proposal shows a mean improvement of 46% in execution latency when compared against the autovectorization feature of GCC, and 29% against muRISCV-NN. Moreover, the binary resulting from our proposal has a smaller code memory footprint, making it more suitable for embedded devices. Finally, we also evaluated our solution on a commercially available RISC-V SoC implementing the RVV 1.0 Vector Extension and found our solution is able to find mappings that are 35% faster on average than the ones proposed by LLVM. We open-sourced our proposal for the community to expand it to target other RISC-V extensions.

[378] SymMatika: Structure-Aware Symbolic Discovery

Michael Scherk, Boyuan Chen

Main category: cs.LG

TL;DR: SymMatika is a hybrid symbolic regression algorithm that combines genetic programming with reusable motif libraries to discover both explicit and implicit mathematical expressions from data, achieving state-of-the-art performance on benchmark problems.

DetailsMotivation: Existing symbolic regression methods either focus on explicit mappings or implicit relations, but few support both. Most approaches treat expression candidates in isolation without reusing recurring structural patterns that could accelerate search.

Method: SymMatika uses multi-island genetic programming combined with a reusable motif library inspired by biological sequence analysis. It identifies high-impact substructures in top-performing candidates and reintroduces them to guide future generations, incorporating a feedback-driven evolutionary engine and supporting both explicit and implicit relation discovery using implicit-derivative metrics.

Result: SymMatika achieves state-of-the-art recovery rates on Nguyen and Feynman benchmark suites, with an impressive 61% recovery rate on Nguyen-12 (compared to next best 2%), and strong placement on error-complexity Pareto fronts on Feynman equations and 57 SRBench Black-box problems.

Conclusion: The results demonstrate the power of structure-aware evolutionary search for scientific discovery. The full SymMatika framework has been open-sourced to support broader research in interpretable modeling and symbolic discovery.

Abstract: Symbolic regression (SR) seeks to recover closed-form mathematical expressions that describe observed data. While existing methods have advanced the discovery of either explicit mappings (i.e., $y = f(\mathbf{x})$) or implicit relations (i.e., $F(\mathbf{x}, y)=0$), few modern and accessible frameworks support both. Moreover, most approaches treat each expression candidate in isolation, without reusing recurring structural patterns that could accelerate search. We introduce SymMatika, a hybrid SR algorithm that combines multi-island genetic programming (GP) with a reusable motif library inspired by biological sequence analysis. SymMatika identifies high-impact substructures in top-performing candidates and reintroduces them to guide future generations. Additionally, it incorporates a feedback-driven evolutionary engine and supports both explicit and implicit relation discovery using implicit-derivative metrics. Across benchmarks, SymMatika achieves state-of-the-art recovery rates on the Nguyen and Feynman benchmark suites, an impressive recovery rate of 61% on Nguyen-12 compared to the next best 2%, and strong placement on the error-complexity Pareto fronts on the Feynman equations and on a subset of 57 SRBench Black-box problems. Our results demonstrate the power of structure-aware evolutionary search for scientific discovery. To support broader research in interpretable modeling and symbolic discovery, we have open-sourced the full SymMatika framework.

[379] Identify, Isolate, and Purge: Mitigating Hallucinations in LVLMs via Self-Evolving Distillation

Wenhao Li, Xiu Su, Jingyi Wu, Feng Yang, Yang Liu, Yi Chen, Shan You, Chang Xu

Main category: cs.LG

TL;DR: SEED (Self-Evolving Distillation) is a novel method that identifies and purges hallucinations within LVLMs’ internal knowledge, then distills purified knowledge back into the model through mode-seeking distillation and a Hallucination Elimination Adapter.

DetailsMotivation: Large Vision-Language Models suffer from hallucination issues that limit their credibility and application potential. Existing mitigation methods rely on external tools or multi-round inference, which significantly increase inference time.

Method: Proposes SEED framework that: 1) identifies hallucinations within LVLMs’ inner knowledge, 2) isolates and purges them, 3) uses Mode-Seeking Evolving approach to distill purified knowledge while avoiding void spaces, 4) implements Hallucination Elimination Adapter to correct dark knowledge.

Result: Extensive experiments show substantial improvements in mitigating hallucinations. LLaVA-1.5’s F1 score on POPE-Random improved from 81.3 to 88.3. Demonstrated effectiveness across multiple benchmarks for models like LLaVA-1.5 and InternVL2.

Conclusion: SEED provides an effective self-evolution framework that significantly reduces hallucinations in LVLMs without relying on external tools or increasing inference time, making LVLMs more credible and practical for real-world applications.

Abstract: Large Vision-Language Models (LVLMs) have demonstrated remarkable advancements in numerous areas such as multimedia. However, hallucination issues significantly limit their credibility and application potential. Existing mitigation methods typically rely on external tools or the comparison of multi-round inference, which significantly increase inference time. In this paper, we propose SElf-Evolving Distillation (SEED), which identifies hallucinations within the inner knowledge of LVLMs, isolates and purges them, and then distills the purified knowledge back into the model, enabling self-evolution. Furthermore, we identified that traditional distillation methods are prone to inducing void spaces in the output space of LVLMs. To address this issue, we propose a Mode-Seeking Evolving approach, which performs distillation to capture the dominant modes of the purified knowledge distribution, thereby avoiding the chaotic results that could emerge from void spaces. Moreover, we introduce a Hallucination Elimination Adapter, which corrects the dark knowledge of the original model by learning purified knowledge. Extensive experiments on multiple benchmarks validate the superiority of our SEED, demonstrating substantial improvements in mitigating hallucinations for representative LVLM models such as LLaVA-1.5 and InternVL2. Remarkably, the F1 score of LLaVA-1.5 on the hallucination evaluation metric POPE-Random improved from 81.3 to 88.3.

[380] Penalizing Infeasible Actions and Reward Scaling in Reinforcement Learning with Offline Data

Jeonghye Kim, Yongjae Shin, Whiyoung Jung, Sunghoon Hong, Deunsol Yoon, Youngchul Sung, Kanghoon Lee, Woohyung Lim

Main category: cs.LG

TL;DR: PARS algorithm addresses Q-value extrapolation errors in offline RL by combining reward scaling with layer normalization and penalization of infeasible actions, achieving state-of-the-art performance on D4RL benchmark.

DetailsMotivation: Offline reinforcement learning suffers from Q-value extrapolation errors when the Q-function is extended beyond the available data range, which can lead to poor performance and instability.

Method: Proposes PARS algorithm that combines two techniques: 1) Reward Scaling with Layer Normalization (RS-LN) to guide gradual decrease of Q-values outside data range, and 2) Penalization mechanism for infeasible actions (PA) to prevent overestimation of impossible actions.

Result: Demonstrates superior performance compared to state-of-the-art algorithms on D4RL benchmark, particularly excelling in the challenging AntMaze Ultra task, in both offline training and online fine-tuning scenarios.

Conclusion: The PARS algorithm effectively mitigates Q-value extrapolation errors in offline RL through careful guidance of Q-value behavior outside the data distribution, providing a robust solution for offline-to-online reinforcement learning.

Abstract: Reinforcement learning with offline data suffers from Q-value extrapolation errors. To address this issue, we first demonstrate that linear extrapolation of the Q-function beyond the data range is particularly problematic. To mitigate this, we propose guiding the gradual decrease of Q-values outside the data range, which is achieved through reward scaling with layer normalization (RS-LN) and a penalization mechanism for infeasible actions (PA). By combining RS-LN and PA, we develop a new algorithm called PARS. We evaluate PARS across a range of tasks, demonstrating superior performance compared to state-of-the-art algorithms in both offline training and online fine-tuning on the D4RL benchmark, with notable success in the challenging AntMaze Ultra task.
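
The following schematic sketch shows the two ingredients in isolation: a critic with layer normalization trained on scaled rewards (RS-LN), and an auxiliary penalty target for actions sampled outside the feasible box (PA). All constants, dimensions, and the action-sampling scheme are placeholders; the paper's exact reward-scaling rule and penalty may differ.

```python
# Schematic sketch of RS-LN (layer-normalized critic, scaled rewards) and PA
# (penalty target for infeasible actions). Hyperparameters are placeholders.
import torch
import torch.nn as nn

class Critic(nn.Module):
    def __init__(self, s_dim=17, a_dim=6, h=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(s_dim + a_dim, h), nn.LayerNorm(h), nn.ReLU(),
            nn.Linear(h, h), nn.LayerNorm(h), nn.ReLU(),
            nn.Linear(h, 1))
    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

critic, target_critic = Critic(), Critic()
target_critic.load_state_dict(critic.state_dict())
opt = torch.optim.Adam(critic.parameters(), lr=3e-4)

reward_scale, gamma, penalty = 10.0, 0.99, -100.0    # placeholder hyperparameters

# Toy offline batch (state, action, reward, next state, next action).
s, a, r, s2, a2 = (torch.randn(64, 17), torch.rand(64, 6) * 2 - 1,
                   torch.randn(64, 1), torch.randn(64, 17), torch.rand(64, 6) * 2 - 1)

with torch.no_grad():
    td_target = reward_scale * r + gamma * target_critic(s2, a2)   # scaled-reward target

# PA: push Q-values of actions sampled outside the feasible box toward a low constant.
infeasible_a = torch.rand(64, 6) * 4 - 2
infeasible_a = infeasible_a[infeasible_a.abs().max(dim=1).values > 1.0]

loss = nn.functional.mse_loss(critic(s, a), td_target)
if len(infeasible_a) > 0:
    q_inf = critic(s[: len(infeasible_a)], infeasible_a)
    loss = loss + nn.functional.mse_loss(q_inf, torch.full_like(q_inf, penalty))

opt.zero_grad(); loss.backward(); opt.step()
```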

[381] Spatial-Temporal Transformer with Curriculum Learning for EEG-Based Emotion Recognition

Xuetao Lin, Tianhao Peng, Peihong Dai, Yu Liang, Wenjun Wu

Main category: cs.LG

TL;DR: SST-CL framework combines spatial-temporal transformers with curriculum learning for EEG-based emotion recognition, achieving state-of-the-art performance by effectively modeling spatial-temporal neural patterns and adapting to emotional intensity variations.

DetailsMotivation: EEG-based emotion recognition faces challenges in integrating non-stationary spatial-temporal neural patterns and adapting to dynamic emotional intensity variations in real-world scenarios.

Method: Proposes SST-CL framework with spatial encoder for inter-channel relationships and temporal encoder with windowed attention for multi-scale dependencies, plus intensity-aware curriculum learning with dynamic sample scheduling based on dual difficulty assessment.

Result: Achieves state-of-the-art performance across various emotional intensity levels on three benchmark datasets, with ablation studies confirming the necessity of both architectural components and curriculum learning.

Conclusion: The proposed SST-CL framework effectively addresses key challenges in EEG emotion recognition by integrating spatial-temporal transformers with curriculum learning, demonstrating robust performance across intensity variations.

Abstract: EEG-based emotion recognition plays an important role in developing adaptive brain-computer communication systems, yet faces two fundamental challenges in practical implementations: (1) effective integration of non-stationary spatial-temporal neural patterns, (2) robust adaptation to dynamic emotional intensity variations in real-world scenarios. This paper proposes SST-CL, a novel framework integrating spatial-temporal transformers with curriculum learning. Our method introduces two core components: a spatial encoder that models inter-channel relationships and a temporal encoder that captures multi-scale dependencies through windowed attention mechanisms, enabling simultaneous extraction of spatial correlations and temporal dynamics from EEG signals. Complementing this architecture, an intensity-aware curriculum learning strategy progressively guides training from high-intensity to low-intensity emotional states through dynamic sample scheduling based on a dual difficulty assessment. Comprehensive experiments on three benchmark datasets demonstrate state-of-the-art performance across various emotional intensity levels, with ablation studies confirming the necessity of both architectural components and the curriculum learning mechanism.

[382] PinFM: Foundation Model for User Activity Sequences at a Billion-scale Visual Discovery Platform

Xiangyi Chen, Kousik Rajesh, Matthew Lawhon, Zelun Wang, Hanyu Li, Haomiao Li, Saurabh Vishwas Joshi, Pong Eksombatchai, Jaewon Yang, Yi-Ping Hsu, Jiajing Xu, Charles Rosenberg

Main category: cs.LG

TL;DR: PinFM is a billion-parameter transformer model pretrained on user activity sequences for recommender systems, achieving 600% throughput improvement and 20% engagement increase with new items.

DetailsMotivation: User activity sequences are critical signals in recommender systems, but applying pretraining-fine-tuning approaches from NLP/Vision to industrial systems faces scalability, cost, latency, and cold-start challenges.

Method: Pretrained transformer with 20B+ parameters using extensive user activity data, fine-tuned for specific applications. Developed Deduplicated Cross-Attention Transformer (DCAT) and infrastructure optimizations to handle millions of items per second with tight constraints.

Result: 600% throughput improvement on Pinterest data, 20% increase in engagement with new items by learning interactions between user sequences and candidate items through input sequence alterations.

Conclusion: PinFM successfully addresses industrial recommender system challenges and is deployed to improve experience for over half billion users across various applications, demonstrating the viability of foundational models in production recommendation systems.

Abstract: User activity sequences have emerged as one of the most important signals in recommender systems. We present a foundational model, PinFM, for understanding user activity sequences across multiple applications at a billion-scale visual discovery platform. We pretrain a transformer model with 20B+ parameters using extensive user activity data, then fine-tune it for specific applications, efficiently coupling it with existing models. While this pretraining-and-fine-tuning approach has been popular in other domains, such as Vision and NLP, its application in industrial recommender systems presents numerous challenges. The foundational model must be scalable enough to score millions of items every second while meeting tight cost and latency constraints imposed by these systems. Additionally, it should capture the interactions between user activities and other features and handle new items that were not present during the pretraining stage. We developed innovative techniques to address these challenges. Our infrastructure and algorithmic optimizations, such as the Deduplicated Cross-Attention Transformer (DCAT), improved our throughput by 600% on Pinterest internal data. We demonstrate that PinFM can learn interactions between user sequences and candidate items by altering input sequences, leading to a 20% increase in engagement with new items. PinFM is now deployed to help improve the experience of more than a half billion users across various applications.

[383] Improving DAPO from a Mixed-Policy Perspective

Hongze Tan, Yuchen Li

Main category: cs.LG

TL;DR: Two novel modifications to DAPO algorithm using mixed-policy approach: 1) incorporating pre-trained guiding policy for off-policy experience to improve stability, 2) reusing zero-reward samples guided by expert policy to enhance sample efficiency.

DetailsMotivation: Standard policy gradient methods suffer from instability and sample inefficiency, especially in sparse reward settings. The paper aims to address these limitations through mixed-policy approaches.

Method: 1) Incorporates a pre-trained, stable guiding policy (πφ) to provide off-policy experience and regularize target-policy (πon) training with an adaptive learning step size. 2) Re-utilizes zero-reward samples as a distinct batch guided by the expert policy instead of discarding them.

Result: Theoretical analysis shows both methods' objective functions converge to the optimal solution within the established RL framework. The mixed-policy approach improves training stability, convergence speed, and sample efficiency.

Conclusion: The proposed mixed-policy framework effectively balances exploration and exploitation, enabling more stable and efficient policy optimization in reinforcement learning.

Abstract: This paper introduces two novel modifications to the Dynamic sAmpling Policy Optimization (DAPO) algorithm [1], approached from a mixed-policy perspective. Standard policy gradient methods can suffer from instability and sample inefficiency, particularly in sparse reward settings. To address this, we first propose a method that incorporates a pre-trained, stable guiding policy ($\pi_\phi$) to provide off-policy experience, thereby regularizing the training of the target policy ($\pi_{\text{on}}$). This approach improves training stability and convergence speed by adaptively adjusting the learning step size. Secondly, we extend this idea to re-utilize zero-reward samples, which are often discarded by dynamic sampling strategies like DAPO's. By treating these samples as a distinct batch guided by the expert policy, we further enhance sample efficiency. We provide a theoretical analysis for both methods, demonstrating that their objective functions converge to the optimal solution within the established theoretical framework of reinforcement learning. The proposed mixed-policy framework effectively balances exploration and exploitation, promising more stable and efficient policy optimization.

[384] Dynamics-Informed Reservoir Computing with Visibility Graphs

Charlotte Geier, Rasha Shanaz, Merten Stender

Main category: cs.LG

TL;DR: Proposes Dynamics-Informed Reservoir Computing (DyRC) using visibility graphs to construct reservoir networks directly from training data, improving prediction accuracy and consistency compared to random reservoir architectures.

DetailsMotivation: Traditional reservoir computing uses random reservoir networks that are often suboptimal and oversized with poorly understood dynamics, leading to inconsistent performance in complex time series prediction.

Method: Uses visibility graph (VG) technique to convert time series data into networks, then constructs reservoir network directly from the VG structure of training data, avoiding expensive hyperparameter tuning.

Result: DyRC-VG shows higher prediction quality and more consistent performance compared to Erdős-Rényi graphs of same size/spectral radius/density in nonlinear Duffing oscillator prediction tasks.

Conclusion: Constructing reservoir networks directly from training data dynamics using visibility graphs provides more optimal and consistent performance than random reservoir architectures in complex time series prediction.

Abstract: Accurate prediction of complex and nonlinear time series remains a challenging problem across engineering and scientific disciplines. Reservoir computing (RC) offers a computationally efficient alternative to traditional deep learning by training only the read-out layer while employing a randomly structured and fixed reservoir network. Despite its advantages, the largely random reservoir graph architecture often results in suboptimal and oversized networks with poorly understood dynamics. Addressing this issue, we propose a novel Dynamics-Informed Reservoir Computing (DyRC) framework that systematically infers the reservoir network structure directly from the input training sequence. This work proposes to employ the visibility graph (VG) technique, which converts time series data into networks by representing measurement points as nodes linked by mutual visibility. The reservoir network is constructed by directly adopting the VG network from a training data sequence, leveraging the parameter-free visibility graph approach to avoid expensive hyperparameter tuning. This process results in a reservoir that is directly informed by the specific dynamics of the prediction task under study. We assess the DyRC-VG method through prediction tasks involving the canonical nonlinear Duffing oscillator, evaluating prediction accuracy and consistency. Compared to an Erdős–Rényi (ER) graph of the same size, spectral radius, and fixed density, we observe higher prediction quality and more consistent performance over repeated implementations in the DyRC-VG. An ER graph with density matched to the DyRC-VG can in some conditions outperform both approaches.
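
The natural visibility graph itself has a standard parameter-free definition: samples i and j are linked if every intermediate sample lies strictly below the straight line joining them. A minimal O(n²) construction is shown below; wiring the resulting adjacency matrix into a reservoir and training the read-out are not covered.

```python
# Natural visibility graph of a time series: nodes are samples, and i--j are
# linked if every intermediate point lies strictly below the segment (i, j).
import numpy as np

def visibility_graph(x: np.ndarray) -> np.ndarray:
    """Return the symmetric adjacency matrix of the natural visibility graph."""
    n = len(x)
    A = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            t = np.arange(i + 1, j)
            # Height of the i--j line at the intermediate times t.
            line = x[j] + (x[i] - x[j]) * (j - t) / (j - i)
            if np.all(x[t] < line):
                A[i, j] = A[j, i] = 1
    return A

# Example: visibility graph of a short noisy oscillation.
ts = np.sin(np.linspace(0, 6 * np.pi, 200)) \
     + 0.1 * np.random.default_rng(0).standard_normal(200)
A = visibility_graph(ts)
print("nodes:", A.shape[0], "edges:", A.sum() // 2)
```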

[385] FlowState: Sampling Rate Invariant Time Series Forecasting

Lars Graf, Thomas Ortner, Stanisław Woźniak, Angeliki Pantazi

Main category: cs.LG

TL;DR: FlowState is a novel time series foundation model that uses state space model encoder and functional basis decoder to enable continuous-time modeling and dynamic time-scale adjustment, achieving state-of-the-art performance with smaller size and better efficiency.

DetailsMotivation: Existing time series foundation models based on transformers struggle with generalization across varying context/target lengths, lack adaptability to different sampling rates, and are computationally inefficient.

Method: Uses state space model (SSM) based encoder and functional basis decoder for continuous-time modeling and dynamic time-scale adjustment. Includes efficient pretraining strategy for improved robustness.

Result: Outperforms all other models on GIFT-ZS and Chronos-ZS benchmarks despite being the smallest model. Shows unique ability to adapt online to varying input sampling rates.

Conclusion: FlowState addresses key limitations of existing TSFMs through its innovative architecture, enabling better generalization, reduced data requirements, and improved computational efficiency while achieving state-of-the-art performance.

Abstract: Foundation models (FMs) have transformed natural language processing, but their success has not yet translated to time series forecasting. Existing time series foundation models (TSFMs), often based on transformer variants, struggle with generalization across varying context and target lengths, lack adaptability to different sampling rates, and are computationally inefficient. We introduce FlowState, a novel TSFM architecture that addresses these challenges through two key innovations: a state space model (SSM) based encoder and a functional basis decoder. This design enables continuous-time modeling and dynamic time-scale adjustment, allowing FlowState to inherently generalize across all possible temporal resolutions, and dynamically adjust the forecasting horizons. In contrast to other state-of-the-art TSFMs, which require training data across all possible sampling rates to memorize patterns at each scale, FlowState inherently adapts its internal dynamics to the input scale, enabling smaller models, reduced data requirements, and improved efficiency. We further propose an efficient pretraining strategy that improves robustness and accelerates training. Despite being the smallest model, FlowState outperforms all other models and is state-of-the-art for the GIFT-ZS and the Chronos-ZS benchmarks. Ablation studies confirm the effectiveness of its components, and we demonstrate its unique ability to adapt online to varying input sampling rates.

[386] Clus-UCB: A Near-Optimal Algorithm for Clustered Bandits

Aakash Gore, Prasanna Chaporkar

Main category: cs.LG

TL;DR: A novel multi-armed bandit algorithm called Clus-UCB that exploits known cluster structures to improve regret bounds and performance by sharing information among arms within clusters.

DetailsMotivation: Traditional bandit algorithms don't leverage known clustering structures where arms within clusters have similar mean rewards. This work aims to exploit this structure to achieve better regret bounds and performance.

Method: Proposed Clus-UCB algorithm that introduces a new index for evaluating arms that depends on other arms within the same cluster, enabling information sharing among clustered arms.

Result: Derived an improved asymptotic lower bound on regret compared to classical Lai & Robbins bound. Clus-UCB closely matches this bound asymptotically and outperforms KL-UCB and other algorithms in simulations.

Conclusion: The clustering structure can be effectively exploited to improve bandit algorithm performance. Future research should address current limitations and explore additional applications of cluster-aware bandit algorithms.

Abstract: We study a stochastic multi-armed bandit setting where arms are partitioned into known clusters, such that the mean rewards of arms within a cluster differ by at most a known threshold. While the clustering structure is known a priori, the arm means are unknown. We derive an asymptotic lower bound on the regret that improves upon the classical bound of Lai & Robbins (1985). We then propose Clus-UCB, an efficient algorithm that closely matches this lower bound asymptotically. Clus-UCB is designed to exploit the clustering structure and introduces a new index to evaluate an arm, which depends on other arms within the cluster. In this way, arms share information among each other. We present simulation results of our algorithm and compare its performance against KL-UCB and other well-known algorithms for bandits with dependent arms. Finally, we address some limitations of this work and conclude by mentioning some possible future research.
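
The abstract does not spell out the Clus-UCB index, so the sketch below only illustrates one plausible way to share information within a cluster: an arm's upper confidence bound is capped by the smallest clustermate bound shifted by the known intra-cluster threshold. This is an illustrative cluster-aware UCB, not the paper's algorithm.

```python
# Illustrative cluster-aware UCB (not the exact Clus-UCB index): an arm's bound
# is capped by the best clustermate's bound plus the known intra-cluster gap eps.
import numpy as np

rng = np.random.default_rng(0)
means = np.array([0.50, 0.52, 0.55, 0.80, 0.82])   # true means (unknown to the learner)
clusters = np.array([0, 0, 0, 1, 1])               # known cluster labels
eps = 0.05                                         # known intra-cluster gap
T, K = 20000, len(means)

counts = np.ones(K)
sums = rng.binomial(1, means).astype(float)        # pull each arm once
for t in range(K, T):
    mu_hat = sums / counts
    ucb = mu_hat + np.sqrt(2 * np.log(t + 1) / counts)
    capped = ucb.copy()
    for a in range(K):
        mates = clusters == clusters[a]
        # Any clustermate's mean is within eps, so its UCB + eps also bounds arm a.
        capped[a] = min(ucb[a], np.min(ucb[mates]) + eps)
    a = int(np.argmax(capped))
    sums[a] += rng.binomial(1, means[a])
    counts[a] += 1

regret = T * means.max() - sums.sum()
print("pulls per arm:", counts.astype(int), "| realized regret ~", round(regret, 1))
```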

[387] Early Detection of Pancreatic Cancer Using Multimodal Learning on Electronic Health Records

Mosbah Aouad, Anirudh Choudhary, Awais Farooq, Steven Nevers, Lusine Demirkhanyan, Bhrandon Harris, Suguna Pappu, Christopher Gondi, Ravishankar Iyer

Main category: cs.LG

TL;DR: Multimodal deep learning approach combining lab time series and diagnosis codes for early pancreatic cancer detection up to 1 year before clinical diagnosis, achieving 6.5-15.5% AUC improvement over state-of-the-art methods.

DetailsMotivation: Pancreatic ductal adenocarcinoma (PDAC) has high mortality due to late detection, with no reliable biomarkers or specific symptoms for early diagnosis.

Method: Combines neural controlled differential equations for irregular lab time series, pretrained language models and recurrent networks for diagnosis code trajectories, and cross-attention mechanisms for multimodal integration.

Result: Achieved significant AUC improvements (6.5-15.5%) over state-of-the-art methods on 4,700-patient dataset, identifying both established and new biomarkers for PDAC risk.

Conclusion: The multimodal approach enables early PDAC detection up to one year before clinical diagnosis and provides interpretable risk factors, potentially improving patient outcomes through earlier intervention.

Abstract: Pancreatic ductal adenocarcinoma (PDAC) is one of the deadliest cancers, and early detection remains a major clinical challenge due to the absence of specific symptoms and reliable biomarkers. In this work, we propose a new multimodal approach that integrates longitudinal diagnosis code histories and routinely collected laboratory measurements from electronic health records to detect PDAC up to one year prior to clinical diagnosis. Our method combines neural controlled differential equations to model irregular lab time series, pretrained language models and recurrent networks to learn diagnosis code trajectory representations, and cross-attention mechanisms to capture interactions between the two modalities. We develop and evaluate our approach on a real-world dataset of nearly 4,700 patients and achieve significant improvements in AUC ranging from 6.5% to 15.5% over state-of-the-art methods. Furthermore, our model identifies diagnosis codes and laboratory panels associated with elevated PDAC risk, including both established and new biomarkers. Our code is available at https://github.com/MosbahAouad/EarlyPDAC-MML.
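
A minimal sketch of the cross-attention fusion step between the two modality encodings; the actual encoders (neural controlled differential equations for labs, a pretrained language model plus recurrent network for diagnosis codes) are stubbed out with random tensors, and all dimensions are placeholders.

```python
# Cross-attention fusion sketch: lab-series tokens attend to diagnosis-code
# tokens (and vice versa), then pooled features feed a risk classifier.
import torch
import torch.nn as nn

d_model = 64
lab_tokens = torch.randn(8, 24, d_model)    # 8 patients, 24 lab time points (stub encoder output)
code_tokens = torch.randn(8, 50, d_model)   # 50 diagnosis-code trajectory tokens (stub)

lab_to_code = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
code_to_lab = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
classifier = nn.Linear(2 * d_model, 1)

lab_ctx, _ = lab_to_code(query=lab_tokens, key=code_tokens, value=code_tokens)
code_ctx, _ = code_to_lab(query=code_tokens, key=lab_tokens, value=lab_tokens)

fused = torch.cat([lab_ctx.mean(dim=1), code_ctx.mean(dim=1)], dim=-1)
risk_logit = classifier(fused)              # one risk score per patient
print(risk_logit.shape)                     # torch.Size([8, 1])
```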

[388] Towards Theoretical Understanding of Transformer Test-Time Computing: Investigation on In-Context Linear Regression

Xingwu Chen, Miao Lu, Beining Wu, Difan Zou

Main category: cs.LG

TL;DR: This paper develops a theoretical framework to analyze language model inference techniques that use more test-time computation, focusing on in-context linear regression with noise injection and sampling.

DetailsMotivation: To bridge the gap between practical language model inference (using techniques like generating multiple thoughts or sampling answers) and theoretical transformer analysis by incorporating randomness and sampling into the analytical framework.

Method: Develops a theoretical framework that simulates language model decoding through noise injection and binary coefficient sampling, focusing specifically on in-context linear regression with continuous/binary coefficients.

Result: The framework provides detailed analyses of widely adopted inference techniques, with empirical results supporting the theoretical findings.

Conclusion: The theoretical framework demonstrates potential for offering new insights into understanding inference behaviors in real-world language models, bridging practical inference methods with theoretical analysis.

Abstract: Using more test-time computation during language model inference, such as generating more intermediate thoughts or sampling multiple candidate answers, has proven effective in significantly improving model performance. This paper takes an initial step toward bridging the gap between practical language model inference and theoretical transformer analysis by incorporating randomness and sampling. We focus on in-context linear regression with continuous/binary coefficients, where our framework simulates language model decoding through noise injection and binary coefficient sampling. Through this framework, we provide detailed analyses of widely adopted inference techniques. Supported by empirical results, our theoretical framework and analysis demonstrate the potential for offering new insights into understanding inference behaviors in real-world language models.
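
A toy version of the analyzed setting: in-context linear regression where the decoded answer is perturbed by injected noise and several candidates are sampled, here aggregated by simple averaging (one possible aggregation; the paper's analysis treats the inference techniques more generally).

```python
# In-context linear regression with noise-injected "decoding" and multi-sample
# aggregation (a toy version of the analyzed setting, not the paper's theory).
import numpy as np

rng = np.random.default_rng(0)
d, n_ctx = 8, 32
w_true = rng.standard_normal(d)

X = rng.standard_normal((n_ctx, d))               # in-context examples
y = X @ w_true + 0.1 * rng.standard_normal(n_ctx)
x_query = rng.standard_normal(d)

w_hat = np.linalg.lstsq(X, y, rcond=None)[0]      # idealized in-context solution

def sample_answer(temperature=0.5):
    """Simulate stochastic decoding by injecting noise into the prediction."""
    return x_query @ w_hat + temperature * rng.standard_normal()

candidates = np.array([sample_answer() for _ in range(16)])   # test-time sampling
print("single-sample error:", abs(candidates[0] - x_query @ w_true))
print("averaged error     :", abs(candidates.mean() - x_query @ w_true))
```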

[389] To Theoretically Understand Transformer-Based In-Context Learning for Optimizing CSMA

Shugang Hao, Hongbo Li, Lingjie Duan

Main category: cs.LG

TL;DR: LLM transformer-based in-context learning approach for optimizing WiFi channel access, achieving near-optimal throughput under unknown node densities.

DetailsMotivation: Traditional binary exponential backoff and model-based approaches perform poorly under dynamic channel environments due to inaccurate node density estimation, leading to significant throughput loss.

Method: Transformer-based ICL optimizer that collects collision-threshold data examples and query collision cases as prompts, then generates predicted contention window thresholds. Includes training algorithm for near-optimal prediction and handles erroneous data inputs.

Result: Proven minimal prediction and throughput deviations from optimal values. Experimental results show fast convergence and near-optimal throughput outperforming existing model-based and DRL-based approaches.

Conclusion: Transformer-based ICL provides an effective solution for WiFi channel access optimization that handles unknown node densities and imperfect data, achieving superior performance over traditional methods.

Abstract: The binary exponential backoff scheme is widely used in WiFi 7 and still incurs poor throughput performance under dynamic channel environments. Recent model-based approaches (e.g., non-persistent and $p$-persistent CSMA) simply optimize backoff strategies under a known and fixed node density, still leading to a large throughput loss due to inaccurate node density estimation. This paper is the first to propose LLM transformer-based in-context learning (ICL) theory for optimizing channel access. We design a transformer-based ICL optimizer to pre-collect collision-threshold data examples and a query collision case. They are constructed as a prompt as the input for the transformer to learn the pattern, which then generates a predicted contention window threshold (CWT). To train the transformer for effective ICL, we develop an efficient algorithm and guarantee a near-optimal CWT prediction within limited training steps. As it may be hard to gather perfect data examples for ICL in practice, we further extend to allow erroneous data input in the prompt. We prove that our optimizer maintains minimal prediction and throughput deviations from the optimal values. Experimental results on NS-3 further demonstrate our approach’s fast convergence and near-optimal throughput over existing model-based and DRL-based approaches under unknown node densities.

[390] Fine-Grained Safety Neurons with Training-Free Continual Projection to Reduce LLM Fine Tuning Risks

Bing Han, Feifei Zhao, Dongcheng Zhao, Guobin Shen, Ping Wu, Yu Shi, Yi Zeng

Main category: cs.LG

TL;DR: FGSN method uses fine-grained safety neuron identification and projection to reduce safety risks in fine-tuned LLMs while preserving utility.

DetailsMotivation: Existing post-fine-tuning defenses rely on coarse-grained safety layer mapping, lacking comprehensive consideration of both safety layers and fine-grained neurons, limiting their ability to balance safety and utility effectively.

Method: Propose Fine-Grained Safety Neurons (FGSN) with Training-Free Continual Projection method that integrates multi-scale interactions between safety layers and neurons, localizes precise safety neurons, and projects parameters onto safety directions while minimizing interference with downstream tasks.

Result: Extensive experiments show significant reduction in harmfulness scores and attack success rates with minimal parameter modifications while preserving model utility. Achieves continual defense and generalization against unforeseen safety concerns.

Conclusion: FGSN provides an effective approach for maintaining safety in fine-tuned LLMs through fine-grained neuron analysis and projection, offering better safety-utility balance than coarse-grained methods.

Abstract: Fine-tuning as service injects domain-specific knowledge into large language models (LLMs), while challenging the original alignment mechanisms and introducing safety risks. A series of defense strategies have been proposed for the alignment, fine-tuning, and post-fine-tuning phases, where most post-fine-tuning defenses rely on coarse-grained safety layer mapping. These methods lack a comprehensive consideration of both safety layers and fine-grained neurons, limiting their ability to efficiently balance safety and utility. To address this, we propose the Fine-Grained Safety Neurons (FGSN) with Training-Free Continual Projection method to reduce the fine-tuning safety risks. FGSN inherently integrates the multi-scale interactions between safety layers and neurons, localizing sparser and more precise fine-grained safety neurons while minimizing interference with downstream task neurons. We then project the safety neuron parameters onto safety directions, improving model safety while aligning more closely with human preferences. Extensive experiments across multiple fine-tuned LLM models demonstrate that our method significantly reduce harmfulness scores and attack success rates with minimal parameter modifications, while preserving the model’s utility. Furthermore, by introducing a task-specific, multi-dimensional heterogeneous safety neuron cluster optimization mechanism, we achieve continual defense and generalization capability against unforeseen emerging safety concerns.

[391] An Explainable AI based approach for Monitoring Animal Health

Rahul Jana, Shubham Dixit, Mrityunjay Sharma, Ritesh Kumar

Main category: cs.LG

TL;DR: This paper presents an explainable machine learning system using IoT sensors and accelerometer data to monitor dairy cattle health and activity, achieving high accuracy with k-nearest neighbor classifier and using SHAP for model interpretability.

DetailsMotivation: Dairy farmers face challenges in tracking cattle health and optimizing yield due to difficulties in monitoring all animals on farms, necessitating modern data-driven solutions.

Method: Uses Bluetooth IoT devices with 3-axis accelerometer sensors and 4G networks for data transmission. Applies signal processing, statistical feature extraction, and sliding window techniques. Evaluates hyperparameter-optimized ML models with various window lengths for activity classification.

Result: The k-nearest neighbor classifier achieved the best performance, with a mean AUC of 0.98 (standard deviation 0.0026) on the training set and 0.99 on the testing set. The SHAP framework provided feature interpretability for practitioners.

Conclusion: The study demonstrates successful development of explainable and practical ML models for sustainable livestock management through robust data processing, high-accuracy classification, and transparent feature interpretation.

Abstract: Monitoring cattle health and optimizing yield are key challenges faced by dairy farmers due to difficulties in tracking all animals on the farm. This work aims to showcase modern data-driven farming practices based on explainable machine learning (ML) methods that explain the activity and behaviour of dairy cattle (cows). Continuous data collection from 3-axis accelerometer sensors and the use of robust ML methodologies and algorithms provide farmers and researchers with actionable information on cattle activity, allowing farmers to make informed decisions and incorporate sustainable practices. This study utilizes Bluetooth-based Internet of Things (IoT) devices and 4G networks for seamless data transmission, immediate analysis, and inference generation, and explains the models' performance with explainability frameworks. Special emphasis is put on the pre-processing of the accelerometer time-series data, including the extraction of statistical characteristics, signal processing techniques, and lag-based features using the sliding window technique. Various hyperparameter-optimized ML models are evaluated across varying window lengths for activity classification. The k-nearest neighbour classifier achieved the best performance, with a mean AUC of 0.98 (standard deviation 0.0026) on the training set and 0.99 on the testing set. To ensure transparency, Explainable AI frameworks such as SHAP are used to interpret feature importance in a way that practitioners can understand and act on. A detailed comparison of the important features, along with a stability analysis of the selected features, supports the development of explainable and practical ML models for sustainable livestock management.
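
A compact sketch of the classification pipeline described above: sliding-window statistical features over a 3-axis accelerometer stream, a k-nearest-neighbour classifier, and AUC evaluation. Window length, feature set, and the toy labels are placeholders for the paper's actual configuration.

```python
# Sketch of the activity-classification pipeline: sliding-window statistics over
# 3-axis accelerometer data, a k-NN classifier, and AUC evaluation.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
signal = rng.standard_normal((20000, 3))          # toy 3-axis accelerometer stream
labels = (rng.random(20000) > 0.5).astype(int)    # toy per-sample activity labels

WIN = 100                                         # sliding-window length (placeholder)

def window_features(sig, lab, win=WIN):
    feats, ys = [], []
    for start in range(0, len(sig) - win, win):
        w = sig[start:start + win]
        feats.append(np.concatenate([w.mean(0), w.std(0), w.min(0), w.max(0)]))
        ys.append(int(lab[start:start + win].mean() > 0.5))   # majority label
    return np.array(feats), np.array(ys)

X, y = window_features(signal, labels)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

clf = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print("test AUC:", round(auc, 3))
# Feature attributions could then be computed with an explainer such as
# shap.KernelExplainer on clf.predict_proba, as in the paper's explainability step.
```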

[392] FreeGAD: A Training-Free yet Effective Approach for Graph Anomaly Detection

Yunfeng Zhao, Yixin Liu, Shiyuan Li, Qingfeng Chen, Yu Zheng, Shirui Pan

Main category: cs.LG

TL;DR: FreeGAD is a training-free graph anomaly detection method that uses affinity-gated residual encoding and anchor-guided statistical deviations to achieve high performance without complex training processes.

DetailsMotivation: Existing deep learning-based graph anomaly detection methods suffer from high deployment costs and poor scalability due to complex training processes, despite empirical evidence suggesting training contributes less to performance than expected.

Method: Uses affinity-gated residual encoder to generate anomaly-aware representations, identifies anchor nodes as pseudo-normal/anomalous guides, and calculates anomaly scores through anchor-guided statistical deviations without any training.

Result: Extensive experiments show FreeGAD achieves superior anomaly detection performance, efficiency, and scalability on multiple benchmark datasets from diverse domains.

Conclusion: Training-free approaches like FreeGAD can effectively detect graph anomalies while avoiding the computational costs and scalability issues of traditional deep learning methods.

Abstract: Graph Anomaly Detection (GAD) aims to identify nodes that deviate from the majority within a graph, playing a crucial role in applications such as social networks and e-commerce. Despite the current advancements in deep learning-based GAD, existing approaches often suffer from high deployment costs and poor scalability due to their complex and resource-intensive training processes. Surprisingly, our empirical findings suggest that the training phase of deep GAD methods, commonly perceived as crucial, may actually contribute less to anomaly detection performance than expected. Inspired by this, we propose FreeGAD, a novel training-free yet effective GAD method. Specifically, it leverages an affinity-gated residual encoder to generate anomaly-aware representations. Meanwhile, FreeGAD identifies anchor nodes as pseudo-normal and anomalous guides, followed by calculating anomaly scores through anchor-guided statistical deviations. Extensive experiments demonstrate that FreeGAD achieves superior anomaly detection performance, efficiency, and scalability on multiple benchmark datasets from diverse domains, without any training or iterative optimization.
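
As a loose, training-free illustration of the anchor-guided scoring idea summarised above (not FreeGAD's actual encoder or score), the sketch below smooths node features over the graph, picks pseudo-normal and pseudo-anomalous anchor nodes by how strongly each node deviates from its smoothed neighbourhood, and scores nodes by their relative distance to the two anchor sets; the propagation scheme and anchor rule are assumptions.

```python
# Loose illustration of anchor-guided, training-free anomaly scoring; not FreeGAD's algorithm.
import numpy as np

def anomaly_scores(A, X, hops=2, alpha=0.5, n_anchors=3):
    # Row-normalised propagation with a residual connection (no training).
    A_hat = A / np.clip(A.sum(axis=1, keepdims=True), 1e-9, None)
    H = X.copy()
    for _ in range(hops):
        H = alpha * H + (1 - alpha) * A_hat @ H        # residual smoothing
    # Nodes that disagree with their smoothed neighbourhood get large residuals.
    residual = np.linalg.norm(H - A_hat @ H, axis=1)
    order = np.argsort(residual)
    normal_anchors = H[order[:n_anchors]]              # pseudo-normal guides
    anomalous_anchors = H[order[-n_anchors:]]          # pseudo-anomalous guides
    d_norm = np.linalg.norm(H[:, None] - normal_anchors[None], axis=2).mean(1)
    d_anom = np.linalg.norm(H[:, None] - anomalous_anchors[None], axis=2).mean(1)
    return d_norm - d_anom                             # higher = more anomalous

# Tiny synthetic graph: 30 nodes, the last 3 given outlying features.
rng = np.random.default_rng(1)
A = (rng.random((30, 30)) < 0.2).astype(float)
A = np.maximum(A, A.T)
X = rng.normal(0, 1, (30, 8))
X[-3:] += 5.0
print(np.argsort(-anomaly_scores(A, X))[:3])  # expected to surface the planted outliers 27-29
```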

[393] Comparative Analysis of Time Series Foundation Models for Demographic Forecasting: Enhancing Predictive Accuracy in US Population Dynamics

Aditya Akella, Jonathan Farah

Main category: cs.LG

TL;DR: Time series foundation models outperform traditional methods in demographic forecasting, achieving lowest MSE in 86.67% of cases, especially for minority populations with sparse data.

DetailsMotivation: Demographic shifts pose significant challenges for policymakers, and accurate forecasting is essential for informed decision-making in urban planning, healthcare, and economic policy.

Method: Evaluated Time Series Foundation Model (TimesFM) against traditional baselines (LSTM, ARIMA, Linear Regression) using U.S. Census Bureau and FRED datasets across six demographically diverse states.

Result: TimesFM achieved the lowest Mean Squared Error in 86.67% of test cases, with particularly strong performance on minority populations with sparse historical data.

Conclusion: Pre-trained foundation models have significant potential to enhance demographic analysis and inform proactive policy interventions without requiring extensive task-specific fine-tuning.

Abstract: Demographic shifts, influenced by globalization, economic conditions, geopolitical events, and environmental factors, pose significant challenges for policymakers and researchers. Accurate demographic forecasting is essential for informed decision-making in areas such as urban planning, healthcare, and economic policy. This study explores the application of time series foundation models to predict demographic changes in the United States using datasets from the U.S. Census Bureau and Federal Reserve Economic Data (FRED). We evaluate the performance of the Time Series Foundation Model (TimesFM) against traditional baselines including Long Short-Term Memory (LSTM) networks, Autoregressive Integrated Moving Average (ARIMA), and Linear Regression. Our experiments across six demographically diverse states demonstrate that TimesFM achieves the lowest Mean Squared Error (MSE) in 86.67% of test cases, with particularly strong performance on minority populations with sparse historical data. These findings highlight the potential of pre-trained foundation models to enhance demographic analysis and inform proactive policy interventions without requiring extensive task-specific fine-tuning.
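
The evaluation protocol implied above can be sketched as a simple hold-out comparison of forecast MSE; the toy population series, horizon, and baselines below are illustrative, and the foundation-model forecast is left as a placeholder hook rather than a real TimesFM call.

```python
# Illustrative evaluation sketch; series values and horizon are assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

series = np.array([3.9, 4.0, 4.1, 4.25, 4.4, 4.5, 4.65, 4.8, 4.9, 5.05,
                   5.2, 5.3, 5.45, 5.6, 5.7])          # e.g. state population (millions)
horizon = 3
train, test = series[:-horizon], series[-horizon:]

# Baseline 1: linear trend on the time index.
t = np.arange(len(train)).reshape(-1, 1)
lr = LinearRegression().fit(t, train)
pred_lr = lr.predict(np.arange(len(train), len(series)).reshape(-1, 1))

# Baseline 2: naive "repeat last value" forecast.
pred_naive = np.full(horizon, train[-1])

# Placeholder for a pre-trained foundation model (e.g. TimesFM); its API is not shown here.
# pred_fm = foundation_model.forecast(train, horizon)

for name, pred in [("linear trend", pred_lr), ("naive", pred_naive)]:
    print(f"{name:12s} MSE: {mean_squared_error(test, pred):.4f}")
```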

[394] Robust Federated Learning under Adversarial Attacks via Loss-Based Client Clustering

Emmanouil Kritharakis, Dusan Jakovetic, Antonios Makris, Konstantinos Tserpes

Main category: cs.LG

TL;DR: A robust federated learning method that requires only one honest client and a trusted server with side data to defend against Byzantine attacks, outperforming existing baselines.

DetailsMotivation: Federated Learning is vulnerable to adversarial attacks from malicious clients, and existing robust aggregation methods often require knowing the number of attackers or have limited effectiveness.

Method: Proposes a Byzantine-robust FL approach that leverages a trusted server with side dataset and requires only one honest client, without needing prior knowledge of malicious client count.

Result: Theoretical analysis shows bounded optimality gaps under strong attacks. Experiments demonstrate superior performance over Mean, Trimmed Mean, Median, Krum, and Multi-Krum against various attack strategies on MNIST, FMNIST, and CIFAR-10 using Flower framework.

Conclusion: The method provides effective Byzantine robustness in FL with minimal trust assumptions (server + one client) and no need to know the number of malicious participants in advance.

Abstract: Federated Learning (FL) enables collaborative model training across multiple clients without sharing private data. We consider FL scenarios wherein FL clients are subject to adversarial (Byzantine) attacks, while the FL server is trusted (honest) and has a trustworthy side dataset. This may correspond to, e.g., cases where the server possesses trusted data prior to federation, or to the presence of a trusted client that temporarily assumes the server role. Our approach requires only two honest participants, i.e., the server and one client, to function effectively, without prior knowledge of the number of malicious clients. Theoretical analysis demonstrates bounded optimality gaps even under strong Byzantine attacks. Experimental results show that our algorithm significantly outperforms standard and robust FL baselines such as Mean, Trimmed Mean, Median, Krum, and Multi-Krum under various attack strategies including label flipping, sign flipping, and Gaussian noise addition across MNIST, FMNIST, and CIFAR-10 benchmarks using the Flower framework.
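
A hedged sketch of the loss-based filtering idea: the trusted server scores every client update on its side dataset, splits the losses into two clusters, and aggregates only the low-loss cluster. The linear model, the largest-gap clustering rule, and the simulated sign-flip attack below are assumptions, not the authors' exact algorithm.

```python
# Hedged sketch of loss-based client clustering on a trusted server; not the paper's exact rule.
import numpy as np

rng = np.random.default_rng(0)
d = 5
w_true = rng.normal(size=d)
X_side = rng.normal(size=(200, d))                    # trusted server-side data
y_side = X_side @ w_true + 0.01 * rng.normal(size=200)

honest = [w_true + 0.05 * rng.normal(size=d) for _ in range(4)]
byzantine = [-w_true + rng.normal(size=d) for _ in range(3)]   # sign-flip style attackers
updates = honest + byzantine

losses = np.array([np.mean((X_side @ w - y_side) ** 2) for w in updates])

# Two-cluster split of the 1-D losses via the largest gap in sorted order.
order = np.argsort(losses)
gaps = np.diff(losses[order])
cut = np.argmax(gaps) + 1
kept = order[:cut]                                    # low-loss cluster only
w_agg = np.mean([updates[i] for i in kept], axis=0)

print("kept clients:", sorted(kept.tolist()))
print("aggregate error:", np.linalg.norm(w_agg - w_true))
```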

[395] Learning In-context n-grams with Transformers: Sub-n-grams Are Near-stationary Points

Aditya Varre, Gizem Yüce, Nicolas Flammarion

Main category: cs.LG

TL;DR: Transformer models exhibit stage-wise learning with sub-n-gram solutions as near-stationary points in the loss landscape, explaining plateau phenomena during training.

DetailsMotivation: Empirical observations of prolonged plateaus and stage-wise progression during transformer training motivated investigation of the loss landscape for in-context next-token prediction tasks.

Method: Analyzed learning in-context n-gram language models under cross-entropy loss, established sufficient conditions for stationary points, and constructed parameter configurations representing k-gram estimators for simplified transformer models.

Result: Gradient of population loss vanishes at sub-n-gram solutions in infinite sequence length limit, revealing sub-n-grams are near-stationary points that explain stage-wise learning dynamics and emergent phase transitions.

Conclusion: Theoretical analysis and numerical experiments demonstrate that sub-n-gram solutions form near-stationary regions in the loss landscape, providing insight into transformer training dynamics with discrete transitions between these solutions.

Abstract: Motivated by empirical observations of prolonged plateaus and stage-wise progression during training, we investigate the loss landscape of transformer models trained on in-context next-token prediction tasks. In particular, we focus on learning in-context $n$-gram language models under cross-entropy loss, and establish a sufficient condition for parameter configurations to be stationary points. We then construct a set of parameter configurations for a simplified transformer model that represent $k$-gram estimators (for $k \leq n$), and show that the gradient of the population loss at these solutions vanishes in the limit of infinite sequence length and parameter norm. This reveals a key property of the loss landscape: {sub-$n$-grams are near-stationary points of the population cross-entropy loss}, offering theoretical insight into widely observed phenomena such as stage-wise learning dynamics and emergent phase transitions. These insights are further supported by numerical experiments that illustrate the learning dynamics of $n$-grams, characterized by discrete transitions between near-stationary solutions.
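
The sub-n-gram solutions discussed above correspond to predictors that condition on only the last k-1 tokens of the prompt. A small illustrative implementation of such an in-context k-gram estimator follows; the toy sequence and vocabulary are assumptions.

```python
# Illustrative in-context k-gram estimator (k <= n); toy data is an assumption.
from collections import Counter

def kgram_next_token_probs(context, k, vocab):
    """Empirical distribution over the next token given the last k-1 tokens of the context."""
    if k == 1:
        counts = Counter(context)                      # unigram fallback
    else:
        suffix = tuple(context[-(k - 1):])
        counts = Counter(
            context[i + k - 1]
            for i in range(len(context) - k + 1)
            if tuple(context[i:i + k - 1]) == suffix
        )
    total = sum(counts.values())
    return {t: (counts[t] / total if total else 1 / len(vocab)) for t in vocab}

ctx = list("abcabcab")
for k in (1, 2, 3):
    print(k, kgram_next_token_probs(ctx, k, vocab=list("abcd")))
```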

[396] HRS: Hybrid Representation Framework with Scheduling Awareness for Time Series Forecasting in Crowdsourced Cloud-Edge Platforms

Tiancheng Zhang, Cheng Zhang, Shuren Liu, Xiaofei Wang, Shaoyuan Huang, Wenyu Wang

Main category: cs.LG

TL;DR: HRS framework combines numerical and image representations for better load forecasting in cloud-edge platforms, using scheduling-aware loss to reduce SLA violations by 63.1% and profit loss by 32.3%.

DetailsMotivation: Existing load forecasting methods either cause underprovisioning with SLA violations during peak traffic or conservative overprovisioning with high resource costs, creating a dilemma for streaming service QoS.

Method: Hybrid representation framework (HRS) integrating numerical and image-based representations to capture extreme load dynamics, plus Scheduling-Aware Loss (SAL) that accounts for asymmetric impact of prediction errors.

Result: HRS outperforms 10 baselines across 4 real-world datasets, reducing SLA violation rates by 63.1% and total profit loss by 32.3% compared to state-of-the-art methods.

Conclusion: The proposed HRS framework with scheduling awareness effectively addresses the load forecasting dilemma in CCPs, achieving superior performance in balancing QoS maintenance and resource efficiency.

Abstract: With the rapid proliferation of streaming services, network load exhibits highly time-varying and bursty behavior, posing serious challenges for maintaining Quality of Service (QoS) in Crowdsourced Cloud-Edge Platforms (CCPs). While CCPs leverage Predict-then-Schedule architecture to improve QoS and profitability, accurate load forecasting remains challenging under traffic surges. Existing methods either minimize mean absolute error, resulting in underprovisioning and potential Service Level Agreement (SLA) violations during peak periods, or adopt conservative overprovisioning strategies, which mitigate SLA risks at the expense of increased resource expenditure. To address this dilemma, we propose HRS, a hybrid representation framework with scheduling awareness that integrates numerical and image-based representations to better capture extreme load dynamics. We further introduce a Scheduling-Aware Loss (SAL) that captures the asymmetric impact of prediction errors, guiding predictions that better support scheduling decisions. Extensive experiments on four real-world datasets demonstrate that HRS consistently outperforms ten baselines and achieves state-of-the-art performance, reducing SLA violation rates by 63.1% and total profit loss by 32.3%.
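
The scheduling-aware idea of penalising under-provisioning more than over-provisioning can be pictured with a simple asymmetric loss; the weights below are assumptions, not the paper's exact SAL formulation.

```python
# Hedged sketch of an asymmetric, scheduling-aware forecasting loss; weights are assumptions.
import numpy as np

def scheduling_aware_loss(pred, actual, under_weight=4.0, over_weight=1.0):
    err = pred - actual
    under = np.clip(-err, 0, None)        # capacity shortfall (SLA-violation risk)
    over = np.clip(err, 0, None)          # wasted provisioned resources
    return np.mean(under_weight * under ** 2 + over_weight * over ** 2)

actual = np.array([100., 120., 300., 180.])           # bursty load
conservative = np.array([150., 150., 320., 220.])     # over-provisions
optimistic = np.array([110., 115., 240., 170.])       # misses the peak

print("conservative:", scheduling_aware_loss(conservative, actual))
print("optimistic:  ", scheduling_aware_loss(optimistic, actual))
```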

[397] Training Machine Learning Models on Human Spatio-temporal Mobility Data: An Experimental Study [Experiment Paper]

Yueyang Liu, Lance Kennedy, Ruochen Kong, Joon-Seok Kim, Andreas Züfle

Main category: cs.LG

TL;DR: This paper explores best practices for training ML models to predict individuals’ complete trajectories over days/weeks, focusing on macro-level mobility patterns and life routines rather than just short-term movements.

DetailsMotivation: Existing research focuses mainly on microscopic aspects of human trajectories (short-term predictions), while macro-level mobility patterns and life routines remain underexplored. The paper aims to determine optimal training strategies for forecasting complete individual trajectories.

Method: Comprehensive experimental analysis using LSTM and Transformer architectures, incorporating semantic information (day-of-week, user-specific historical data), user semantic clustering with stratified sampling to address data skewness, and small-batch stochastic gradient optimization.

Result: Explicit inclusion of semantic information improves prediction accuracy by helping models understand individual life patterns. User sampling can exacerbate data skewness and reduce accuracy, but stratified sampling with semantic clustering preserves diversity. Small-batch optimization performs better with limited training data.

Conclusion: Incorporating semantic context and addressing data imbalance through careful sampling strategies significantly enhances human mobility prediction performance, with small-batch optimization being particularly effective for limited data scenarios.

Abstract: Individual-level human mobility prediction has emerged as a significant topic of research with applications in infectious disease monitoring and child and elderly care. Existing studies predominantly focus on the microscopic aspects of human trajectories, such as predicting short-term trajectories or the next location visited, while offering limited attention to macro-level mobility patterns and the corresponding life routines. In this paper, we focus on an underexplored problem in human mobility prediction: determining the best practices to train a machine learning model using historical data to forecast an individual's complete trajectory over the next days and weeks. In this experiment paper, we undertake a comprehensive experimental analysis of diverse models, parameter configurations, and training strategies, accompanied by an in-depth examination of the statistical distribution inherent in human mobility patterns. Our empirical evaluations encompass both Long Short-Term Memory and Transformer-based architectures, and further investigate how incorporating individual life patterns can enhance the effectiveness of the prediction. We show that explicitly including semantic information such as day-of-the-week and user-specific historical information can help the model better understand individual patterns of life and improve predictions. Moreover, since explicit user information is often missing due to user privacy, we show that the sampling of users may exacerbate data skewness and result in a substantial loss in predictive accuracy. To mitigate data imbalance and preserve diversity, we apply user semantic clustering with stratified sampling to ensure that the sampled dataset remains representative. Our results further show that small-batch stochastic gradient optimization improves model performance, especially when human mobility training data is limited.
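
The stratified sampling strategy described above can be sketched as clustering users on coarse mobility statistics and then sampling proportionally within each cluster; the features and cluster count below are illustrative assumptions.

```python
# Illustrative stratified user sampling via clustering; features and k are assumptions.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Per-user summary features, e.g. [trips/day, distinct places, weekend ratio].
user_feats = np.vstack([
    rng.normal([2, 5, 0.2], 0.3, size=(60, 3)),    # routine commuters
    rng.normal([6, 20, 0.5], 0.5, size=(30, 3)),   # highly mobile users
    rng.normal([1, 2, 0.8], 0.2, size=(10, 3)),    # mostly-home users
])

k = 3
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(user_feats)

sample_frac = 0.3
sampled = []
for c in range(k):
    members = np.flatnonzero(labels == c)
    n_take = max(1, int(round(sample_frac * len(members))))   # keep every stratum represented
    sampled.extend(rng.choice(members, size=n_take, replace=False))

print("sampled users per cluster:",
      {c: int(np.sum(labels[sampled] == c)) for c in range(k)})
```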

cs.MA

[398] Goal-Directedness is in the Eye of the Beholder

Nina Rajcic, Anders Søgaard

Main category: cs.MA

TL;DR: Goal-directedness cannot be objectively measured in complex agent systems, challenging both behavioral and mechanistic approaches to goal attribution.

DetailsMotivation: To address the fundamental problems in formalizing and measuring goal-directed behavior in complex agents, as current approaches (behavioral observation and mechanistic probing of internal states) face significant technical and conceptual challenges.

Method: The paper analyzes the assumptions behind both behavioral and mechanistic approaches to goal attribution, identifying their technical and conceptual limitations when formalizing goals in agent systems.

Result: The analysis concludes that goal-directedness cannot be measured objectively through either behavioral observation or mechanistic probing of internal model states.

Conclusion: The paper proposes new directions for modeling goal-directedness as an emergent property of dynamic, multi-agent systems, suggesting a shift from objective measurement to emergent property modeling.

Abstract: Our ability to predict the behavior of complex agents turns on the attribution of goals. Probing for goal-directed behavior comes in two flavors: Behavioral and mechanistic. The former proposes that goal-directedness can be estimated through behavioral observation, whereas the latter attempts to probe for goals in internal model states. We work through the assumptions behind both approaches, identifying technical and conceptual problems that arise from formalizing goals in agent systems. We arrive at the perhaps surprising position that goal-directedness cannot be measured objectively. We outline new directions for modeling goal-directedness as an emergent property of dynamic, multi-agent systems.

[399] Self-Organizing Agent Network for LLM-based Workflow Automation

Yiming Xiong, Jian Wang, Bing Li, Yuhan Zhu, Yuqi Zhao

Main category: cs.MA

TL;DR: SOAN is a structure-driven orchestration framework that builds formalized agent networks to handle complex enterprise workflows with multi-layer nesting, outperforming state-of-the-art methods.

DetailsMotivation: Real-world enterprise workflows involve complex, deeply nested execution paths that challenge LLM-driven orchestration due to extended reasoning chains and state-space explosions.

Method: SOAN incrementally builds a formalized agent network by identifying and encapsulating structural units as independent agents, enhancing modularity and clarity.

Result: Extensive evaluations show SOAN significantly outperforms state-of-the-art methods in adaptability, fault tolerance, and execution efficiency.

Conclusion: SOAN provides an effective solution for handling complex multi-layer nested workflows through structure-driven orchestration with formalized agent networks.

Abstract: Recent multi-agent frameworks built upon large language models (LLMs) have demonstrated remarkable capabilities in complex task planning. However, in real-world enterprise environments, business workflows are typically composed through modularization and reuse of numerous subprocesses, resulting in intricate workflows characterized by lengthy and deeply nested execution paths. Such complexity poses significant challenges for LLM-driven orchestration, as extended reasoning chains and state-space explosions severely impact planning effectiveness and the proper sequencing of tool invocations. Therefore, developing an orchestration method with controllable structures capable of handling multi-layer nesting becomes a critical issue. To address this, we propose a novel structure-driven orchestration framework Self-Organizing Agent Network (SOAN). SOAN incrementally builds a formalized agent network by identifying and encapsulating structural units as independent agents, enhancing modularity and clarity in orchestration. Extensive evaluations were performed using multiple benchmarks as well as a real-world enterprise workflow dataset. Experimental results demonstrate that SOAN significantly outperforms state-of-the-art methods in terms of adaptability, fault tolerance, and execution efficiency.

[400] BetaWeb: Towards a Blockchain-enabled Trustworthy Agentic Web

Zihan Guo, Yuanjian Zhou, Chenyi Wang, Linlin You, Minjie Bian, Weinan Zhang

Main category: cs.MA

TL;DR: BetaWeb introduces a blockchain-based framework to address privacy, data management, and value measurement challenges in LLM-based multi-agent systems, enabling trustworthy and scalable agent interactions.

DetailsMotivation: Current agentic ecosystems are fragmented and closed, with existing centralized/semi-centralized paradigms unable to support large-scale, heterogeneous, cross-domain autonomous interactions due to privacy, data management, and value measurement challenges.

Method: Proposes BetaWeb - a blockchain-enabled trustworthy Agentic Web framework that leverages blockchain’s inherent strengths to provide scalable infrastructure for LLM-based multi-agent systems, with a five-stage evolutionary roadmap from passive execution to autonomous governance.

Result: BetaWeb offers a foundation for a resilient, trustworthy, and sustainably incentivized digital ecosystem, potentially advancing the Web paradigm from Web3 (data ownership) to Web3.5 (agent capability ownership and intelligence monetization).

Conclusion: Deep integration between blockchain and LLM-based multi-agent systems can establish interconnected and scalable paradigm for Agentic AI, addressing core ecosystem barriers through decentralized, trustworthy infrastructure.

Abstract: The rapid development of large language models (LLMs) has significantly propelled the development of artificial intelligence (AI) agents, which are increasingly evolving into diverse autonomous entities, advancing the LLM-based multi-agent systems (LaMAS). However, current agentic ecosystems remain fragmented and closed. Establishing an interconnected and scalable paradigm for Agentic AI has become a critical prerequisite. Although Agentic Web proposes an open architecture to break the ecosystem barriers, its implementation still faces core challenges such as privacy protection, data management, and value measurement. Existing centralized or semi-centralized paradigms suffer from inherent limitations, making them inadequate for supporting large-scale, heterogeneous, and cross-domain autonomous interactions. To address these challenges, this paper introduces the blockchain-enabled trustworthy Agentic Web (BetaWeb). By leveraging the inherent strengths of blockchain, BetaWeb not only offers a trustworthy and scalable infrastructure for LaMAS but also has the potential to advance the Web paradigm from Web3 (centered on data ownership) towards Web3.5, which emphasizes ownership of agent capabilities and the monetization of intelligence. Beyond a systematic examination of the BetaWeb framework, this paper presents a five-stage evolutionary roadmap, outlining the path of LaMAS from passive execution to advanced collaboration and autonomous governance. We also conduct a comparative analysis of existing products and discuss key challenges of BetaWeb from multiple perspectives. Ultimately, we argue that deep integration between blockchain and LaMAS can lay the foundation for a resilient, trustworthy, and sustainably incentivized digital ecosystem. A summary of the enabling technologies for each stage is available at https://github.com/MatZaharia/BetaWeb.

[401] COCO: Cognitive Operating System with Continuous Oversight for Multi-Agent Workflow Reliability

Churong Liang, Jinling Gan, Kairan Hong, Qiushi Tian, Zongze Wu, Runnan Li

Main category: cs.MA

TL;DR: COCO framework introduces continuous oversight for multi-agent workflows with O(1) monitoring overhead, achieving 6.5% performance improvement through contextual rollback, bidirectional reflection, and heterogeneous cross-validation.

DetailsMotivation: Large-scale multi-agent workflows are vulnerable to error propagation and quality degradation without corrective mechanisms for upstream failures.

Method: Decoupled architecture separating error detection from execution path, with three innovations: contextual rollback mechanism, bidirectional reflection protocol, and heterogeneous cross-validation using ensemble disagreement metrics.

Result: 6.5% average performance improvement on benchmark multi-agent tasks, establishing new state-of-the-art for autonomous workflow reliability.

Conclusion: COCO provides a theoretically-grounded framework for continuous oversight in multi-agent systems, effectively addressing error propagation while maintaining computational efficiency.

Abstract: Large-scale multi-agent workflows exhibit inherent vulnerability to error propagation and quality degradation, where downstream agents compound upstream failures without corrective mechanisms. We introduce COCO (Cognitive Operating System with Continuous Oversight), a theoretically-grounded framework that implements asynchronous self-monitoring and adaptive error correction in multi-agent driven systems. COCO addresses the fundamental trade-off between quality assurance and computational efficiency through a novel decoupled architecture that separates error detection from the critical execution path, achieving $O(1)$ monitoring overhead relative to workflow complexity. COCO employs three key algorithmic innovations to address systematic and stochastic errors: (1) Contextual Rollback Mechanism - a stateful restart protocol that preserves execution history and error diagnostics, enabling informed re-computation rather than naive retry; (2) Bidirectional Reflection Protocol - a mutual validation system between monitoring and execution modules that prevents oscillatory behavior and ensures convergence; (3) Heterogeneous Cross-Validation - leveraging model diversity to detect systematic biases and hallucinations through ensemble disagreement metrics. Extensive experiments on benchmark multi-agent tasks demonstrate 6.5% average performance improvement, establishing new state-of-the-art for autonomous workflow reliability.
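
One way to picture the heterogeneous cross-validation step is an ensemble-disagreement check over the outputs of different models for the same workflow step; the token-overlap measure and threshold below are assumptions, not COCO's actual metric.

```python
# Hedged sketch of ensemble-disagreement checking; overlap measure and threshold are assumptions.
from itertools import combinations

def disagreement(answers):
    def jaccard(a, b):
        ta, tb = set(a.lower().split()), set(b.lower().split())
        return len(ta & tb) / len(ta | tb) if ta | tb else 1.0
    pairs = list(combinations(answers, 2))
    return 1.0 - sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Outputs of heterogeneous models for the same intermediate workflow step.
step_outputs = [
    "the invoice total is 1240 euros due on 2024-03-01",
    "invoice total: 1240 euros, due 2024-03-01",
    "the invoice total is 9999 euros due next week",   # likely hallucination
]

score = disagreement(step_outputs)
print(f"disagreement = {score:.2f}")
if score > 0.5:                                        # assumed threshold
    print("flag step for contextual rollback / re-computation")
```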

[402] The Multi-Stage Assignment Problem: A Fairness Perspective

Vibulan J, Swapnil Dhamal, Shweta Jain

Main category: cs.MA

TL;DR: This paper addresses fair path assignment in multi-stage graphs, showing that cost-minimizing assignments can be highly unfair. It proves NP-hardness of envy minimization, proposes C-Balance and DC-Balance algorithms with bounded envy guarantees, and demonstrates significant speed advantages over ILP approaches.

DetailsMotivation: Traditional efficient assignments that minimize overall costs in multi-stage graphs can create significant cost disparities (envy) among agents, leading to unfair outcomes that need to be addressed.

Method: The authors propose C-Balance algorithm for two agents with bounded envy guarantee, then extend to n agents with DC-Balance that makes iterative calls to C-Balance. Both algorithms provide theoretical bounds on envy and cost of fairness.

Result: C-Balance guarantees envy bounded by 2M (maximum edge weight) for two agents, with tightness demonstrated. DC-Balance achieves envy arbitrarily close to 2M for n agents. Both algorithms show bounded cost of fairness ratios, with experimental results showing orders of magnitude faster performance than ILP.

Conclusion: The proposed algorithms provide effective solutions for fair path assignment in multi-stage graphs with proven envy bounds and reasonable cost of fairness, offering practical advantages over traditional optimization approaches.

Abstract: This paper explores the problem of fair assignment on Multi-Stage graphs. A multi-stage graph consists of nodes partitioned into $K$ disjoint sets (stages) structured as a sequence of weighted bipartite graphs formed across adjacent stages. The goal is to assign node-disjoint paths to $n$ agents starting from the first stage and ending in the last stage. We show that an efficient assignment that minimizes the overall sum of costs of all the agents’ paths may be highly unfair and lead to significant cost disparities (envy) among the agents. We further show that finding an envy-minimizing assignment on a multi-stage graph is NP-hard. We propose the C-Balance algorithm, which guarantees envy that is bounded by $2M$ in the case of two agents, where $M$ is the maximum edge weight. We demonstrate the algorithm’s tightness by presenting an instance where the envy is $2M$. We further show that the cost of fairness ($CoF$), defined as the ratio of the cost of the assignment given by the fair algorithm to that of the minimum cost assignment, is bounded by $2$ for C-Balance. We then extend this approach to $n$ agents by proposing the DC-Balance algorithm that makes iterative calls to C-Balance. We show the convergence of DC-Balance, resulting in envy that is arbitrarily close to $2M$. We derive $CoF$ bounds for DC-Balance and provide insights about its dependency on the instance-specific parameters and the desired degree of envy. We experimentally show that our algorithm runs several orders of magnitude faster than a suitably formulated ILP.
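
Under the reading that envy is the largest gap between agents' path costs, and with CoF taken from the abstract as the fair assignment's total cost divided by the minimum total cost, the two quantities can be computed as in the sketch below; the cost values are illustrative.

```python
# Small sketch of the fairness quantities; envy reading and cost values are assumptions.
def envy(path_costs):
    return max(path_costs) - min(path_costs)

def cost_of_fairness(fair_costs, efficient_costs):
    return sum(fair_costs) / sum(efficient_costs)

efficient = [3, 14]     # minimises total cost but is lopsided
balanced = [9, 10]      # e.g. what a C-Balance-style assignment might return

print("envy (efficient):", envy(efficient))
print("envy (balanced): ", envy(balanced))
print("CoF:", round(cost_of_fairness(balanced, efficient), 3))
```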

cs.MM

[403] Robust Live Streaming over LEO Satellite Constellations: Measurement, Analysis, and Handover-Aware Adaptation

Hao Fang, Haoyuan Zhao, Jianxin Shi, Miao Zhang, Guanzhen Wu, Yi Ching Chou, Feng Wang, Jiangchuan Liu

Main category: cs.MM

TL;DR: SARA is a middleware that improves live streaming over satellite networks by reducing rebuffering during handovers through intelligent playback speed modulation and network insights.

DetailsMotivation: Existing live streaming platforms perform poorly on Low Earth Orbit Satellite Networks due to frequent satellite handovers causing video rebuffering, and current ABR algorithms fail to handle these abrupt network variations effectively.

Method: SARA is a lightweight middleware that integrates with various ABR algorithms, intelligently modulates video playback speed, and provides ABR algorithms with insights about LSNs’ unique network characteristics to make better bitrate selections.

Result: SARA reduces rebuffering time by an average of 39.41%, slightly improves latency by 0.65%, with only a 0.13% overall loss in bitrate.

Conclusion: SARA effectively addresses the challenges of live streaming over satellite networks by minimizing rebuffering events during satellite handovers while maintaining good streaming quality.

Abstract: Live streaming has experienced significant growth recently. Yet this rise in popularity contrasts with the reality that a substantial segment of the global population still lacks Internet access. The emergence of Low Earth orbit Satellite Networks (LSNs), such as SpaceX’s Starlink and Amazon’s Project Kuiper, presents a promising solution to fill this gap. Nevertheless, our measurement study reveals that existing live streaming platforms may not be able to deliver a smooth viewing experience on LSNs due to frequent satellite handovers, which lead to frequent video rebuffering events. Current state-of-the-art learning-based Adaptive Bitrate (ABR) algorithms, even when trained on LSNs’ network traces, fail to manage the abrupt network variations associated with satellite handovers effectively. To address these challenges, for the first time, we introduce Satellite-Aware Rate Adaptation (SARA), a versatile and lightweight middleware that can seamlessly integrate with various ABR algorithms to enhance the performance of live streaming over LSNs. SARA intelligently modulates video playback speed and furnishes ABR algorithms with insights derived from the distinctive network characteristics of LSNs, thereby aiding ABR algorithms in making informed bitrate selections and effectively minimizing rebuffering events that occur during satellite handovers. Our extensive evaluation shows that SARA can effectively reduce the rebuffering time by an average of 39.41% and slightly improve latency by 0.65%, while only introducing an overall loss in bitrate of 0.13%.
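
A hedged sketch of what handover-aware playback-speed modulation could look like (not SARA's actual policy); the thresholds, speed range, and handover signal are assumptions.

```python
# Hedged sketch of handover-aware playback-speed modulation; all parameters are assumptions.
def playback_speed(buffer_s, handover_imminent,
                   low_buf=2.0, high_buf=6.0,
                   min_speed=0.9, max_speed=1.05):
    if handover_imminent and buffer_s < high_buf:
        return min_speed                     # stretch the buffer before the dip
    if buffer_s < low_buf:
        return min_speed                     # protect against rebuffering
    if buffer_s > high_buf:
        return max_speed                     # quietly catch up to the live edge
    return 1.0

for buf, ho in [(5.0, True), (1.5, False), (8.0, False), (4.0, False)]:
    print(f"buffer={buf:.1f}s handover={ho} -> speed {playback_speed(buf, ho)}")
```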

[404] INDS: Incremental Named Data Streaming for Real-Time Point Cloud Video

Ruonan Chai, Yixiang Zhu, Xinjiao Li, Jiawei Li, Zili Meng, Dirk Kutscher

Main category: cs.MM

TL;DR: INDS is an adaptive streaming framework for point cloud video that uses Information-Centric Networking and Octree structure to enable progressive retrieval, fine-grained caching, and improved performance over traditional systems.

DetailsMotivation: Real-time point cloud streaming faces challenges with massive data volumes, packet loss sensitivity, and limitations of current transport protocols (TCP/QUIC) which have coarse-grained delivery models and centralized control that restrict fine-grained adaptation and effective caching.

Method: INDS leverages Information-Centric Networking (ICN) and the Octree structure of point cloud video with expressive content naming to support progressive, partial retrieval of enhancement layers based on bandwidth and decoding capability. It combines time-windows with Group-of-Frames for fine-grained caching and multi-user data reuse, deployable as an overlay compatible with QUIC and Media-over-QUIC architectures.

Result: Prototype implementation shows 80% lower delay, 15-50% higher throughput, and 20-30% increased cache hit rates compared to state-of-the-art DASH-style systems.

Conclusion: INDS establishes itself as a scalable, cache-friendly solution for real-time point cloud streaming under variable conditions, with practical forward-compatibility for emerging immersive media systems through MoQ overlay compatibility.

Abstract: Real-time streaming of point cloud video, characterized by massive data volumes and high sensitivity to packet loss, remains a key challenge for immersive applications under dynamic network conditions. While connection-oriented protocols such as TCP and more modern alternatives like QUIC alleviate some transport-layer inefficiencies, including head-of-line blocking, they still retain a coarse-grained, segment-based delivery model and a centralized control loop that limit fine-grained adaptation and effective caching. We introduce INDS (Incremental Named Data Streaming), an adaptive streaming framework based on Information-Centric Networking (ICN) that rethinks delivery for hierarchical, layered media. INDS leverages the Octree structure of point cloud video and expressive content naming to support progressive, partial retrieval of enhancement layers based on consumer bandwidth and decoding capability. By combining time-windows with Group-of-Frames (GoF), INDS’s naming scheme supports fine-grained in-network caching and facilitates efficient multi-user data reuse. INDS can be deployed as an overlay, remaining compatible with QUIC-based transport infrastructure as well as future Media-over-QUIC (MoQ) architectures, without requiring changes to underlying IP networks. Our prototype implementation shows up to 80% lower delay, 15-50% higher throughput, and 20-30% increased cache hit rates compared to state-of-the-art DASH-style systems. Together, these results establish INDS as a scalable, cache-friendly solution for real-time point cloud streaming under variable and lossy conditions, while its compatibility with MoQ overlays further positions it as a practical, forward-compatible architecture for emerging immersive media systems.
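
To illustrate how expressive naming enables partial retrieval, the sketch below generates hierarchical names keyed by time window, Group-of-Frames, and octree layer so a consumer can stop at whatever enhancement level it can decode; the exact name components are assumptions, not INDS's scheme.

```python
# Illustrative hierarchical naming for layered point cloud video; name layout is an assumption.
def interest_names(stream, window, gof, max_level):
    """Names for the base layer plus enhancement layers up to max_level."""
    return [f"/{stream}/tw{window}/gof{gof}/octree/L{lvl}"
            for lvl in range(max_level + 1)]

# A bandwidth-limited consumer stops at level 2; a stronger consumer re-uses the
# same cached lower-level objects and additionally fetches level 3.
print(interest_names("pcv/demo", window=12, gof=3, max_level=2))
print(interest_names("pcv/demo", window=12, gof=3, max_level=3)[-1])
```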

[405] MAGNeT: Multimodal Adaptive Gaussian Networks for Intent Inference in Moving Target Selection across Complex Scenarios

Xiangxian Li, Yawen Zheng, Baiqiao Zhang, Yijia Ma, Xianhui Cao, Juan Liu, Yulong Bian, Jin Huang, Chenglei Yang

Main category: cs.MM

TL;DR: MAGNeT: A multimodal adaptive Gaussian network that combines classical statistical modeling with context-aware methods to improve moving target selection in diverse multimedia environments with minimal training data.

DetailsMotivation: Existing probabilistic models for moving target selection require substantial training data for each new context and lack transferability across scenarios, limiting practical deployment in diverse multimedia environments with rich multimodal contextual information.

Method: MAGNeT dynamically fuses pre-fitted Ternary-Gaussian models from various scenarios based on real-time contextual cues, combining classical statistical modeling with a context-aware multimodal approach.

Result: Extensive experiments on 2D and 3D moving target selection datasets under in-vehicle vibration conditions demonstrate that MAGNeT achieves lower error rates with few-shot samples through context-aware fusion of Gaussian experts from multi-factor conditions.

Conclusion: MAGNeT effectively addresses the limitations of existing approaches by enabling adaptation with minimal training data while preserving model interpretability, making it suitable for diverse multimedia interactive systems.

Abstract: Moving target selection in multimedia interactive systems faces unprecedented challenges as users increasingly interact across diverse and dynamic contexts-from live streaming in moving vehicles to VR gaming in varying environments. Existing approaches rely on probabilistic models that relate endpoint distribution to target properties such as size and speed. However, these methods require substantial training data for each new context and lack transferability across scenarios, limiting their practical deployment in diverse multimedia environments where rich multimodal contextual information is readily available. This paper introduces MAGNeT (Multimodal Adaptive Gaussian Networks), which addresses these problems by combining classical statistical modeling with a context-aware multimodal method. MAGNeT dynamically fuses pre-fitted Ternary-Gaussian models from various scenarios based on real-time contextual cues, enabling effective adaptation with minimal training data while preserving model interpretability. We conduct experiments on self-constructed 2D and 3D moving target selection datasets under in-vehicle vibration conditions. Extensive experiments demonstrate that MAGNeT achieves lower error rates with few-shot samples by applying context-aware fusion of Gaussian experts from multi-factor conditions.
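
A simplified sketch of context-aware fusion of pre-fitted endpoint models: plain 2D Gaussians stand in for the paper's Ternary-Gaussian model, and the context features, similarity weighting, and covariance blending are assumptions.

```python
# Hedged sketch of context-weighted fusion of pre-fitted endpoint models; not MAGNeT itself.
import numpy as np

# Pre-fitted per-scenario models: (context vector, endpoint mean, covariance).
scenarios = [
    (np.array([0.1, 0.2]), np.array([0.0, 0.0]), np.diag([1.0, 1.0])),   # calm conditions
    (np.array([0.9, 0.8]), np.array([0.5, 0.3]), np.diag([4.0, 3.0])),   # strong vibration
]

def fused_endpoint_model(context, temperature=0.5):
    sims = np.array([np.exp(-np.linalg.norm(context - c) / temperature)
                     for c, _, _ in scenarios])
    w = sims / sims.sum()                              # context-derived fusion weights
    mean = sum(wi * mu for wi, (_, mu, _) in zip(w, scenarios))
    cov = sum(wi * S for wi, (_, _, S) in zip(w, scenarios))   # simple moment blend
    return w, mean, cov

w, mean, cov = fused_endpoint_model(np.array([0.7, 0.6]))     # current context cues
print("weights:", np.round(w, 2), "mean:", np.round(mean, 2))
```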

eess.AS

[406] Rapidly Adapting to New Voice Spoofing: Few-Shot Detection of Synthesized Speech Under Distribution Shifts

Ashi Garg, Zexin Cai, Henry Li Xinyuan, Leibny Paola García-Perera, Kevin Duh, Sanjeev Khudanpur, Matthew Wiesner, Nicholas Andrews

Main category: eess.AS

TL;DR: Proposes self-attentive prototypical network for few-shot adaptation to detect synthesized speech under distribution shifts from unseen synthesis methods, speakers, languages, or audio conditions.

DetailsMotivation: Address the challenge of detecting synthesized speech under distribution shifts relative to training data, where traditional zero-shot detectors struggle with unseen synthesis methods, speakers, languages, or audio conditions.

Method: Self-attentive prototypical network for few-shot learning that rapidly adapts using a few in-distribution samples to handle distribution shifts.

Result: Achieves up to 32% relative EER reduction on Japanese deepfakes and 20% relative reduction on ASVspoof 2021 Deepfake dataset using as few as 10 in-distribution samples.

Conclusion: Few-shot adaptation with self-attentive prototypical networks effectively addresses distribution shifts in synthesized speech detection, significantly outperforming traditional zero-shot approaches.

Abstract: We address the challenge of detecting synthesized speech under distribution shifts – arising from unseen synthesis methods, speakers, languages, or audio conditions – relative to the training data. Few-shot learning methods are a promising way to tackle distribution shifts by rapidly adapting on the basis of a few in-distribution samples. We propose a self-attentive prototypical network to enable more robust few-shot adaptation. To evaluate our approach, we systematically compare the performance of traditional zero-shot detectors and the proposed few-shot detectors, carefully controlling training conditions to introduce distribution shifts at evaluation time. In conditions where distribution shifts hamper the zero-shot performance, our proposed few-shot adaptation technique can quickly adapt using as few as 10 in-distribution samples, achieving up to a 32% relative EER reduction on Japanese-language deepfakes and a 20% relative reduction on the ASVspoof 2021 Deepfake dataset.
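
The few-shot adaptation step can be pictured with a plain prototypical classifier: class prototypes are averaged from a handful of in-distribution support embeddings, and queries are assigned to the nearest prototype. The paper's self-attentive prototype construction is replaced by a simple mean here, and the embeddings are random stand-ins for a pretrained encoder's outputs.

```python
# Minimal prototypical-style scorer; mean prototypes and random embeddings are simplifications.
import numpy as np

def prototypes(support_emb, support_labels):
    return {c: support_emb[support_labels == c].mean(axis=0)
            for c in np.unique(support_labels)}

def classify(query_emb, protos):
    classes = list(protos)
    dists = np.stack([np.linalg.norm(query_emb - protos[c], axis=1) for c in classes], axis=1)
    return np.array(classes)[dists.argmin(axis=1)]

rng = np.random.default_rng(0)
# 10 in-distribution support embeddings (5 bona fide, 5 spoofed) plus queries.
support = np.vstack([rng.normal(0, 1, (5, 16)), rng.normal(2, 1, (5, 16))])
labels = np.array([0] * 5 + [1] * 5)
queries = np.vstack([rng.normal(0, 1, (3, 16)), rng.normal(2, 1, (3, 16))])

protos = prototypes(support, labels)
print("predicted:", classify(queries, protos))         # expect [0 0 0 1 1 1]
```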

[407] End-to-End Audio-Visual Learning for Cochlear Implant Sound Coding in Noisy Environments

Meng-Ping Lin, Enoch Hsin-Ho Huang, Shao-Yi Chien, Yu Tsao

Main category: eess.AS

TL;DR: A novel noise-suppressing cochlear implant system called AVSE-ECS that combines audio-visual speech enhancement with deep learning sound coding to improve speech comprehension in noisy environments.

DetailsMotivation: Current cochlear implants struggle with speech comprehension in noisy or reverberant conditions, despite recent advancements. Deep learning and multimodal approaches using visual cues offer promising opportunities to enhance CI performance.

Method: Developed AVSE-ECS system that uses an audio-visual speech enhancement model as pre-processing for the ElectrodeNet-CS sound coding strategy. Applied joint training approach to create an end-to-end CI system.

Result: The proposed method outperforms previous ECS strategy in noisy conditions, showing improved objective speech intelligibility scores.

Conclusion: The study demonstrates the feasibility and potential of integrating audio-visual speech enhancement modules into end-to-end cochlear implant systems using deep learning approaches.

Abstract: The cochlear implant (CI) is a remarkable biomedical device that successfully enables individuals with severe-to-profound hearing loss to perceive sound by converting speech into electrical stimulation signals. Despite advancements in the performance of recent CI systems, speech comprehension in noisy or reverberant conditions remains a challenge. Recent and ongoing developments in deep learning reveal promising opportunities for enhancing CI sound coding capabilities, not only through replicating traditional signal processing methods with neural networks, but also through integrating visual cues as auxiliary data for multimodal speech processing. Therefore, this paper introduces a novel noise-suppressing CI system, AVSE-ECS, which utilizes an audio-visual speech enhancement (AVSE) model as a pre-processing module for the deep-learning-based ElectrodeNet-CS (ECS) sound coding strategy. Specifically, a joint training approach is applied to model AVSE-ECS, an end-to-end CI system. Experimental results indicate that the proposed method outperforms the previous ECS strategy in noisy conditions, with improved objective speech intelligibility scores. The methods and findings in this study demonstrate the feasibility and potential of using deep learning to integrate the AVSE module into an end-to-end CI system.

[408] MMAU-Pro: A Challenging and Comprehensive Benchmark for Holistic Evaluation of Audio General Intelligence

Sonal Kumar, Šimon Sedláček, Vaibhavi Lokegaonkar, Fernando López, Wenyi Yu, Nishit Anand, Hyeonggon Ryu, Lichang Chen, Maxim Plička, Miroslav Hlaváček, William Fineas Ellingwood, Sathvik Udupa, Siyuan Hou, Allison Ferner, Sara Barahona, Cecilia Bolaños, Satish Rahi, Laura Herrera-Alarcón, Satvik Dixit, Siddhi Patil, Soham Deshmukh, Lasha Koroshinadze, Yao Liu, Leibny Paola Garcia Perera, Eleni Zanou, Themos Stafylakis, Joon Son Chung, David Harwath, Chao Zhang, Dinesh Manocha, Alicia Lozano-Diez, Santosh Kesiraju, Sreyan Ghosh, Ramani Duraiswami

Main category: eess.AS

TL;DR: MMAU-Pro is a comprehensive benchmark for evaluating audio intelligence in AI systems, covering speech, sound, music, and their combinations across 49 skills and complex dimensions, revealing significant limitations in current state-of-the-art models.

DetailsMotivation: Audio comprehension is essential for human-level intelligence, but existing benchmarks fail to comprehensively evaluate auditory intelligence across diverse audio types and complex reasoning tasks.

Method: Created a benchmark with 5,305 instances featuring audio-question-answer pairs sourced from the wild, requiring multi-hop reasoning across 49 skills including long-form comprehension, spatial reasoning, and multi-audio understanding.

Result: Evaluation of 22 leading AI models showed poor performance - Gemini 2.5 Flash (59.2%) and Audio Flamingo 3 (51.7%) accuracy, approaching random performance in multiple categories.

Conclusion: Current AI models have significant limitations in audio intelligence, and MMAU-Pro provides actionable insights for improving future systems toward audio general intelligence.

Abstract: Audio comprehension, including speech, non-speech sounds, and music, is essential for achieving human-level intelligence. Consequently, AI agents must demonstrate holistic audio understanding to qualify as generally intelligent. However, evaluating auditory intelligence comprehensively remains challenging. To address this gap, we introduce MMAU-Pro, the most comprehensive and rigorously curated benchmark for assessing audio intelligence in AI systems. MMAU-Pro contains 5,305 instances, where each instance has one or more audios paired with human expert-generated question-answer pairs, spanning speech, sound, music, and their combinations. Unlike existing benchmarks, MMAU-Pro evaluates auditory intelligence across 49 unique skills and multiple complex dimensions, including long-form audio comprehension, spatial audio reasoning, and multi-audio understanding, among others. All questions are meticulously designed to require deliberate multi-hop reasoning, including both multiple-choice and open-ended response formats. Importantly, audio data is sourced directly "from the wild" rather than from existing datasets with known distributions. We evaluate 22 leading open-source and proprietary multimodal AI models, revealing significant limitations: even state-of-the-art models such as Gemini 2.5 Flash and Audio Flamingo 3 achieve only 59.2% and 51.7% accuracy, respectively, approaching random performance in multiple categories. Our extensive analysis highlights specific shortcomings and provides novel insights, offering actionable perspectives for the community to enhance future AI systems’ progression toward audio general intelligence. The benchmark and code are available at https://sonalkum.github.io/mmau-pro.

[409] Less is More: Data Curation Matters in Scaling Speech Enhancement

Chenda Li, Wangyou Zhang, Wei Wang, Robin Scheibler, Kohei Saijo, Samuele Cornell, Yihui Fu, Marvin Sach, Zhaoheng Ni, Anurag Kumar, Tim Fingscheidt, Shinji Watanabe, Yanmin Qian

Main category: eess.AS

TL;DR: Quality over quantity: Curating high-quality training data (700 hours) outperforms larger datasets (2500 hours) with prevalent quality issues in speech enhancement systems.

DetailsMotivation: To address diminishing returns in scaling speech enhancement data by focusing on quality issues in "clean" training labels within large-scale datasets.

Method: Carefully curated subset selection from large-scale datasets, comparing models trained on 700-hour high-quality subset vs 2500-hour full dataset.

Result: Models trained on curated 700-hour subset outperform models trained on full 2500-hour dataset, demonstrating superior performance with less but higher quality data.

Conclusion: Data curation is more crucial than data volume scaling for effective speech enhancement systems, highlighting the importance of prioritizing high-quality training data over mere dataset expansion.

Abstract: The vast majority of modern speech enhancement systems rely on data-driven neural network models. Conventionally, larger datasets are presumed to yield superior model performance, an observation empirically validated across numerous tasks in other domains. However, recent studies reveal diminishing returns when scaling speech enhancement data. We focus on a critical factor: prevalent quality issues in "clean" training labels within large-scale datasets. This work re-examines this phenomenon and demonstrates that, within large-scale training sets, prioritizing high-quality training data is more important than merely expanding the data volume. Experimental findings suggest that models trained on a carefully curated subset of 700 hours can outperform models trained on the 2,500-hour full dataset. This outcome highlights the crucial role of data curation in scaling speech enhancement systems effectively.

[410] Speech Enhancement based on cascaded two flows

Seonggyu Lee, Sein Cheong, Sangwook Han, Kihyuk Kim, Jong Won Shin

Main category: eess.AS

TL;DR: Proposes a flow matching-based speech enhancement method that uses the same model for both enhancement and generating initial conditions, achieving better performance with fewer function evaluations compared to previous approaches.

DetailsMotivation: Existing diffusion-based speech enhancement requires high computational cost (many function evaluations), while flow matching approaches need separate predictive models for conditioning and initial sampling, adding complexity.

Method: Uses identical flow matching model for both speech enhancement and generating enhanced speech used as initial starting point and conditioning variable, eliminating need for separate predictive model.

Result: Experimental results show equivalent or better performance than baselines with same or fewer function evaluations, despite using two cascaded generative methods.

Conclusion: The proposed approach demonstrates efficient speech enhancement using a unified flow matching model that reduces computational requirements while maintaining or improving performance.

Abstract: Speech enhancement (SE) based on diffusion probabilistic models has exhibited impressive performance, while requiring a relatively high number of function evaluations (NFE). Recently, SE based on flow matching has been proposed, which showed competitive performance with a small NFE. Early approaches adopted the noisy speech as the only conditioning variable. There have been other approaches which utilize speech enhanced by a predictive model both as an additional conditioning variable and to sample an initial value, but they require a separate predictive model on top of the generative SE model. In this work, we propose to employ an identical model based on flow matching for both SE and generating enhanced speech used as an initial starting point and a conditioning variable. Experimental results showed that the proposed method required the same or fewer NFEs even with two cascaded generative methods, while achieving performance equivalent to or better than the previous baselines.

eess.IV

[411] Sex-Specific Vascular Score: A Novel Perfusion Biomarker from Supervoxel Analysis of 3D pCASL MRI

Sneha Noble, Neelam Sinha, Vaanathi Sundareshan, Thomas Gregor Issac

Main category: eess.IV

TL;DR: A novel framework using 3D pCASL MRI to compute sex-specific vascular scores for cerebrovascular health assessment through supervoxel clustering and CNN-based sex classification with 95% accuracy.

DetailsMotivation: To develop a quantitative method for assessing cerebrovascular health and disease susceptibility that accounts for sex-specific differences in brain perfusion patterns, which could enable early detection of neurodegenerative diseases like Alzheimer's.

Method: Leverages 3D pseudo-continuous arterial spin labeling MRI, supervoxel clustering for brain parcellation into homogeneous perfusion regions, extracts mean cerebral blood flow values from 186 healthy participants, and trains a custom convolutional neural network for sex classification.

Result: Achieved 95% accuracy in sex classification, highlighting robust sex-specific perfusion patterns across the brain. Also systematically evaluated regional CBF variations and age-related effects within male and female cohorts.

Conclusion: The proposed vascular risk-scoring framework enhances understanding of normative brain perfusion and aging, and may facilitate early detection and personalized interventions for neurodegenerative diseases.

Abstract: We propose a novel framework that leverages 3D pseudo-continuous arterial spin labeling (3D pCASL) MRI to compute sex-specific vascular scores that quantify cerebrovascular health and potential disease susceptibility. The brain is parcellated into spatially contiguous regions of homogeneous perfusion using supervoxel clustering, capturing both microvascular and macrovascular contributions. Mean cerebral blood flow (CBF) values are extracted from 186 cognitively healthy participants and used to train a custom convolutional neural network, achieving 95 percent accuracy in sex classification. This highlights robust, sex-specific perfusion patterns across the brain. Additionally, regional CBF variations and age-related effects are systematically evaluated within male and female cohorts. The proposed vascular risk-scoring framework enhances understanding of normative brain perfusion and aging, and may facilitate early detection and personalized interventions for neurodegenerative diseases such as Alzheimer’s.

[412] Colon Polyps Detection from Colonoscopy Images Using Deep Learning

Md Al Amin, Bikash Kumar Paul

Main category: eess.IV

TL;DR: YOLOv5l achieves 85.1% mAP for colon polyp detection using deep learning on colonoscopy images, outperforming other YOLOv5 variants.

DetailsMotivation: Colorectal cancer is a leading cause of cancer mortality worldwide, and early detection of colon polyps (precursors to cancer) through colonoscopy is critical for improving patient outcomes.

Method: Used Kvasir-SEG dataset with extensive data augmentation, split into training (80%), validation (20% of training), and testing (20%) sets. Evaluated three YOLOv5 architecture variants (YOLOv5s, YOLOv5m, YOLOv5l) for polyp detection.

Result: YOLOv5l achieved the best performance with 85.1% mean average precision (mAP) and highest average Intersection over Union (IoU) of 0.86, outperforming the other YOLOv5 variants.

Conclusion: YOLOv5l provides superior detection performance for colon polyp localization, offering a promising tool for enhancing colorectal cancer screening accuracy through automated deep learning-based object detection.

Abstract: Colon polyps are precursors to colorectal cancer, a leading cause of cancer-related mortality worldwide. Early detection is critical for improving patient outcomes. This study investigates the application of deep learning-based object detection for early polyp identification using colonoscopy images. We utilize the Kvasir-SEG dataset, applying extensive data augmentation and splitting the data into training (80%), validation (20% of training), and testing (20%) sets. Three variants of the YOLOv5 architecture (YOLOv5s, YOLOv5m, YOLOv5l) are evaluated. Experimental results show that YOLOv5l outperforms the other variants, achieving a mean average precision (mAP) of 85.1%, with the highest average Intersection over Union (IoU) of 0.86. These findings demonstrate that YOLOv5l provides superior detection performance for colon polyp localization, offering a promising tool for enhancing colorectal cancer screening accuracy.

[413] Benchmarking GPT-5 for Zero-Shot Multimodal Medical Reasoning in Radiology and Radiation Oncology

Mingzhe Hu, Zach Eidex, Shansong Wang, Mojtaba Safari, Qiang Li, Xiaofeng Yang

Main category: eess.IV

TL;DR: GPT-5 shows significant performance improvements over GPT-4o in medical imaging and physics tasks, achieving up to +20% gains in challenging anatomical regions and exceeding human passing thresholds on board exam questions.

DetailsMotivation: To evaluate whether recent advances in large multimodal models like GPT-5 translate into measurable performance gains in safety-critical medical domains such as radiology, radiation oncology, and medical physics where decision-making integrates medical images, textual reports, and quantitative data.

Method: Conducted a targeted zero-shot evaluation comparing GPT-5 and its variants (GPT-5-mini, GPT-5-nano) against GPT-4o across three representative tasks: VQA-RAD (radiology visual question answering), SLAKE (multilingual cross-modal grounding), and a curated Medical Physics Board Examination-style dataset with 150 multiple-choice questions.

Result: GPT-5 achieved the highest accuracy across all datasets with substantial gains: +20.00% in chest-mediastinal regions, +13.60% in lung-focused questions, +11.44% in brain-tissue interpretation. On physics board questions, GPT-5 attained 90.7% accuracy (136/150) exceeding human passing threshold, while GPT-4o scored 78.0%.

Conclusion: GPT-5 delivers consistent and often pronounced performance improvements over GPT-4o in both image-grounded reasoning and domain-specific numerical problem-solving, highlighting its potential to augment expert workflows in medical imaging and therapeutic physics.

Abstract: Radiology, radiation oncology, and medical physics require decision-making that integrates medical images, textual reports, and quantitative data under high-stakes conditions. With the introduction of GPT-5, it is critical to assess whether recent advances in large multimodal models translate into measurable gains in these safety-critical domains. We present a targeted zero-shot evaluation of GPT-5 and its smaller variants (GPT-5-mini, GPT-5-nano) against GPT-4o across three representative tasks: (1) VQA-RAD, a benchmark for visual question answering in radiology; (2) SLAKE, a semantically annotated, multilingual VQA dataset testing cross-modal grounding; and (3) a curated Medical Physics Board Examination-style dataset of 150 multiple-choice questions spanning treatment planning, dosimetry, imaging, and quality assurance. Across all datasets, GPT-5 achieved the highest accuracy, with substantial gains over GPT-4o of up to +20.00% in challenging anatomical regions such as the chest-mediastinal, +13.60% in lung-focused questions, and +11.44% in brain-tissue interpretation. On the board-style physics questions, GPT-5 attained 90.7% accuracy (136/150), exceeding the estimated human passing threshold, while GPT-4o trailed at 78.0%. These results demonstrate that GPT-5 delivers consistent and often pronounced performance improvements over GPT-4o in both image-grounded reasoning and domain-specific numerical problem-solving, highlighting its potential to augment expert workflows in medical imaging and therapeutic physics.

[414] Optimizing Region of Interest Selection for Effective Embedding in Video Steganography Based on Genetic Algorithms

Nizheen A. Ali, Ramadhan J. Mstafa

Main category: eess.IV

TL;DR: A novel video steganography method using Genetic Algorithm for ROI detection and AES encryption, achieving high PSNR (64-75 dB) and fast processing for real-time applications.

DetailsMotivation: The increasing need for secure data transmission over the internet requires effective steganography methods that maintain video quality while ensuring data security and privacy.

Method: Uses Genetic Algorithm to identify optimal Region of Interest (ROI) in cover video, encrypts secret data with AES standard, and embeds data using up to 10% of cover video capacity.

Result: Achieves high embedding capacity with PSNR values between 64-75 dB, indicating minimal visual distortion, and demonstrates fast encoding/decoding times suitable for real-time applications.

Conclusion: The proposed method effectively combines GA-based ROI detection with AES encryption to create a secure, efficient video steganography solution with excellent visual quality preservation and practical real-time performance.

Abstract: With the widespread use of the internet, there is an increasing need to ensure the security and privacy of transmitted data. This has led to an intensified focus on the study of video steganography, which is a technique that hides data within a video cover to avoid detection. The effectiveness of any steganography method depends on its ability to embed data without altering the original video quality while maintaining high efficiency. This paper proposes a new method for video steganography, which involves utilizing a Genetic Algorithm (GA) for identifying the Region of Interest (ROI) in the cover video. The ROI is the area in the video that is the most suitable for data embedding. The secret data is encrypted using the Advanced Encryption Standard (AES), which is a widely accepted encryption standard, before being embedded into the cover video, utilizing up to 10% of the cover video's capacity. This process ensures the security and confidentiality of the embedded data. The performance metrics for assessing the proposed method are the Peak Signal to Noise Ratio (PSNR) and the encoding and decoding time. The results show that the proposed method has a high embedding capacity and efficiency, with a PSNR ranging between 64 and 75 dB, which indicates that the embedded data is almost indistinguishable from the original video. Additionally, the method can encode and decode data quickly, making it efficient for real-time applications.
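
As an illustration of the two standard building blocks the paper combines, the sketch below encrypts a secret payload with AES (via the `cryptography` package) and computes the PSNR between a cover frame and a stego frame. The GA-based ROI search and the actual embedding step are omitted; the mode choice, key size, and toy perturbation are illustrative assumptions.

```python
import os
import numpy as np
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

def aes_encrypt(secret: bytes, key: bytes) -> tuple[bytes, bytes]:
    """Encrypt the secret payload with AES-256 in CTR mode (illustrative mode choice)."""
    nonce = os.urandom(16)
    encryptor = Cipher(algorithms.AES(key), modes.CTR(nonce)).encryptor()
    return nonce, encryptor.update(secret) + encryptor.finalize()

def psnr(cover: np.ndarray, stego: np.ndarray, peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between a cover frame and a stego frame."""
    mse = np.mean((cover.astype(np.float64) - stego.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

key = os.urandom(32)                                  # 256-bit key
nonce, ciphertext = aes_encrypt(b"secret message", key)

cover = np.random.randint(0, 256, (480, 640), dtype=np.uint8)
stego = cover.copy()
stego[::8, ::8] ^= 1                                  # toy LSB-style perturbation
print(round(psnr(cover, stego), 2), "dB")
```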

[415] PediDemi – A Pediatric Demyelinating Lesion Segmentation Dataset

Maria Popa, Gabriela Adriana Visa

Main category: eess.IV

TL;DR: First publicly available pediatric demyelinating lesion segmentation dataset with MRI scans from 13 patients, including ADEM cases, with extensive metadata and segmentation masks.

DetailsMotivation: Addressing the significant lack of publicly available datasets for pediatric demyelinating disorders beyond Multiple Sclerosis, particularly for rare conditions like ADEM.

Method: Collection of MRI scans from 13 pediatric patients with various demyelinating disorders, creation of lesion segmentation masks, and compilation of extensive patient metadata including diagnosis, treatment, and lab results.

Result: Successfully created and released the first pediatric demyelinating lesion segmentation dataset, and demonstrated its relevance by evaluating a state-of-the-art segmentation model trained on existing MS data.

Conclusion: The dataset fills a critical gap in pediatric neuroimaging research and underscores the importance of diverse datasets for improving lesion segmentation models across different demyelinating disorders.

Abstract: Demyelinating disorders of the central nervous system may have multiple causes; the most common are infections, autoimmune responses, and genetic or vascular etiology. Demyelination lesions are characterized by areas where the myelin sheath of the nerve fibers is broken or destroyed. Among these disorders, Multiple Sclerosis (MS) is the most well-known and aggressive form. Acute Disseminated Encephalomyelitis (ADEM) is another type of demyelinating disease, typically with a better prognosis. Magnetic Resonance Imaging (MRI) is widely used for diagnosing and monitoring disease progression by detecting lesions. While both adults and children can be affected, there is a significant lack of publicly available datasets for pediatric cases and demyelinating disorders beyond MS. This study introduces, for the first time, a publicly available pediatric dataset for demyelinating lesion segmentation. The dataset comprises MRI scans from 13 pediatric patients diagnosed with demyelinating disorders, including 3 with ADEM. In addition to lesion segmentation masks, the dataset includes extensive patient metadata, such as diagnosis, treatment, personal medical background, and laboratory results. To assess the quality of the dataset and demonstrate its relevance, we evaluate a state-of-the-art lesion segmentation model trained on an existing MS dataset. The results underscore the importance of diverse datasets for improving lesion segmentation across demyelinating disorders.

[416] Automated Cervical Cancer Detection through Visual Inspection with Acetic Acid in Resource-Poor Settings with Lightweight Deep Learning Models Deployed on an Android Device

Leander Melroy Maben, Keerthana Prasad, Shyamala Guruvare, Vidya Kudva, P C Siddalingaswamy

Main category: eess.IV

TL;DR: AI-powered automated cervical cancer screening using lightweight deep learning models for low-resource settings

DetailsMotivation: Cervical cancer claims many lives in low/middle-income countries despite being treatable. VIA screening is affordable but subjective and requires trained professionals. AI automation can eliminate subjectivity and enable task shifting to less trained workers.

Method: Proposed lightweight deep learning algorithm with EfficientDet-Lite3 for ROI detection and MobileNet-V2 for classification, deployed on Android devices for remote operation without internet or sophisticated infrastructure

Result: Classification model achieved 92.31% accuracy, 98.24% sensitivity, and 88.37% specificity on test dataset

Conclusion: The system presents a promising automated low-resource screening approach that can operate without highly-trained professionals, labs, or internet connectivity

Abstract: Cervical cancer is among the most commonly occurring cancers in women and claims a huge number of lives in low- and middle-income countries despite being relatively easy to treat. Several studies have shown that public screening programs can bring down cervical cancer incidence and mortality rates significantly. While several screening tests are available, visual inspection with acetic acid (VIA) presents itself as the most viable option for low-resource settings due to the affordability and simplicity of performing the test. VIA requires a trained medical professional to interpret the test and is subjective in nature. Automating VIA using AI eliminates subjectivity and would allow shifting of the task to less trained health workers. Task shifting with AI would help further expedite screening programs in low-resource settings. In our work, we propose a lightweight deep learning algorithm that includes EfficientDet-Lite3 as the Region of Interest (ROI) detector and a MobileNet-V2-based model for classification. These models would be deployed on an Android-based device that can operate remotely and provide almost instant results without the requirement of highly trained medical professionals, labs, sophisticated infrastructure, or internet connectivity. The classification model gives an accuracy of 92.31%, a sensitivity of 98.24%, and a specificity of 88.37% on the test dataset and presents itself as a promising automated low-resource screening approach.
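
A minimal sketch of the classification half of such a pipeline, assuming torchvision's MobileNetV2 with its classifier head swapped for a binary VIA-positive/negative output. The detector stage, training loop, and on-device (e.g., TFLite) export are omitted, and the two-class head is an assumption for illustration.

```python
import torch
import torch.nn as nn
from torchvision import models

# Start from an ImageNet-pretrained MobileNetV2 backbone.
model = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.DEFAULT)

# Replace the final classifier layer with a 2-class head (VIA negative / positive).
in_features = model.classifier[1].in_features
model.classifier[1] = nn.Linear(in_features, 2)

# Forward pass on a dummy cervix ROI crop (batch of 1, 3x224x224).
logits = model(torch.randn(1, 3, 224, 224))
probs = torch.softmax(logits, dim=1)
print(probs.shape)  # torch.Size([1, 2])
```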

[417] InnerGS: Internal Scenes Rendering via Factorized 3D Gaussian Splatting

Shuxin Liang, Yihan Xiao, Wenlu Tang

Main category: eess.IV

TL;DR: A novel 3D Gaussian Splatting approach that reconstructs internal scenes by modeling continuous volumetric density through inner 3D Gaussian distributions, enabling detailed internal structure reconstruction from sparse sliced data without requiring camera poses.

DetailsMotivation: Most existing 3DGS work focuses on external surface modeling, but applications require deep understanding of object interiors, necessitating internal scene reconstruction.

Method: Directly models continuous volumetric density using inner 3D Gaussian distribution, works with sparse sliced data, eliminates need for camera poses, and is plug-and-play compatible with any data modalities.

Result: Effectively reconstructs smooth and detailed internal structures from sparse sliced data.

Conclusion: The approach provides an efficient solution for internal scene reconstruction that is camera-pose-free, versatile across data modalities, and offers detailed interior modeling capabilities.

Abstract: 3D Gaussian Splatting (3DGS) has recently gained popularity for efficient scene rendering by representing scenes as explicit sets of anisotropic 3D Gaussians. However, most existing work focuses primarily on modeling external surfaces. In this work, we target the reconstruction of internal scenes, which is crucial for applications that require a deep understanding of an object's interior. By directly modeling a continuous volumetric density through the inner 3D Gaussian distribution, our model effectively reconstructs smooth and detailed internal structures from sparse sliced data. Our approach eliminates the need for camera poses, is plug-and-play, and is inherently compatible with any data modality. We provide a CUDA implementation at: https://github.com/Shuxin-Liang/InnerGS.
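
To make the "inner 3D Gaussian" idea concrete, the NumPy sketch below evaluates a volumetric density as a weighted sum of anisotropic 3D Gaussians at query points. Parameter names and values are illustrative and do not come from the InnerGS implementation.

```python
import numpy as np

def gaussian_mixture_density(x, means, covs, weights):
    """Density at query points x (N, 3) from M anisotropic 3D Gaussians."""
    density = np.zeros(len(x))
    for mu, cov, w in zip(means, covs, weights):
        diff = x - mu                        # (N, 3)
        inv = np.linalg.inv(cov)             # (3, 3)
        mahal = np.einsum("ni,ij,nj->n", diff, inv, diff)
        density += w * np.exp(-0.5 * mahal)
    return density

# Two illustrative Gaussians representing internal structure.
means = np.array([[0.0, 0.0, 0.0], [0.5, 0.2, -0.1]])
covs = np.array([np.eye(3) * 0.05, np.diag([0.02, 0.08, 0.03])])
weights = np.array([1.0, 0.6])

queries = np.random.uniform(-1, 1, size=(4, 3))
print(gaussian_mixture_density(queries, means, covs, weights))
```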

[418] Susceptibility Distortion Correction of Diffusion MRI with a single Phase-Encoding Direction

Sedigheh Dargahi, Sylvain Bouix, Christian Desrosier

Main category: eess.IV

TL;DR: Deep learning approach corrects MRI distortion using single acquisition instead of requiring paired blip-up/blip-down images like traditional methods.

DetailsMotivation: Traditional susceptibility distortion correction methods require paired acquisitions (blip-up and blip-down), limiting applicability to retrospective data with single phase encoding direction.

Method: Deep learning-based approach that corrects susceptibility-induced distortions using only a single acquisition (either blip-up or blip-down).

Result: Achieves performance comparable to traditional topup method while eliminating the need for paired acquisitions.

Conclusion: Proposed method serves as an efficient and practical alternative for susceptibility distortion correction in diffusion MRI, expanding applicability to single-acquisition datasets.

Abstract: Diffusion MRI (dMRI) is a valuable tool to map brain microstructure and connectivity by analyzing water molecule diffusion in tissue. However, acquiring dMRI data requires capturing multiple 3D brain volumes in a short time, often leading to trade-offs in image quality. One challenging artifact is susceptibility-induced distortion, which introduces significant geometric and intensity deformations. Traditional correction methods, such as topup, rely on having access to blip-up and blip-down image pairs, limiting their applicability to retrospective data acquired with a single phase encoding direction. In this work, we propose a deep learning-based approach to correct susceptibility distortions using only a single acquisition (either blip-up or blip-down), eliminating the need for paired acquisitions. Experimental results show that our method achieves performance comparable to topup, demonstrating its potential as an efficient and practical alternative for susceptibility distortion correction in dMRI.

[419] Towards Understanding and Harnessing the Transferability of Prognostic Knowledge in Computational Pathology

Pei Liu, Luping Ji, Jiaxiang Gou, Xiangxiang Zeng

Main category: eess.IV

TL;DR: This paper introduces Path-PKT, the first systematic study on prognostic knowledge transfer between different cancers using whole-slide images, addressing limitations of cancer-specific models by enabling knowledge sharing across cancer types.

DetailsMotivation: Current WSI-based prognosis models are cancer-specific and cannot leverage prognostic knowledge from other cancers, limiting their ability to handle rare tumors with limited samples and benefit from generalizable knowledge across cancers.

Method: The study curates a large dataset (UNI2-h-DSS) with 13 cancers, designs experiments to understand knowledge transfer factors, and proposes MoE-PKT - a new baseline approach with routing mechanism to utilize prognostic knowledge from other cancers.

Result: The research demonstrates transferability of prognostic knowledge between different cancers and shows that source models can be effectively transferred to rare tumor diseases with limited samples.

Conclusion: Path-PKT lays solid foundations for knowledge transfer in WSI-based cancer prognosis, enabling models to benefit from cross-cancer knowledge and better handle rare tumor diseases with limited data.

Abstract: The Whole-Slide Image (WSI) is an important tool for evaluating the prognosis of cancer patients. Present WSI-based prognosis studies generally follow a conventional paradigm – cancer-specific model development – where one cancer disease corresponds to one model, and this model cannot make use of the prognostic knowledge from others. Despite its notable success in recent years, this paradigm has inherent limitations and has long struggled with practical requirements: (i) scaling to rare tumor diseases with very limited samples and (ii) benefiting from the generalizable prognostic knowledge in other cancers. To this end, this paper presents the first systematic study on Prognostic Knowledge Transfer in Pathology, called Path-PKT. It comprises three main parts. (1) We curate a large dataset (UNI2-h-DSS) with 13 cancers and use it to evaluate the transferability of prognostic knowledge between different cancers computationally. (2) We design experiments to understand what factors affect knowledge transfer and what causes positive transfers. (3) Motivated by empirical findings, we propose a new baseline approach (MoE-PKT) with a routing mechanism to utilize the generalizable prognostic knowledge in other cancers. Finally, we show the transferability of source models to rare tumor diseases. This study could lay solid foundations for the study of knowledge transfer in WSI-based cancer prognosis. Source code is available at https://github.com/liupei101/Path-PKT.
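
A minimal PyTorch sketch of a routed mixture-of-experts head in the spirit of MoE-PKT, where a gating network softly weights per-cancer expert branches. The layer sizes, expert count, and risk-score output are assumptions for illustration and do not reproduce the released code.

```python
import torch
import torch.nn as nn

class MoEHead(nn.Module):
    """Soft routing over experts, each intended to hold knowledge from one source cancer."""
    def __init__(self, dim: int = 512, n_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))
            for _ in range(n_experts)
        )
        self.router = nn.Linear(dim, n_experts)

    def forward(self, slide_feat: torch.Tensor) -> torch.Tensor:
        gates = torch.softmax(self.router(slide_feat), dim=-1)          # (B, E)
        risks = torch.stack([e(slide_feat) for e in self.experts], -1)  # (B, 1, E)
        return (risks * gates.unsqueeze(1)).sum(-1)                     # (B, 1) risk score

head = MoEHead()
print(head(torch.randn(2, 512)).shape)  # torch.Size([2, 1])
```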

[420] State of Abdominal CT Datasets: A Critical Review of Bias, Clinical Relevance, and Real-world Applicability

Saeide Danaei, Zahra Dehghanian, Elahe Meftah, Nariman Naderi, Seyed Amir Ahmad Safavi-Naini, Faeze Khorasanizade, Hamid R. Rabiee

Main category: eess.IV

TL;DR: Systematic review of 46 public abdominal CT datasets (50,256 studies) reveals significant redundancy (59.1% case reuse), Western geographic bias (75.3% from NA/Europe), and high prevalence of domain shift (63%) and selection bias (57%) that threaten AI model generalizability.

DetailsMotivation: To evaluate the suitability of publicly available abdominal CT datasets for AI applications in clinical settings and identify biases that may undermine model performance and generalizability across diverse healthcare environments.

Method: Systematic review and critical evaluation of 46 publicly available abdominal CT datasets totaling 50,256 studies, with bias assessment performed on the 19 largest datasets (>=100 cases each).

Result: Found substantial redundancy (59.1% case reuse), Western geographic skew (75.3% from North America and Europe), and high prevalence of domain shift (63%) and selection bias (57%) in the largest datasets.

Conclusion: Proposes targeted strategies including multi-institutional collaboration, standardized protocols, and deliberate inclusion of diverse patient populations and imaging technologies to develop more equitable and clinically robust AI models for abdominal imaging.

Abstract: This systematic review critically evaluates publicly available abdominal CT datasets and their suitability for artificial intelligence (AI) applications in clinical settings. We examined 46 publicly available abdominal CT datasets (50,256 studies). Across all 46 datasets, we found substantial redundancy (59.1% case reuse) and a Western/geographic skew (75.3% from North America and Europe). A bias assessment was performed on the 19 datasets with >=100 cases; within this subset, the most prevalent high-risk categories were domain shift (63%) and selection bias (57%), both of which may undermine model generalizability across diverse healthcare environments – particularly in resource-limited settings. To address these challenges, we propose targeted strategies for dataset improvement, including multi-institutional collaboration, adoption of standardized protocols, and deliberate inclusion of diverse patient populations and imaging technologies. These efforts are crucial in supporting the development of more equitable and clinically robust AI models for abdominal imaging.

[421] subCellSAM: Zero-Shot (Sub-)Cellular Segmentation for Hit Validation in Drug Discovery

Jacob Hanimann, Daniel Siegismund, Mario Wieser, Stephan Steigele

Main category: eess.IV

TL;DR: Zero-shot cell segmentation using foundation models with self-prompting mechanism and in-context learning, eliminating need for manual tuning or fine-tuning.

DetailsMotivation: Traditional cell segmentation methods require extensive manual parameter tuning or domain-specific model fine-tuning, which is time-consuming and limits scalability in high-throughput drug discovery screening.

Method: Three-step process for nuclei, cell, and subcellular segmentation using segmentation foundation model in zero-shot setting with self-prompting mechanism that encodes morphological and topological priors using growing masks and strategically placed foreground/background points.

Result: Method accurately segments biologically relevant structures on both standard benchmarks and industry-relevant hit validation assays without dataset-specific tuning.

Conclusion: The approach provides an effective zero-shot solution for cell segmentation in high-throughput screening, eliminating the need for manual parameter optimization or model fine-tuning while maintaining accuracy.

Abstract: High-throughput screening using automated microscopes is a key driver in biopharma drug discovery, enabling the parallel evaluation of thousands of drug candidates for diseases such as cancer. Traditional image analysis and deep learning approaches have been employed to analyze these complex, large-scale datasets, with cell segmentation serving as a critical step for extracting relevant structures. However, both strategies typically require extensive manual parameter tuning or domain-specific model fine-tuning. We present a novel method that applies a segmentation foundation model in a zero-shot setting (i.e., without fine-tuning), guided by an in-context learning strategy. Our approach employs a three-step process for nuclei, cell, and subcellular segmentation, introducing a self-prompting mechanism that encodes morphological and topological priors using growing masks and strategically placed foreground/background points. We validate our method on both standard cell segmentation benchmarks and industry-relevant hit validation assays, demonstrating that it accurately segments biologically relevant structures without the need for dataset-specific tuning.
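
The paper's self-prompting with growing masks is its own contribution; the underlying operation it builds on, prompting a segmentation foundation model with foreground/background points, can be sketched with the public `segment_anything` API. The checkpoint path, model size, placeholder image, and point positions below are illustrative assumptions.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load a SAM checkpoint (path and model size are illustrative assumptions).
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")
predictor = SamPredictor(sam)

image = np.zeros((512, 512, 3), dtype=np.uint8)  # placeholder microscopy image
predictor.set_image(image)

# One foreground point on a nucleus and one background point outside it.
point_coords = np.array([[256, 256], [50, 50]])
point_labels = np.array([1, 0])  # 1 = foreground, 0 = background

masks, scores, _ = predictor.predict(
    point_coords=point_coords,
    point_labels=point_labels,
    multimask_output=True,
)
best_mask = masks[np.argmax(scores)]  # pick the highest-scoring candidate mask
```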

[422] Deep Biomechanically-Guided Interpolation for Keypoint-Based Brain Shift Registration

Tiago Assis, Ines P. Machado, Benjamin Zwick, Nuno C. Garcia, Reuben Dorent

Main category: eess.IV

TL;DR: A deep learning framework that uses biomechanical simulations to generate dense, physically plausible brain deformations from sparse keypoints, outperforming classical interpolation methods with half the mean square error.

DetailsMotivation: Accurate brain shift compensation is critical for neuronavigation during neurosurgery. Keypoint-based registration methods are robust but rely on simple geometric interpolators that ignore tissue biomechanics.

Method: Generate synthetic brain deformations using biomechanical simulations, then train a residual 3D U-Net to refine standard interpolation estimates into biomechanically guided deformations.

Result: Significantly outperforms classical interpolators, reducing mean square error by half while introducing negligible computational overhead at inference time.

Conclusion: The proposed deep learning framework provides biomechanically plausible brain deformation estimation that is both accurate and computationally efficient for neurosurgical applications.

Abstract: Accurate compensation of brain shift is critical for maintaining the reliability of neuronavigation during neurosurgery. While keypoint-based registration methods offer robustness to large deformations and topological changes, they typically rely on simple geometric interpolators that ignore tissue biomechanics to create dense displacement fields. In this work, we propose a novel deep learning framework that estimates dense, physically plausible brain deformations from sparse matched keypoints. We first generate a large dataset of synthetic brain deformations using biomechanical simulations. Then, a residual 3D U-Net is trained to refine standard interpolation estimates into biomechanically guided deformations. Experiments on a large set of simulated displacement fields demonstrate that our method significantly outperforms classical interpolators, reducing by half the mean square error while introducing negligible computational overhead at inference time. Code available at: https://github.com/tiago-assis/Deep-Biomechanical-Interpolator.
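
A simplified sketch of the baseline stage and the residual idea: a geometric interpolator (here SciPy's RBFInterpolator, an illustrative stand-in for the "standard interpolation estimate") turns sparse matched keypoints into a dense displacement field, which a residual 3D network would then refine. The keypoints, grid size, and the network itself are illustrative; the refinement step is only stubbed out in a comment.

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

# Sparse matched keypoints: positions (K, 3) in mm and their displacements (K, 3).
keypoints = np.random.rand(50, 3) * 100.0
displacements = np.random.randn(50, 3) * 2.0

# Dense grid of query voxel centres (illustrative 32^3 volume).
grid = np.stack(np.meshgrid(*[np.linspace(0, 100, 32)] * 3, indexing="ij"), axis=-1)
queries = grid.reshape(-1, 3)

# Step 1: classical geometric interpolation (thin-plate-spline kernel).
interp = RBFInterpolator(keypoints, displacements, kernel="thin_plate_spline")
dense_field = interp(queries).reshape(32, 32, 32, 3)

# Step 2 (conceptual): a residual 3D U-Net would map this estimate to a
# biomechanically plausible field, e.g. refined = dense_field + unet(dense_field).
print(dense_field.shape)  # (32, 32, 32, 3)
```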

[423] Comparing Conditional Diffusion Models for Synthesizing Contrast-Enhanced Breast MRI from Pre-Contrast Images

Sebastian Ibarra, Javier del Riego, Alessandro Catanese, Julian Cuba, Julian Cardona, Nataly Leon, Jonathan Infante, Karim Lekadir, Oliver Diaz, Richard Osuala

Main category: eess.IV

TL;DR: Diffusion models can generate synthetic contrast-enhanced breast MRI from pre-contrast images, potentially reducing need for contrast agents while maintaining diagnostic quality.

DetailsMotivation: DCE-MRI requires contrast agents that pose safety concerns, increased costs, and workflow complexity. The goal is to create synthetic contrast-enhanced images without actual contrast administration.

Method: Used pre-contrast conditioned denoising diffusion probabilistic models (22 variants) with tumor-aware loss functions and explicit tumor segmentation mask conditioning in both single-breast and full breast settings.

Result: Subtraction image-based models outperformed post-contrast-based models across 5 evaluation metrics. Tumor-aware losses and segmentation masks improved ROI evaluation. Reader study with radiologists and technologists confirmed high realism of synthetic images.

Conclusion: Generative contrast-enhancement shows emerging clinical potential for reducing contrast agent use in breast MRI while maintaining diagnostic quality, though tumor localization inputs may not always be available in screening settings.

Abstract: Dynamic contrast-enhanced (DCE) MRI is essential for breast cancer diagnosis and treatment. However, its reliance on contrast agents introduces safety concerns, contraindications, increased cost, and workflow complexity. To this end, we present pre-contrast conditioned denoising diffusion probabilistic models to synthesize DCE-MRI, introducing, evaluating, and comparing a total of 22 generative model variants in both single-breast and full breast settings. Towards enhancing lesion fidelity, we introduce both tumor-aware loss functions and explicit tumor segmentation mask conditioning. Using a public multicenter dataset and comparing to respective pre-contrast baselines, we observe that subtraction image-based models consistently outperform post-contrast-based models across five complementary evaluation metrics. Apart from assessing the entire image, we also separately evaluate the region of interest, where both tumor-aware losses and segmentation mask inputs improve evaluation metrics. The latter notably enhance qualitative results capturing contrast uptake, albeit assuming access to tumor localization inputs that are not guaranteed to be available in screening settings. A reader study involving 2 radiologists and 4 MRI technologists confirms the high realism of the synthetic images, indicating an emerging clinical potential of generative contrast-enhancement. We share our codebase at https://github.com/sebastibar/conditional-diffusion-breast-MRI.
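
A hedged sketch of what a tumor-aware reconstruction loss can look like: a standard L1 term plus an extra term restricted to the tumor segmentation mask. The weighting and exact formulation are assumptions for illustration, not the paper's losses.

```python
import torch
import torch.nn.functional as F

def tumor_aware_l1(pred, target, tumor_mask, lesion_weight: float = 5.0):
    """L1 over the whole image plus an up-weighted L1 inside the tumor mask."""
    global_term = F.l1_loss(pred, target)
    masked = tumor_mask.bool()
    if masked.any():
        lesion_term = F.l1_loss(pred[masked], target[masked])
    else:
        lesion_term = pred.new_tensor(0.0)
    return global_term + lesion_weight * lesion_term

pred = torch.rand(1, 1, 64, 64)
target = torch.rand(1, 1, 64, 64)
mask = torch.zeros(1, 1, 64, 64)
mask[..., 20:30, 20:30] = 1  # illustrative lesion region
print(tumor_aware_l1(pred, target, mask))
```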

[424] Direct vascular territory segmentation on cerebral digital subtraction angiography

P. Matthijs van der Sluijs, Lotte Strong, Frank G. te Nijenhuis, Sandra Cornelissen, Pieter Jan van Doormaal, Geert Lycklama a Nijeholt, Wim van Zwam, Ad van Es, Diederik Dippel, Aad van der Lugt, Danny Ruijters, Ruisheng Su, Theo van Walsum

Main category: eess.IV

TL;DR: Deep learning model outperforms traditional atlas method for segmenting vascular territories in cerebral DSA imaging during stroke treatment

DetailsMotivation: DSA imaging primarily shows vessels but lacks visibility of soft tissue anatomy, which could aid physicians during minimally invasive interventions for ischemic stroke treatment

Method: Trained an nnUNet model with manually segmented intracranial carotid artery and middle cerebral artery vessel territories on minimal intensity projection DSA from ischemic stroke patients

Result: Segmentation model achieved significantly higher Dice similarity coefficient (0.96 vs 0.82) and lower average surface distance (13.8 vs 47.3) compared to atlas model, with higher success rate (85% vs 66%)

Conclusion: Deep learning segmentation of vascular territories without explicit borders on cerebral DSA is superior to traditional atlas-based methods and has potential for broader application in X-ray guided medical procedures

Abstract: X-ray digital subtraction angiography (DSA) is frequently used when evaluating minimally invasive medical interventions. DSA predominantly visualizes vessels, and soft tissue anatomy is less visible or invisible in DSA. Visualization of cerebral anatomy could aid physicians during treatment. This study aimed to develop and evaluate a deep learning model to predict vascular territories that are not explicitly visible in DSA imaging acquired during ischemic stroke treatment. We trained an nnUNet model with manually segmented intracranial carotid artery and middle cerebral artery vessel territories on minimal intensity projection DSA acquired during ischemic stroke treatment. We compared the model to a traditional atlas registration model using the Dice similarity coefficient (DSC) and average surface distance (ASD). Additionally, we qualitatively assessed the success rate in both models using an external test. The segmentation model was trained on 1224 acquisitions from 361 patients with ischemic stroke. The segmentation model had a significantly higher DSC (0.96 vs 0.82, p<0.001) and lower ASD compared to the atlas model (13.8 vs 47.3, p<0.001). The success rate of the segmentation model (85%) was higher compared to the atlas registration model (66%) in the external test set. A deep learning method for the segmentation of vascular territories without explicit borders on cerebral DSA demonstrated superior accuracy and quality compared to the traditional atlas-based method. This approach has the potential to be applied to other anatomical structures for enhanced visualization during X-ray guided medical procedures. The code is publicly available at https://github.com/RuishengSu/autoTICI.
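
The comparison above is reported with the Dice similarity coefficient; a minimal NumPy implementation for binary vascular-territory masks is sketched below (the average surface distance is omitted for brevity, and the masks are synthetic).

```python
import numpy as np

def dice_coefficient(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8) -> float:
    """Dice similarity coefficient between two binary masks."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    return (2.0 * intersection + eps) / (pred.sum() + gt.sum() + eps)

a = np.zeros((128, 128), dtype=bool); a[30:90, 30:90] = True
b = np.zeros((128, 128), dtype=bool); b[35:95, 35:95] = True
print(round(dice_coefficient(a, b), 3))
```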

[425] Improving Deep Learning for Accelerated MRI With Data Filtering

Kang Lin, Anselm Krainovic, Kun Wang, Reinhard Heckel

Main category: eess.IV

TL;DR: Data curation strategies improve MRI reconstruction performance, with filtering training data providing consistent gains especially when in-distribution data is limited.

DetailsMotivation: Most deep learning MRI research focuses on network architectures using fixed homogeneous data, but data curation strategies could significantly impact reconstruction quality.

Method: Assembled large dataset (1.1M images from 18 sources), constructed diverse evaluation set (48 test sets), and proposed/studied different data filtering strategies for training neural networks.

Result: Filtering training data leads to consistent, albeit modest, performance gains across different training set sizes and accelerations, with particular benefit when in-distribution data proportion is low.

Conclusion: Data curation through filtering is an effective strategy for improving MRI reconstruction performance, complementing architectural improvements in neural networks.

Abstract: Deep neural networks achieve state-of-the-art results for accelerated MRI reconstruction. Most research on deep learning based imaging focuses on improving neural network architectures trained and evaluated on fixed and homogeneous training and evaluation data. In this work, we investigate data curation strategies for improving MRI reconstruction. We assemble a large dataset of raw k-space data from 18 public sources consisting of 1.1M images and construct a diverse evaluation set comprising 48 test sets, capturing variations in anatomy, contrast, number of coils, and other key factors. We propose and study different data filtering strategies to enhance performance of current state-of-the-art neural networks for accelerated MRI reconstruction. Our experiments show that filtering the training data leads to consistent, albeit modest, performance gains. These performance gains are robust across different training set sizes and accelerations, and we find that filtering is particularly beneficial when the proportion of in-distribution data in the unfiltered training set is low.

[426] Latent Interpolation Learning Using Diffusion Models for Cardiac Volume Reconstruction

Niklas Bubeck, Suprosanna Shit, Chen Chen, Can Zhao, Pengfei Guo, Dong Yang, Georg Zitzlsberger, Daguang Xu, Bernhard Kainz, Daniel Rueckert, Jiazhen Pan

Main category: eess.IV

TL;DR: CaLID is a novel diffusion-based framework for 3D cardiac reconstruction from sparse 2D CMR slices that eliminates need for auxiliary inputs, achieves 24x speedup, and delivers state-of-the-art performance.

DetailsMotivation: Current cardiac MRI reconstruction methods are limited by sparse 2D slice acquisition, reliance on predefined interpolation schemes, computational inefficiency, and dependence on additional semantic inputs like segmentation labels.

Method: A cardiac latent interpolation diffusion (CaLID) framework using data-driven diffusion models in latent space, with three innovations: diffusion-based interpolation, latent space operation for efficiency, and extension to 2D+T spatiotemporal modeling.

Result: Achieves 24x faster 3D whole-heart upsampling, SOTA performance without auxiliary inputs, superior reconstruction quality in volumetric evaluations, and effective spatiotemporal dynamics modeling.

Conclusion: CaLID advances spatio and spatiotemporal whole-heart reconstruction, offering a robust and clinically practical solution that addresses fundamental limitations of existing cardiac imaging approaches.

Abstract: Cardiac Magnetic Resonance (CMR) imaging is a critical tool for diagnosing and managing cardiovascular disease, yet its utility is often limited by the sparse acquisition of 2D short-axis slices, resulting in incomplete volumetric information. Accurate 3D reconstruction from these sparse slices is essential for comprehensive cardiac assessment, but existing methods face challenges, including reliance on predefined interpolation schemes (e.g., linear or spherical), computational inefficiency, and dependence on additional semantic inputs such as segmentation labels or motion data. To address these limitations, we propose a novel Cardiac Latent Interpolation Diffusion (CaLID) framework that introduces three key innovations. First, we present a data-driven interpolation scheme based on diffusion models, which can capture complex, non-linear relationships between sparse slices and improves reconstruction accuracy. Second, we design a computationally efficient method that operates in the latent space and speeds up 3D whole-heart upsampling time by a factor of 24, reducing computational overhead compared to previous methods. Third, with only sparse 2D CMR images as input, our method achieves SOTA performance against baseline methods, eliminating the need for auxiliary input such as morphological guidance, thus simplifying workflows. We further extend our method to 2D+T data, enabling the effective modeling of spatiotemporal dynamics and ensuring temporal coherence. Extensive volumetric evaluations and downstream segmentation tasks demonstrate that CaLID achieves superior reconstruction quality and efficiency. By addressing the fundamental limitations of existing approaches, our framework advances the state of the art for spatio and spatiotemporal whole-heart reconstruction, offering a robust and clinically practical solution for cardiovascular imaging.
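
For context, the predefined interpolation schemes CaLID replaces (linear and spherical interpolation between the latents of two adjacent short-axis slices) can be written in a few lines; a hedged sketch follows, with the diffusion-based interpolation itself left out and the latent shapes chosen arbitrarily.

```python
import torch

def lerp(z0: torch.Tensor, z1: torch.Tensor, t: float) -> torch.Tensor:
    """Linear interpolation between two latent slices."""
    return (1 - t) * z0 + t * z1

def slerp(z0: torch.Tensor, z1: torch.Tensor, t: float, eps: float = 1e-7) -> torch.Tensor:
    """Spherical interpolation, treating flattened latents as directions."""
    a, b = z0.flatten(), z1.flatten()
    omega = torch.arccos(torch.clamp(
        torch.dot(a, b) / (a.norm() * b.norm() + eps), -1.0, 1.0))
    so = torch.sin(omega)
    if so < eps:                      # nearly parallel: fall back to lerp
        return lerp(z0, z1, t)
    out = (torch.sin((1 - t) * omega) * a + torch.sin(t * omega) * b) / so
    return out.view_as(z0)

z0, z1 = torch.randn(4, 16, 16), torch.randn(4, 16, 16)
mid = slerp(z0, z1, 0.5)  # latent for the slice halfway between z0 and z1
print(mid.shape)
```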

[427] A Novel Attention-Augmented Wavelet YOLO System for Real-time Brain Vessel Segmentation on Transcranial Color-coded Doppler

Wenxuan Zhang, Shuai Li, Xinyi Wang, Yu Sun, Hongyu Kang, Pui Yuk Chryste Wan, Yong-Ping Zheng, Sai-Kit Lam

Main category: eess.IV

TL;DR: AI-powered real-time Circle of Willis segmentation system using novel AAW-YOLO network achieves high accuracy (Dice 0.901) for TCCD imaging, reducing operator dependence in cerebrovascular screening.

DetailsMotivation: TCCD imaging offers radiation-free, affordable brain vessel assessment but requires expert operators for reliable interpretation, limiting widespread adoption. AI automation can address this limitation.

Method: Developed Attention-Augmented Wavelet YOLO (AAW-YOLO) network specifically for TCCD data, trained on 738 annotated frames with 3,419 labeled artery instances for real-time cerebrovascular segmentation.

Result: Achieved excellent performance: Dice 0.901, IoU 0.823, precision 0.882, recall 0.926, mAP 0.953 with fast inference speed of 14.199 ms per frame for both ipsilateral and contralateral vessel segmentation.

Conclusion: The AI system provides practical solution to reduce operator dependence in TCCD-based cerebrovascular screening, with potential for clinical workflow integration and resource-constrained settings. Future work includes bilateral modeling and larger validation.

Abstract: The Circle of Willis (CoW), vital for ensuring consistent blood flow to the brain, is closely linked to ischemic stroke. Accurate assessment of the CoW is important for identifying individuals at risk and guiding appropriate clinical management. Among existing imaging methods, Transcranial Color-coded Doppler (TCCD) offers unique advantages due to its radiation-free nature, affordability, and accessibility. However, reliable TCCD assessments depend heavily on operator expertise for identifying anatomical landmarks and performing accurate angle correction, which limits its widespread adoption. To address this challenge, we propose an AI-powered, real-time CoW auto-segmentation system capable of efficiently capturing cerebral arteries. No prior studies have explored AI-driven cerebrovascular segmentation using TCCD. In this work, we introduce a novel Attention-Augmented Wavelet YOLO (AAW-YOLO) network tailored for TCCD data, designed to provide real-time guidance for brain vessel segmentation in the CoW. We prospectively collected TCCD data comprising 738 annotated frames and 3,419 labeled artery instances to establish a high-quality dataset for model training and evaluation. The proposed AAW-YOLO demonstrated strong performance in segmenting both ipsilateral and contralateral CoW vessels, achieving an average Dice score of 0.901, IoU of 0.823, precision of 0.882, recall of 0.926, and mAP of 0.953, with a per-frame inference speed of 14.199 ms. This system offers a practical solution to reduce reliance on operator experience in TCCD-based cerebrovascular screening, with potential applications in routine clinical workflows and resource-constrained settings. Future research will explore bilateral modeling and larger-scale validation.

[428] Learning to See Through Flare

Xiaopeng Peng, Heath Gemar, Erin Fleet, Kyle Novak, Abbie Watnik, Grover Swartzlander

Main category: eess.IV

TL;DR: NeuSee is a computational imaging framework that uses neural networks and diffractive optics to protect camera sensors from laser damage while maintaining image quality across the full visible spectrum.

DetailsMotivation: Machine vision systems are vulnerable to laser flare that can blind or permanently damage sensors through oversaturation. Existing protection methods are limited and don't provide full-spectrum coverage.

Method: Jointly learns a neural representation of a diffractive optical element (DOE) and a frequency-space Mamba-GAN network for image restoration. Uses adversarial end-to-end training on 100K images with heterogeneous data parallelism.

Result: Can suppress laser irradiance up to 10^6 times the sensor saturation threshold while maintaining image quality. Achieves 10.1% improvement in restored image quality over other learned DOEs.

Conclusion: NeuSee is the first framework to achieve full-spectrum imaging with laser suppression, handling dynamic laser conditions, lens flare, ambient lighting variations, and sensor noise effectively.

Abstract: Machine vision systems are susceptible to laser flare, where unwanted intense laser illumination blinds and distorts its perception of the environment through oversaturation or permanent damage to sensor pixels. We introduce NeuSee, the first computational imaging framework for high-fidelity sensor protection across the full visible spectrum. It jointly learns a neural representation of a diffractive optical element (DOE) and a frequency-space Mamba-GAN network for image restoration. The NeuSee system is adversarially trained end-to-end on 100K unique images to suppress peak laser irradiance as high as 10^6 times the sensor saturation threshold I_sat, the point at which camera sensors may experience damage without the DOE. Our system leverages heterogeneous data and model parallelism for distributed computing, integrating hyperspectral information and multiple neural networks for realistic simulation and image restoration. NeuSee takes into account open-world scenes with dynamically varying laser wavelengths, intensities, and positions, as well as lens flare effects, unknown ambient lighting conditions, and sensor noises. It outperforms other learned DOEs, achieving full-spectrum imaging and laser suppression for the first time, with a 10.1% improvement in restored image quality.

[429] MMIS-Net for Retinal Fluid Segmentation and Detection

Nchongmaje Ndipenoch, Alina Miron, Kezhi Wang, Yongmin Li

Main category: eess.IV

TL;DR: MMIS-Net leverages multiple medical image datasets across modalities and organs using Similarity Fusion blocks and one-hot label space to achieve state-of-the-art segmentation performance on unseen data.

DetailsMotivation: Most deep learning methods use single-source data, overlooking the potential of combining multiple annotated medical image datasets from different modalities, organs, and diseases to improve generalization on unseen data.

Method: Proposed MMIS-Net with Similarity Fusion blocks for supervision and pixel-wise similarity knowledge selection, plus a one-hot label space to handle inconsistent class definitions and label contradictions across datasets.

Result: Achieved best mean Dice score of 0.83 and absolute volume difference of 0.035 for fluids segmentation, and perfect AUC of 1 for fluid detection on RETOUCH challenge, outperforming foundation models and state-of-the-art methods.

Conclusion: The combination of Similarity Fusion blocks for feature fusion and one-hot label space for handling class inconsistencies effectively leverages multiple datasets to achieve superior segmentation performance on unseen medical images.

Abstract: Purpose: Deep learning methods have shown promising results in the segmentation and detection of diseases in medical images. However, most methods are trained and tested on data from a single source, modality, organ, or disease type, overlooking the combined potential of other available annotated data. Numerous small annotated medical image datasets from various modalities, organs, and diseases are publicly available. In this work, we aim to leverage the synergistic potential of these datasets to improve performance on unseen data. Approach: To this end, we propose a novel algorithm called MMIS-Net (MultiModal Medical Image Segmentation Network), which features Similarity Fusion blocks that utilize supervision and pixel-wise similarity knowledge selection for feature map fusion. Additionally, to address inconsistent class definitions and label contradictions, we created a one-hot label space to handle classes absent in one dataset but annotated in another. MMIS-Net was trained on 10 datasets encompassing 19 organs across 2 modalities to build a single model. Results: The algorithm was evaluated on the RETOUCH grand challenge hidden test set, outperforming large foundation models for medical image segmentation and other state-of-the-art algorithms. We achieved the best mean Dice score of 0.83 and an absolute volume difference of 0.035 for the fluids segmentation task, as well as a perfect Area Under the Curve of 1 for the fluid detection task. Conclusion: The quantitative results highlight the effectiveness of our proposed model due to the incorporation of Similarity Fusion blocks into the network's backbone for supervision and similarity knowledge selection, and the use of a one-hot label space to address label class inconsistencies and contradictions.

[430] Real-Time, Population-Based Reconstruction of 3D Bone Models via Very-Low-Dose Protocols

Yiqun Lin, Haoran Sun, Yongqing Li, Rabia Aslam, Lung Fung Tse, Tiange Cheng, Chun Sing Chui, Wing Fung Yau, Victorine R. Le Meur, Meruyert Amangeldy, Kiho Cho, Yinyu Ye, James Zou, Wei Zhao, Xiaomeng Li

Main category: eess.IV

TL;DR: SSR-KD is an AI framework that reconstructs high-quality bone models from biplanar X-rays in 30 seconds with <1.0mm error, eliminating CT dependency and manual work while enabling intraoperative use.

DetailsMotivation: Traditional CT-based bone modeling has limitations including radiation exposure, time-consuming manual delineation, and inability for intraoperative use, creating need for faster, safer alternatives.

Method: Semi-Supervised Reconstruction with Knowledge Distillation (SSR-KD) framework that uses biplanar X-rays instead of CT scans to reconstruct 3D bone models through AI-driven processing.

Result: Achieves reconstruction in 30 seconds with average error under 1.0mm; expert osteotomy simulations showed comparable clinical applicability to CT-based models.

Conclusion: The approach accelerates bone modeling, reduces radiation exposure, enables intraoperative guidance, and significantly improves practicality for orthopedic applications.

Abstract: Patient-specific bone models are essential for designing surgical guides and preoperative planning, as they enable the visualization of intricate anatomical structures. However, traditional CT-based approaches for creating bone models are limited to preoperative use due to the low flexibility and high radiation exposure of CT and time-consuming manual delineation. Here, we introduce Semi-Supervised Reconstruction with Knowledge Distillation (SSR-KD), a fast and accurate AI framework to reconstruct high-quality bone models from biplanar X-rays in 30 seconds, with an average error under 1.0 mm, eliminating the dependence on CT and manual work. Additionally, high tibial osteotomy simulation was performed by experts on reconstructed bone models, demonstrating that bone models reconstructed from biplanar X-rays have comparable clinical applicability to those annotated from CT. Overall, our approach accelerates the process, reduces radiation exposure, enables intraoperative guidance, and significantly improves the practicality of bone models, offering transformative applications in orthopedics.

[431] UNICON: UNIfied CONtinual Learning for Medical Foundational Models

Mohammad Areeb Qazi, Munachiso S Nwadike, Ibrahim Almakky, Mohammad Yaqub, Numan Saeed

Main category: eess.IV

TL;DR: UNICON enables medical foundation models to continually adapt to new domains, tasks, and modalities without catastrophic forgetting, achieving improved performance across multiple medical imaging tasks.

DetailsMotivation: Medical imaging faces data scarcity challenges for pre-training models for every domain/modality/task. Continual learning offers a solution by enabling models to integrate new knowledge without requiring large datasets for each training phase.

Method: UNICON framework provides unified, perpetually expandable continual learning that allows foundation models to dynamically expand across imaging modalities, anatomical regions, and clinical objectives through careful integration.

Result: Adapted chest CT foundation model to prognosis and segmentation tasks with improved performance. Continually incorporated PET scans achieving 5% improvement in Dice score compared to baselines.

Conclusion: Foundation models are not inherently constrained to initial training scope and can evolve, paving the way toward generalist AI models for medical imaging through unified continual learning approaches.

Abstract: Foundational models are trained on extensive datasets to capture the general trends of a domain. However, in medical imaging, the scarcity of data makes pre-training for every domain, modality, or task challenging. Continual learning offers a solution by fine-tuning a model sequentially on different domains or tasks, enabling it to integrate new knowledge without requiring large datasets for each training phase. In this paper, we propose UNIfied CONtinual Learning for Medical Foundational Models (UNICON), a framework that enables the seamless adaptation of foundation models to diverse domains, tasks, and modalities. Unlike conventional adaptation methods that treat these changes in isolation, UNICON provides a unified, perpetually expandable framework. Through careful integration, we show that foundation models can dynamically expand across imaging modalities, anatomical regions, and clinical objectives without catastrophic forgetting or task interference. Empirically, we validate our approach by adapting a chest CT foundation model initially trained for classification to a prognosis and segmentation task. Our results show improved performance across both additional tasks. Furthermore, we continually incorporated PET scans and achieved a 5% improvement in Dice score compared to respective baselines. These findings establish that foundation models are not inherently constrained to their initial training scope but can evolve, paving the way toward generalist AI models for medical imaging.

[432] Regional quality estimation for echocardiography using deep learning

Gilles Van De Vyver, Svein-Erik Måsøy, Håvard Dalen, Bjørnar Leangen Grenne, Espen Holte, Sindre Hellum Olaisen, John Nyberg, Andreas Østvik, Lasse Løvstakken, Erik Smistad

Main category: eess.IV

TL;DR: Three methods for automatic cardiac ultrasound image quality assessment were compared: pixel-based metrics, local coherence analysis, and end-to-end deep learning. The deep learning approach achieved the best performance (rho=0.69), comparable to inter-observer correlation.

DetailsMotivation: Previous cardiac ultrasound quality assessment methods fail to distinguish view correctness from image quality and only provide global quality scores, limiting practical utility. Regional quality assessment is needed for clinical applications.

Method: Three approaches: 1) Classic pixel-based metrics (gCNR) using U-Net segmentation, 2) Local image coherence from U-Net predictions, 3) End-to-end deep convolutional network for regional quality prediction.

Result: gCNR performed poorly (rho=0.24). End-to-end learning achieved best results (rho=0.69), comparable to inter-observer correlation (rho=0.63). Coherence-based method performed well (rho=0.58) and is more generic.

Conclusion: Deep learning approaches outperform traditional metrics for regional image quality assessment. The end-to-end model provides best accuracy while coherence method offers better generalization. Tool available as open-source Python library.

Abstract: Automatic estimation of cardiac ultrasound image quality can be beneficial for guiding operators and ensuring the accuracy of clinical measurements. Previous work often fails to distinguish the view correctness of the echocardiogram from the image quality. Additionally, previous studies only provide a global image quality value, which limits their practical utility. In this work, we developed and compared three methods to estimate image quality: (1) classic pixel-based metrics such as the generalized contrast-to-noise ratio (gCNR), computed on myocardial segments as the region of interest and the left ventricle lumen as background, obtained using a U-Net segmentation; (2) local image coherence derived from a U-Net model that predicts coherence from B-mode images; (3) a deep convolutional network that predicts the quality of each region directly in an end-to-end fashion. We evaluate each method against manual regional image quality annotations by three experienced cardiologists. The results indicate poor performance of the gCNR metric, with Spearman correlation to the annotations of rho = 0.24. The end-to-end learning model obtains the best result, rho = 0.69, comparable to the inter-observer correlation, rho = 0.63. Finally, the coherence-based method, with rho = 0.58, outperformed the classical metrics and is more generic than the end-to-end approach. The image quality prediction tool is available as an open-source Python library at https://github.com/GillesVanDeVyver/arqee.
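
For reference, the gCNR baseline in the comparison has a compact definition: one minus the overlap between the intensity histograms of the region of interest (myocardial segment) and the background (left-ventricle lumen). A minimal NumPy sketch with illustrative synthetic intensities:

```python
import numpy as np

def gcnr(roi: np.ndarray, background: np.ndarray, bins: int = 256) -> float:
    """Generalized contrast-to-noise ratio: 1 minus the overlap of the
    intensity histograms of the region of interest and the background."""
    lo = min(roi.min(), background.min())
    hi = max(roi.max(), background.max())
    p_roi, _ = np.histogram(roi, bins=bins, range=(lo, hi), density=True)
    p_bg, _ = np.histogram(background, bins=bins, range=(lo, hi), density=True)
    bin_width = (hi - lo) / bins
    overlap = np.sum(np.minimum(p_roi, p_bg)) * bin_width
    return 1.0 - overlap

# Illustrative example: myocardial segment vs. left-ventricle lumen intensities.
myocardium = np.random.normal(120, 20, 5000)
lumen = np.random.normal(60, 20, 5000)
print(round(gcnr(myocardium, lumen), 3))
```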

[433] MedVisionLlama: Leveraging Pre-Trained Large Language Model Layers to Enhance Medical Image Segmentation

Gurucharan Marthi Krishna Kumar, Aman Chadha, Janine Mendola, Amir Shmuel

Main category: eess.IV

TL;DR: Integrating frozen pre-trained LLM transformer blocks into Vision Transformers significantly improves medical image segmentation performance across various modalities.

DetailsMotivation: To enhance medical image segmentation accuracy by leveraging the versatile capabilities of Large Language Models (LLMs) in combination with Vision Transformers, as accurate segmentation is crucial for diagnostic imaging.

Method: Proposed a hybrid approach that incorporates frozen pre-trained LLM transformer blocks into ViT encoder, combined with Hybrid Attention Mechanism for global/local feature learning and Multi-Scale Fusion Block for feature aggregation across scales.

Result: Substantial performance improvements including average Dice score increase from 0.74 to 0.79, along with gains in accuracy, precision, and Jaccard Index across various medical imaging modalities.

Conclusion: LLM-based transformers are highly effective for refining medical image segmentation, demonstrating significant potential to boost model accuracy and robustness in diagnostic applications.

Abstract: Large Language Models (LLMs), known for their versatility in textual data, are increasingly being explored for their potential to enhance medical image segmentation, a crucial task for accurate diagnostic imaging. This study explores enhancing Vision Transformers (ViTs) for medical image segmentation by integrating pre-trained LLM transformer blocks. Our approach, which incorporates a frozen LLM transformer block into the encoder of a ViT-based model, leads to substantial improvements in segmentation performance across various medical imaging modalities. We propose a Hybrid Attention Mechanism that combines global and local feature learning with a Multi-Scale Fusion Block for aggregating features across different scales. The enhanced model shows significant performance gains, including an average Dice score increase from 0.74 to 0.79 and improvements in accuracy, precision, and the Jaccard Index. These results demonstrate the effectiveness of LLM-based transformers in refining medical image segmentation, highlighting their potential to significantly boost model accuracy and robustness. The source code and our implementation are available at: https://github.com/AS-Lab/Marthi-et-al-2025-MedVisionLlama-Pre-Trained-LLM-Layers-to-Enhance-Medical-Image-Segmentation
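
A conceptual PyTorch sketch of the core idea, inserting a frozen pre-trained transformer block into a ViT-style encoder with linear projections and a residual connection around it. The layer choice, projection sizes, and the stand-in transformer block are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class FrozenLLMBlockAdapter(nn.Module):
    """Wrap a pre-trained transformer block, freeze it, and project ViT tokens
    into and out of its hidden size (illustrative, not the paper's exact design)."""
    def __init__(self, llm_block: nn.Module, vit_dim: int = 768, llm_dim: int = 4096):
        super().__init__()
        self.proj_in = nn.Linear(vit_dim, llm_dim)
        self.proj_out = nn.Linear(llm_dim, vit_dim)
        self.block = llm_block
        for p in self.block.parameters():
            p.requires_grad = False   # keep the LLM layer frozen

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        hidden = self.proj_in(tokens)
        hidden = self.block(hidden)
        return tokens + self.proj_out(hidden)   # residual around the frozen block

# Stand-in for one LLM transformer layer (a real setup would load, e.g., a Llama block).
dummy_block = nn.TransformerEncoderLayer(d_model=4096, nhead=8, batch_first=True)
adapter = FrozenLLMBlockAdapter(dummy_block)
print(adapter(torch.randn(2, 197, 768)).shape)  # torch.Size([2, 197, 768])
```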

[434] RadGPT: Constructing 3D Image-Text Tumor Datasets

Pedro R. A. S. Bassi, Mehmet Can Yavuz, Kang Wang, Xiaoxi Chen, Wenxuan Li, Sergio Decherchi, Andrea Cavalli, Yang Yang, Alan Yuille, Zongwei Zhou

Main category: eess.IV

TL;DR: AbdomenAtlas 3.0 is the first public abdominal CT dataset with expert-reviewed radiology reports paired with tumor masks, addressing the lack of detailed reports in existing datasets for AI development.

DetailsMotivation: Public CT datasets often lack detailed radiology reports, limiting their usefulness for developing accurate AI report generation systems for cancer detection.

Method: Created AbdomenAtlas 3.0 with 9,262 CT-mask-report triplets using RadGPT framework, where radiologists converted revised tumor segmentation masks into structured reports. Expanded tumor masks by 4.2x across 17 public datasets.

Result: Dataset contains 3,955 tumor cases with standardized reports including tumor size, location, attenuation and surgical resectability. Segmentation-assisted methods significantly improved tumor detection in AI-generated reports compared to state-of-the-art models.

Conclusion: Segmentation strongly enhances tumor detection in AI report generation, and RadGPT serves as both a dataset creation tool and a fully-automatic segmentation-assisted report generation method.

Abstract: Cancers identified in CT scans are usually accompanied by detailed radiology reports, but publicly available CT datasets often lack these essential reports. This absence limits their usefulness for developing accurate report generation AI. To address this gap, we present AbdomenAtlas 3.0, the first public, high-quality abdominal CT dataset with detailed, expert-reviewed radiology reports. All reports are paired with per-voxel masks and they describe liver, kidney and pancreatic tumors. AbdomenAtlas 3.0 has 9,262 triplets of CT, mask, and report, of which 3,955 include tumors. These CT scans come from 17 public datasets. Besides creating the reports for these datasets, we expanded their number of tumor masks by 4.2x, identifying 3,011 new tumor cases. Notably, the reports in AbdomenAtlas 3.0 are more standardized and generated faster than traditional human-made reports. They provide details like tumor size, location, attenuation and surgical resectability. These reports were created by 12 board-certified radiologists using our proposed RadGPT, a novel framework that converted radiologist-revised tumor segmentation masks into structured and narrative reports. Besides being a dataset creation tool, RadGPT can also become a fully-automatic, segmentation-assisted report generation method. We benchmarked this method and 5 state-of-the-art report generation vision-language models. Our results show that segmentation strongly improves tumor detection in AI-made reports.

[435] BRISC: Annotated Dataset for Brain Tumor Segmentation and Classification with Swin-HAFNet

Amirreza Fateh, Yasin Rezvani, Sara Moayedi, Sadjad Rezvani, Fatemeh Fateh, Mansoor Fateh

Main category: eess.IV

TL;DR: New BRISC MRI dataset with 6,000 annotated brain tumor scans and transformer-based model achieves 82.3% IoU for segmentation and 99.63% accuracy for classification across four diagnostic categories.

DetailsMotivation: Lack of high-quality, balanced, and diverse datasets for brain tumor segmentation and classification from MRI scans hinders medical image analysis progress.

Method: Created BRISC dataset with 6,000 contrast-enhanced T1-weighted MRI scans annotated by radiologists, and developed a transformer-based model using Swin Transformer backbone for multi-scale feature representations.

Result: Proposed model achieved 82.3% weighted mean IoU for segmentation and 99.63% accuracy for classification across glioma, meningioma, pituitary, and non-tumorous cases.

Conclusion: The BRISC dataset and transformer-based approach significantly advance brain tumor analysis capabilities, demonstrating superior performance in both segmentation and classification tasks.

Abstract: Accurate segmentation and classification of brain tumors from Magnetic Resonance Imaging (MRI) remain key challenges in medical image analysis. This is primarily due to the lack of high-quality, balanced, and diverse datasets. In this work, we present a newly developed MRI dataset named BRISC designed specifically for brain tumor segmentation and classification tasks. The dataset comprises 6,000 contrast-enhanced T1-weighted MRI scans annotated by certified radiologists and physicians. It includes three major tumor types, namely glioma, meningioma, and pituitary, as well as non-tumorous cases. Each sample includes high-resolution labels and is categorized across axial, sagittal, and coronal imaging planes to facilitate robust model development and cross-view generalization. To demonstrate the utility of the dataset, we propose a transformer-based segmentation model and benchmark it against established baselines. In this work, we propose a transformer-based model designed for both segmentation and classification of brain tumors, leveraging multi-scale feature representations from a Swin Transformer backbone. The model is benchmarked against established baselines to demonstrate the utility of the dataset, enabling accurate segmentation and robust classification across four diagnostic categories: glioma, meningioma, pituitary, and non-tumorous cases. In this work, our proposed transformer-based model demonstrates superior performance in both segmentation and classification tasks for brain tumor analysis. For the segmentation task, the method achieves the highest weighted mean Intersection-over-Union (IoU) of 82.3%, with improvements observed across all tumor categories. For the classification task, the model attains an accuracy of 99.63%, effectively distinguishing between glioma, meningioma, pituitary, and non-tumorous cases. https://www.kaggle.com/datasets/briscdataset/brisc2025/

[436] UltraDfeGAN: Detail-Enhancing Generative Adversarial Networks for High-Fidelity Functional Ultrasound Synthesis

Zhuo Li, Xuhang Chen, Shuqiang Wang, Bin Yuan, Nou Sotheany, Ngeth Rithea

Main category: eess.IV

TL;DR: GAN-based framework for generating realistic functional ultrasound (fUS) neuroimaging data to address data scarcity issues in clinical applications.

DetailsMotivation: Functional ultrasound faces challenges with data scarcity and limitations in generating realistic images, hindering its clinical applications in neonatal monitoring and intraoperative guidance.

Method: Proposes a generative adversarial network (GAN) framework tailored to fUS synthesis, with architectural enhancements, including feature-enhancement modules and normalization techniques, aimed at improving image fidelity and physiological plausibility.
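
As an illustration of the kind of enhancement described, here is a hedged sketch of a residual generator block that pairs a normalization step with a channel-attention "feature enhancement" branch; the actual UltraDfeGAN modules are not specified in this summary, so the layout below is an assumption.

```python
import torch
import torch.nn as nn

class DetailEnhanceBlock(nn.Module):
    """Residual block with normalization plus a channel-attention gate used as a
    feature-enhancement branch (illustrative layout, not the paper's modules)."""
    def __init__(self, channels=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.InstanceNorm2d(channels),              # normalization step
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.InstanceNorm2d(channels),
        )
        self.gate = nn.Sequential(                    # channel attention over the residual
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        h = self.body(x)
        return x + h * self.gate(h)                   # detail-weighted residual connection

# A generator could stack such blocks between an encoder and decoder, e.g.:
g = nn.Sequential(nn.Conv2d(1, 64, 3, padding=1),
                  DetailEnhanceBlock(), DetailEnhanceBlock(),
                  nn.Conv2d(64, 1, 3, padding=1), nn.Tanh())
print(g(torch.randn(1, 1, 128, 128)).shape)           # torch.Size([1, 1, 128, 128])
```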

Result: The framework outperforms existing generative models, producing high-quality fUS images under various conditions and improving classification accuracy when used for data augmentation in downstream tasks.

Conclusion: The GAN-based approach effectively addresses fUS data limitations and shows promise for enhancing clinical applications through realistic synthetic image generation.

Abstract: Functional ultrasound (fUS) is a neuroimaging technique known for its high spatiotemporal resolution, enabling non-invasive observation of brain activity through neurovascular coupling. Despite its potential in clinical applications such as neonatal monitoring and intraoperative guidance, the development of fUS faces challenges related to data scarcity and limitations in generating realistic fUS images. This paper explores the use of a generative adversarial network (GAN) framework tailored for fUS image synthesis. The proposed method incorporates architectural enhancements, including feature enhancement modules and normalization techniques, aiming to improve the fidelity and physiological plausibility of generated images. The study evaluates the performance of the framework against existing generative models, demonstrating its capability to produce high-quality fUS images under various experimental conditions. Additionally, the synthesized images are assessed for their utility in downstream tasks, showing improvements in classification accuracy when used for data augmentation. Experimental results are based on publicly available fUS datasets, highlighting the framework’s effectiveness in addressing data limitations.

[437] A Versatile Pathology Co-pilot via Reasoning Enhanced Multimodal Large Language Model

Zhe Xu, Ziyi Liu, Junlin Hou, Jiabo Ma, Cheng Jin, Yihui Wang, Zhixuan Chen, Zhengyu Zhang, Fuxiang Huang, Zhengrui Guo, Fengtao Zhou, Yingxue Xu, Xi Wang, Ronald Cheong Kin Chan, Li Liang, Hao Chen

Main category: eess.IV

TL;DR: SmartPath-R1 is a multimodal LLM for pathology that handles both ROI-level and WSI-level tasks without requiring chain-of-thought annotations, using scale-dependent fine-tuning and reinforcement learning to achieve robust reasoning across 72 different pathology tasks.

DetailsMotivation: Current MLLMs in pathology have limited reasoning capabilities due to expensive chain-of-thought annotations and are restricted to simple ROI-level VQA tasks, failing to address the full spectrum of clinical diagnostic needs including classification, detection, segmentation, and WSI-level analysis.

Method: Combines scale-dependent supervised fine-tuning with task-aware reinforcement fine-tuning, leveraging the MLLM's intrinsic knowledge to avoid chain-of-thought supervision. A mixture-of-experts mechanism enables multiscale, multitask analysis. Trained on a large-scale dataset of 2.3M ROI samples and 188K WSI samples.
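
A minimal sketch of the mixture-of-experts routing idea (a generic gated MoE layer, not SmartPath-R1's actual multiscale/multitask expert design): a gating network weights a few expert MLPs per token and combines their outputs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoE(nn.Module):
    """Minimal mixture-of-experts layer: a linear gate produces softmax weights
    over the experts, and each token's output is the weighted sum of the expert
    outputs. Illustrative of the routing mechanism only."""
    def __init__(self, dim=256, num_experts=4):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim * 2), nn.GELU(), nn.Linear(dim * 2, dim))
            for _ in range(num_experts))

    def forward(self, x):                              # x: (batch, tokens, dim)
        weights = F.softmax(self.gate(x), dim=-1)      # (B, T, E) routing weights
        outs = torch.stack([e(x) for e in self.experts], dim=-1)  # (B, T, D, E)
        return torch.einsum("btde,bte->btd", outs, weights)

print(SimpleMoE()(torch.randn(2, 10, 256)).shape)      # torch.Size([2, 10, 256])
```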

Result: Extensive experiments across 72 tasks validate the effectiveness and superiority of the approach. The model demonstrates robust pathological reasoning capability for both ROI-level and WSI-level tasks simultaneously.

Conclusion: SmartPath-R1 represents a significant step toward developing versatile, reasoning-enhanced AI systems for precision pathology by overcoming limitations of current MLLM approaches and enabling comprehensive diagnostic analysis.

Abstract: Multimodal large language models (MLLMs) have emerged as powerful tools for computational pathology, offering unprecedented opportunities to integrate pathological images with language context for comprehensive diagnostic analysis. These models hold particular promise for automating complex tasks that traditionally require expert interpretation of pathologists. However, current MLLM approaches in pathology demonstrate significantly constrained reasoning capabilities, primarily due to their reliance on expensive chain-of-thought annotations. Additionally, existing methods remain limited to simplex application of visual question answering (VQA) at the region-of-interest (ROI) level, failing to address the full spectrum of diagnostic needs such as ROI classification, detection, segmentation, whole-slide-image (WSI) classification and VQA in clinical practice. In this study, we present SmartPath-R1, a versatile MLLM capable of simultaneously addressing both ROI-level and WSI-level tasks while demonstrating robust pathological reasoning capability. Our framework combines scale-dependent supervised fine-tuning and task-aware reinforcement fine-tuning, which circumvents the requirement for chain-of-thought supervision by leveraging the intrinsic knowledge within MLLM. Furthermore, SmartPath-R1 integrates multiscale and multitask analysis through a mixture-of-experts mechanism, enabling dynamic processing for diverse tasks. We curate a large-scale dataset comprising 2.3M ROI samples and 188K WSI samples for training and evaluation. Extensive experiments across 72 tasks validate the effectiveness and superiority of the proposed approach. This work represents a significant step toward developing versatile, reasoning-enhanced AI systems for precision pathology.

[438] Uncertainty-Aware Learning Policy for Reliable Pulmonary Nodule Detection on Chest X-Ray

Hyeonjin Choi, Jinse Kim, Dong-yeon Yoo, Ju-sung Sun, Jung-won Lee

Main category: eess.IV

TL;DR: Proposes an Uncertainty-Aware Learning Policy that improves medical AI's diagnostic accuracy by learning physicians' background knowledge alongside chest X-ray lesion data, achieving 92% (IoU 0.2 / FPPI 2), a 10% sensitivity improvement, and reduced predictive entropy.

DetailsMotivation: Physicians' trust in medical AI is limited by concerns about diagnostic uncertainty. Unlike physicians, who draw on extensive background knowledge and clinical experience, medical AI relies solely on repetitive learning of target lesions, leading to knowledge deficiency and uncertain diagnoses.

Method: An Uncertainty-Aware Learning Policy that learns physicians' background knowledge in addition to chest X-ray lesion information, trained and evaluated on 2,517 lesion-free and 656 nodule images from Ajou University Hospital.
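
The entropy figure reported below is the standard predictive-entropy uncertainty measure; here is a short sketch of how it is computed from class probabilities (illustrative only, not the authors' code).

```python
import numpy as np

def predictive_entropy(probs: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Shannon entropy of per-sample class probabilities (lower = more certain)."""
    probs = np.clip(probs, eps, 1.0)
    return -(probs * np.log(probs)).sum(axis=-1)

# Two nodule-vs-no-nodule predictions: a confident one and an uncertain one.
p = np.array([[0.95, 0.05],
              [0.55, 0.45]])
print(predictive_entropy(p))   # ~[0.199, 0.688] nats; the confident case scores lower
```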

Result: The model achieved 92% (IoU 0.2 / FPPI 2), a 10% improvement in sensitivity over the baseline, and reduced entropy, used as the uncertainty measure, by 0.2.

Conclusion: The proposed approach successfully addresses medical AI’s knowledge deficiency by incorporating physicians’ background knowledge, improving diagnostic accuracy while reducing uncertainty, which could enhance physician trust and clinical adoption.

Abstract: Early detection and rapid intervention of lung cancer are crucial. Nonetheless, ensuring an accurate diagnosis is challenging, as physicians’ ability to interpret chest X-rays varies significantly depending on their experience and degree of fatigue. Although medical AI has been rapidly advancing to assist in diagnosis, physicians’ trust in such systems remains limited, preventing widespread clinical adoption. This skepticism fundamentally stems from concerns about its diagnostic uncertainty. In clinical diagnosis, physicians utilize extensive background knowledge and clinical experience. In contrast, medical AI primarily relies on repetitive learning of the target lesion to generate diagnoses based solely on that data. In other words, medical AI does not possess sufficient knowledge to render a diagnosis, leading to diagnostic uncertainty. Thus, this study suggests an Uncertainty-Aware Learning Policy that can address the issue of knowledge deficiency by learning the physicians’ background knowledge alongside the Chest X-ray lesion information. We used 2,517 lesion-free images and 656 nodule images, all obtained from Ajou University Hospital. The proposed model attained 92% (IoU 0.2 / FPPI 2) with a 10% enhancement in sensitivity compared to the baseline model while also decreasing entropy as a measure of uncertainty by 0.2.

Last updated: 2025-08-22
Built with Hugo; theme modified from Stack.